CNstream tutorial
Input files for the CNstream R package are all in text format. Only one intensity file is required to run CNstream, and it can be easily extracted from the Illumina software GenomeStudio. CNstream also accepts two optional input files that specify the status of the samples (Case/Control) and the plate number where each sample was genotyped.
1.1. CNstream input signal intensity file
The input signal intensity file is a text file that contains information about each probe (name, chromosome and basepair position) together with the channel X and Y intensities of each sample. Each line corresponds to one microarray probe, and the lines must be sorted by chromosome and basepair position.
Name | Chr | Position | SAMPLE1.X | SAMPLE1.Y | SAMPLE2.X | SAMPLE2.Y | SAMPLE3.X | SAMPLE3.Y | … |
rs1545536 | 8 | 144714312 | 0,02148205 | 0,6833881 | 0,2203226 | 0,3893139 | 0,4117542 | 0,4549562 | … |
rs10099003 | 8 | 144716603 | 1,52746 | 1,342773 | 2,320188 | 0,08941685 | 2,463521 | 0,08725205 | … |
rs896946 | 8 | 144719785 | 0,02898269 | 2,140413 | 0,06641929 | 1,785213 | 0,04940656 | 2,085741 | … |
rs11775744 | 8 | 144720335 | 0,01787181 | 0,4492684 | 0,02162173 | 0,249042 | 0,01381066 | 0,3286062 | … |
rs1809148 | 8 | 144734218 | 0,7682863 | 0,6901037 | 0,5261341 | 0,4613593 | 0,02988189 | 1,034436 | … |
rs2123758 | 8 | 144734804 | 0,5999851 | 0,6547657 | 0,3712394 | 0,4563591 | 1,006618 | 0,114436 | … |
rs3793371 | 8 | 144735442 | 0,0316033 | 1,599119 | 0,0900638 | 1,305015 | 0,07485583 | 1,708853 | … |
rs3793368 | 8 | 144735737 | 1,244978 | 0,0511517 | 1,1884 | 0,04798501 | 1,241068 | 0,04858789 | … |
rs2382962 | 8 | 144740535 | 1,299303 | 0,06504522 | 0,5841687 | 0,6120613 | 1,179762 | 0,02839759 | … |
rs4874159 | 8 | 144742093 | 1,517084 | 0,02558302 | 0,5202222 | 0,9074748 | 1,531161 | 0,05225999 | … |
… | … | … | … | … | … | … | … | … | … |
This file can be easily extracted from GenomeStudio once the project has been created and all the samples that we want to analyze have been processed and included in it. The GenomeStudio project window should look like the one below; from now on, we will be working in the “Full Data Table” tab:
The first step is to select the fields that will be shown in the table. To do this, click the “Column chooser” button inside the “Full Data Table” tab, include the fields “X” and “Y” in the “Displayed Subcolumns” box, and include the fields “Name”, “Chr” and “Position”, together with all the sample IDs that we want to analyze, in the “Displayed Columns” box:
Then click “OK” to close the column chooser and click the “Sort by multiple columns” button. Select “Chr” as the first field and “Position” as the second field.
If we only want to analyze one genome region, we can use the “Filter rows” button. In our case, we have selected the chromosome 8 region 15,400,000-15,500,000, as shown in the figure below.
Once we have selected the correct fields and sorted the probes by chromosome and basepair position, we can create the input signal intensity file for CNstream. Click the “Export displayed data to a file” button and save the file in your working directory.
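Before passing the exported file to CNstream, its layout can be checked in R. The following is a minimal sketch, not part of CNstream: the helper name check_intensity_file is hypothetical, and it assumes a tab-delimited export with decimal commas (as in the table above) and numeric chromosome labels.

```r
# Hypothetical helper (not part of CNstream): checks the layout described
# in section 1.1. Assumes a tab-delimited file with decimal commas.
check_intensity_file <- function(path, sep = "\t", dec = ",") {
  x <- read.table(path, header = TRUE, sep = sep, dec = dec,
                  stringsAsFactors = FALSE, check.names = FALSE)
  # The first three columns must describe the probe.
  stopifnot(identical(names(x)[1:3], c("Name", "Chr", "Position")))
  # Every sample must contribute one .X and one .Y column.
  sample_cols <- names(x)[-(1:3)]
  ids_x <- sub("\\.X$", "", grep("\\.X$", sample_cols, value = TRUE))
  ids_y <- sub("\\.Y$", "", grep("\\.Y$", sample_cols, value = TRUE))
  stopifnot(length(sample_cols) == 2 * length(ids_x),
            identical(ids_x, ids_y))
  # Probes must be sorted by chromosome, then by basepair position.
  sorted <- identical(order(x$Chr, x$Position), seq_len(nrow(x)))
  list(n_probes = nrow(x), samples = ids_x, sorted = sorted)
}

# Toy file built from the first two probes and two samples of the table.
tmp <- tempfile(fileext = ".txt")
writeLines(c(
  "Name\tChr\tPosition\tSAMPLE1.X\tSAMPLE1.Y\tSAMPLE2.X\tSAMPLE2.Y",
  "rs1545536\t8\t144714312\t0,02148205\t0,6833881\t0,2203226\t0,3893139",
  "rs10099003\t8\t144716603\t1,52746\t1,342773\t2,320188\t0,08941685"
), tmp)
res <- check_intensity_file(tmp)
print(res)
```

If the export uses a different delimiter or decimal mark, the sep and dec arguments would need to be adjusted accordingly.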
TIP: Depending on the number of samples and the array used in your project, GenomeStudio can take a long time to create this file. To reduce this time considerably, sort the probes by the “Index” field and, once the file has been created, use a Python or Perl script to sort the probes by chromosome and basepair position. To create the file sorted by “Index”, select the field “Index” in the “Column chooser” panel so that it is displayed in the table, then use the “Sort by multiple columns” panel to sort the probes by “Index”. Once the probes have been sorted, go back to the “Column chooser” panel and remove the field “Index” from the “Displayed Columns” box so that this column is dropped from the table.
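The re-sorting step mentioned in the tip can also be done in R instead of Python or Perl. The sketch below rests on a few assumptions: the file is tab-delimited, the function name sort_probes is hypothetical, and non-numeric chromosome labels (X, Y, XY, MT) are placed after the autosomes. All fields are read as text so that the intensity values are written back exactly as exported.

```r
# Hypothetical helper: sorts an intensity file by chromosome and position.
sort_probes <- function(infile, outfile, sep = "\t") {
  # Read every field as text so intensity values are written back unchanged.
  x <- read.table(infile, header = TRUE, sep = sep, colClasses = "character",
                  check.names = FALSE, quote = "", comment.char = "")
  # Numeric chromosomes keep their value; X, Y, XY and MT follow the
  # autosomes, and any other label goes last (assumed ordering).
  extra <- c(X = 23, Y = 24, XY = 25, MT = 26)
  chr_key <- suppressWarnings(as.numeric(x$Chr))
  is_extra <- is.na(chr_key) & x$Chr %in% names(extra)
  chr_key[is_extra] <- extra[x$Chr[is_extra]]
  chr_key[is.na(chr_key)] <- Inf
  ord <- order(chr_key, as.numeric(x$Position))
  write.table(x[ord, ], outfile, sep = sep, quote = FALSE, row.names = FALSE)
  invisible(outfile)
}

# Toy input in the wrong order (the chromosome 2 intensities are made up).
tmp_in <- tempfile(fileext = ".txt")
tmp_out <- tempfile(fileext = ".txt")
writeLines(c(
  "Name\tChr\tPosition\tSAMPLE1.X\tSAMPLE1.Y",
  "rs896946\t8\t144719785\t0,02898269\t2,140413",
  "rs6545625\t2\t57304896\t0,5\t0,5",
  "rs1545536\t8\t144714312\t0,02148205\t0,6833881"
), tmp_in)
sort_probes(tmp_in, tmp_out)
sorted_lines <- readLines(tmp_out)
print(sorted_lines)
```

For very large exports, reading the whole table into memory may not be practical, in which case a streaming sort outside R could be preferable.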
1.2. CNstream input status file
This file is only mandatory if we also want to perform a Case/Control analysis of disease risk association. If we provide this file, CNstream will perform a chi-square test for each CNP segment and will include some informative fields, such as the P-value, in the results file. The status file is a plain text file where each line corresponds to the status of one sample (0 for controls and 1 for cases). The samples must be listed in the same order as in the input signal intensity file: the first line of the status file corresponds to the first sample in the signal intensity file, and so on.
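Because the status file carries no sample IDs, a mismatch in order would go unnoticed. One way to reduce that risk is to generate the file from the intensity file header. The sketch below is illustrative only: the phenotype table pheno and its columns id and status are hypothetical, and the header is typed in by hand here rather than read from a real export.

```r
# Sample IDs in the order they appear in the intensity file header
# (every sample has a .X and a .Y column; keep each ID once).
header <- c("Name", "Chr", "Position",
            "SAMPLE1.X", "SAMPLE1.Y", "SAMPLE2.X", "SAMPLE2.Y",
            "SAMPLE3.X", "SAMPLE3.Y")
sample_ids <- unique(sub("\\.[XY]$", "", header[-(1:3)]))

# Hypothetical phenotype table, deliberately in a different order.
pheno <- data.frame(id = c("SAMPLE3", "SAMPLE1", "SAMPLE2"),
                    status = c(1, 0, 1))

# Reorder to match the intensity file and fail early on missing samples.
idx <- match(sample_ids, pheno$id)
stopifnot(!anyNA(idx))
status <- pheno$status[idx]
stopifnot(all(status %in% c(0, 1)))  # 0 = control, 1 = case

# Write one status value per line, in intensity-file order.
status_file <- tempfile(fileext = ".txt")
writeLines(as.character(status), status_file)
print(readLines(status_file))
```

With a real export, the header vector could be obtained with strsplit(readLines(path, n = 1), "\t")[[1]], assuming a tab-delimited file.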
1.3. CNstream input plate file
This file is only mandatory if we want to perform plate normalization. It follows the same rules as the input status file but, in this case, each line corresponds to the plate number of the sample. Plate number “
Once the CNstream package has been installed and loaded, we can start using it by calling the main CNstream function CN.stream(). Typing the command help(CN.stream) gives detailed information about the function. The optional parameters of CNstream have been optimized for Illumina array data, so we can run it without modifying them. The basic analysis includes neither plate normalization nor the association test, so only the input signal intensity file and the path where the results will be saved are required:
> data <- "http://www.urr.cat/cnv/data/ex.txt"
> output_dir <- "C:/"
> CN.stream(data, output_dir)
********************************************************
********************************************************
Analysis started at Thu Sep 1 13:36:06 2009
Input data file (X, Y): http://www.urr.cat/cnv/data/ex.txt
Copy number scores: C://CNstream_scores.txt
Segment calls: C://CNstream_CNPseg.txt
Number of samples: 572
********************************************************
********************************************************
If we want to include the plate normalization file and the status file:
> data <- "http://www.urr.cat/cnv/data/ex.txt"
> output_dir <- "C:/"
> plate <- "http://www.urr.cat/cnv/data/plate.txt"
> status <- "http://www.urr.cat/cnv/data/status_all.txt"
> CN.stream(data, output_dir, norm_plate = plate, status = status)
Once the analysis has finished, the results are saved in the output directory (in this case, “C:/”):
CNstream_scores.txt | Single-locus scores for each sample at each probe |
CNstream_CNPseg.txt | CNV segment calls for each sample and other relevant information, such as the percentages of amplifications and deletions, and the P-value and the OR when the status file is provided |
A useful option when we are analyzing a small number of probes is verbose = 1. With this option as an input parameter, CNstream displays the genotyping results and the single-locus scoring results for each probe on screen:
> CN.stream(data, output_dir, norm_plate = plate, status = status, verbose = 1)
Processing probe number 0 : rs6545625 in chromosome 2 basepair 57304896
GENOTYPING
Samples per genotype: 134 , 228 , 126
Plot OK... Press ENTER to continue.
CNV SCORING
Plot channel A OK... Press ENTER to continue.
Plot channel B OK... Press ENTER to continue.
Processing probe number 1 : rs1424627 in chromosome 2 basepair 57307915
GENOTYPING
Samples per genotype: 303 , 157 , 28
Plot OK... Press ENTER to continue.
CNV SCORING
Plot channel A OK... Press ENTER to continue.
Plot channel B OK... Press ENTER to continue.
Processing probe number 2 : rs7576091 in chromosome 2 basepair 57308888
GENOTYPING
Samples per genotype: 39 , 189 , 259
Plot OK... Press ENTER to continue.
CNV SCORING
Plot channel A OK... Press ENTER to continue.
Plot channel B OK... Press ENTER to continue.
CNstream default parameters can be modified through the input argument option, which is defined as a list with the following parameters. We only need to specify the fields that we wish to change.
option$segment_length | Maximum segment length allowed in Kb (default = 100) |
option$nmarkers | Number of probes per segment (default = 5) |
option$minmarkers | Minimum number of probes in one segment that must exceed the amplification/deletion threshold for calling an amplification/deletion (default = 3) |
option$fr | CNV frequency threshold. Only segments exceeding this frequency will be saved (default = 1%) |
option$LI | Deletion threshold (default = 1.65) |
option$LS | Amplification threshold (default = 2.7) |
option$pdfplot | If pdfplot = 1, detailed figures corresponding to the CNP segments will be saved in the output path as PDF documents (default = -1) |
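To make the roles of LI, LS and minmarkers more concrete, the sketch below applies them to the scores of one sample across the probes of one segment. It is a simplified reading of the parameter descriptions above, not CNstream's actual algorithm: the function call_segment is hypothetical, and it assumes that a probe supports a deletion when its score is below LI and an amplification when its score is above LS.

```r
# Hypothetical illustration of the threshold rule (not CNstream's code).
call_segment <- function(scores, LI = 1.65, LS = 2.7, minmarkers = 3) {
  n_del <- sum(scores < LI)  # probes below the deletion threshold
  n_amp <- sum(scores > LS)  # probes above the amplification threshold
  if (n_del >= minmarkers) {
    "deletion"
  } else if (n_amp >= minmarkers) {
    "amplification"
  } else {
    "normal"
  }
}

# Five probes per segment, matching the default nmarkers = 5.
call_a <- call_segment(c(2, 2.33, 2, 2, 2.02))       # no probe crosses a threshold
call_b <- call_segment(c(1.2, 1.4, 2.0, 1.5, 1.9))   # three probes below LI
call_c <- call_segment(c(2.9, 3.1, 2.0, 1.5, 1.9))   # only two probes above LS
print(c(call_a, call_b, call_c))
```

Under these assumptions, raising minmarkers makes calls more conservative, since more probes in the same segment have to agree before an event is reported.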
Example: If we want the CNP segment figures to be saved, we proceed as follows:
> option <- list(pdfplot = 1)
> CN.stream(data, output_dir, norm_plate = plate, status = status, option = option)
For each CNP segment a PDF file will be created (Example file).
CNstream creates two output files that summarize the CNP analysis results; they are saved in the path specified by the input variable output_dir.
⇒ “CNstream_scores.txt”
Contains the scores assigned to each sample at each locus. These scores are computed at the single-locus scoring step and they are subsequently used to call the copy numbers.
Name | Chr | Position | Sample_1 | Sample_2 | Sample_3 | Sample_4 | Sample_5 | Sample_6 | … |
rs6545625 | 2 | 57304896 | 2 | 2,33 | 2 | 2 | 2 | 2,02 | … |
rs1424627 | 2 | 57307915 | 1,99 | 2 | 2 | 2,01 | 2 | 1,97 | … |
rs7576091 | 2 | 57308888 | 2 | 1,96 | 2 | 2 | 1,99 | 1,93 | … |
rs6755060 | 2 | 57339604 | 2 | 1,99 | 2 | 2,02 | 2 | 2 | … |
rs1345941 | 2 | 57347229 | 2 | 2,03 | 2 | 2 | 2 | 2 | … |
rs7604249 | 2 | 57348282 | 2 | 2 | 2 | 1,99 | 2 | 2 | … |
rs4372943 | 2 | 57353504 | 1,99 | 1,99 | 2 | 2 | 1,99 | 1,98 | … |
rs4614972 | 2 | 57357860 | 1,89 | 2 | 2 | 2 | 1,95 | 2 | … |
rs4271799 | 2 | 57395320 | 2 | 2 | 2 | 2 | 2 | 2,11 | … |
… | … | … | … | … | … | … | … | … | … |
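The scores file can be loaded back into R for inspection. The sketch below is only an illustration: it assumes the file is tab-delimited with decimal commas, as the table above suggests, and it treats 2 as the baseline score, which is an assumption drawn from the values shown rather than from the CNstream documentation.

```r
# Toy copy of the first rows and first three samples of the table above.
scores_file <- tempfile(fileext = ".txt")
writeLines(c(
  "Name\tChr\tPosition\tSample_1\tSample_2\tSample_3",
  "rs6545625\t2\t57304896\t2\t2,33\t2",
  "rs1424627\t2\t57307915\t1,99\t2\t2",
  "rs7576091\t2\t57308888\t2\t1,96\t2"
), scores_file)

# Assumed layout: tab-delimited with decimal commas.
scores <- read.table(scores_file, header = TRUE, sep = "\t", dec = ",",
                     stringsAsFactors = FALSE, check.names = FALSE)

# Probes in rows, samples in columns.
m <- as.matrix(scores[, -(1:3)])
rownames(m) <- scores$Name

# Largest departure from the assumed baseline of 2 for each sample.
max_dev <- apply(abs(m - 2), 2, max)
print(round(max_dev, 2))
```

A summary like this may help to spot samples with unusually noisy scores before interpreting the segment calls.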
⇒ “CNstream_CNPseg.txt”
This file contains information about all the probe segments whose copy number frequency exceeds the frequency threshold (default = 1%). Copy number polymorphism regions are summarized here with multiple informative fields, such as the chromosome region (CHROM, BP_INIT, BP_END), the percentage of amplifications and deletions across the samples (%AMPS, %DELS) and the CN assigned to each sample. Deletions are defined as “
CHROM | BP_INIT | BP_END | %AMPS | %DELS | Sample_1 | Sample_2 | Sample_3 | Sample_4 | Sample_5 | … |
2 | 57307915 | 57348282 | 0,020 | 0,000 | 0 | 0 | 0 | 0 | 0 | … |
2 | 57308888 | 57353504 | 0,020 | 0,000 | 0 | 0 | 0 | 0 | 0 | … |
2 | 57339604 | 57357860 | 0,020 | 0,000 | 0 | 0 | 0 | 0 | 0 | … |
2 | 57347229 | 57395320 | 0,018 | 0,000 | 0 | 0 | 0 | 0 | 0 | … |
2 | 57348282 | 57395677 | 0,016 | 0,000 | 0 | 0 | 0 | 0 | 0 | … |
8 | 15435527 | 15453141 | 0,000 | 0,061 | 0 | 0 | 0 | 0 | 0 | … |
8 | 15439515 | 15455979 | 0,000 | 0,078 | 0 | 0 | 0 | 0 | 0 | … |
8 | 15447669 | 15464497 | 0,000 | 0,078 | 0 | 0 | 0 | 0 | 0 | … |
8 | 15450330 | 15467035 | 0,000 | 0,051 | 0 | 0 | 0 | 0 | 0 | … |
19 | 20368239 | 20449621 | 0,000 | 0,094 | 0 | 0 | -1 | 0 | 0 | … |
19 | 20385941 | 20473895 | 0,000 | 0,125 | 0 | 0 | -1 | 0 | 0 | … |
19 | 20423788 | 20520617 | 0,000 | 0,125 | 0 | 0 | -1 | 0 | 0 | … |
19 | 20439390 | 20522325 | 0,000 | 0,082 | 0 | 0 | -1 | 0 | 0 | … |
… | … | … | … | … | … | … | … | … | … | … |
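The segments in this table overlap one another, so several consecutive rows can describe the same underlying region. As a possible post-processing step, overlapping segments on the same chromosome can be merged into contiguous regions. The sketch below is not part of CNstream; the function merge_segments and the N_SEGMENTS column are hypothetical, and the coordinates are copied from the table above.

```r
# Hypothetical helper: merges overlapping segments into contiguous regions.
merge_segments <- function(seg) {
  seg <- seg[order(seg$CHROM, seg$BP_INIT), ]
  # Furthest segment end seen so far within each chromosome.
  max_end <- ave(seg$BP_END, seg$CHROM, FUN = cummax)
  prev_end <- c(-Inf, head(max_end, -1))
  # A new region starts on a new chromosome or after a gap.
  new_chrom <- c(TRUE, seg$CHROM[-1] != seg$CHROM[-nrow(seg)])
  region <- cumsum(new_chrom | seg$BP_INIT > prev_end)
  data.frame(
    CHROM      = as.vector(tapply(seg$CHROM, region, function(v) v[1])),
    BP_INIT    = as.vector(tapply(seg$BP_INIT, region, min)),
    BP_END     = as.vector(tapply(seg$BP_END, region, max)),
    N_SEGMENTS = as.vector(table(region))
  )
}

# Segment coordinates from the table above.
seg <- data.frame(
  CHROM   = c(2, 2, 2, 2, 2, 8, 8, 8, 8, 19, 19, 19, 19),
  BP_INIT = c(57307915, 57308888, 57339604, 57347229, 57348282,
              15435527, 15439515, 15447669, 15450330,
              20368239, 20385941, 20423788, 20439390),
  BP_END  = c(57348282, 57353504, 57357860, 57395320, 57395677,
              15453141, 15455979, 15464497, 15467035,
              20449621, 20473895, 20520617, 20522325)
)
regions <- merge_segments(seg)
print(regions)
```

For the rows shown, this collapses the thirteen segments into one region per chromosome.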
When the input status file is provided, these additional fields are listed:
o P-VALUE: Significance value computed using a chi-square Case/Control association test.
o ODDS RATIO (OR)
o %A_CASES: Percentage of amplifications in cases.
o %D_CASES: Percentage of deletions in cases.
o %A_CONTROLS: Percentage of amplifications in controls.
o %D_CONTROLS: Percentage of deletions in controls.
CHROM | BP_INIT | BP_END | P-Value | OR | %A_CASES | %D_CASES | %A_CONTROLS | %D_CONTROLS | Sample_1 | Sample_2 | Sample_3 | Sample_4 | Sample_5 | … |
2 | 57307915 | 57348282 | 0,199 | 0,448 | 1,493 | 0,000 | 3,268 | 0,000 | 0 | 0 | 0 | 0 | 0 | … |
2 | 57308888 | 57353504 | 0,199 | 0,448 | 1,493 | 0,000 | 3,268 | 0,000 | 0 | 0 | 0 | 0 | 0 | … |
2 | 57339604 | 57357860 | 0,199 | 0,448 | 1,493 | 0,000 | 3,268 | 0,000 | 0 | 0 | 0 | 0 | 0 | … |
2 | 57347229 | 57395320 | 0,114 | 0,358 | 1,194 | 0,000 | 3,268 | 0,000 | 0 | 0 | 0 | 0 | 0 | … |
2 | 57348282 | 57395677 | 0,056 | 0,267 | 0,896 | 0,000 | 3,268 | 0,000 | 0 | 0 | 0 | 0 | 0 | … |
8 | 15435527 | 15453141 | 0,028 | 3,134 | 0,000 | 7,761 | 0,000 | 2,614 | 0 | 0 | 0 | 0 | 0 | … |
8 | 15439515 | 15455979 | 0,012 | 3,234 | 0,000 | 9,851 | 0,000 | 3,268 | 0 | 0 | 0 | 0 | 0 | … |
8 | 15447669 | 15464497 | 0,012 | 3,234 | 0,000 | 9,851 | 0,000 | 3,268 | 0 | 0 | 0 | 0 | 0 | … |
8 | 15450330 | 15467035 | 0,089 | 2,491 | 0,000 | 6,269 | 0,000 | 2,614 | 0 | 0 | 0 | 0 | 0 | … |
19 | 20368239 | 20449621 | 0,005 | 3,322 | 0,000 | 11,940 | 0,000 | 3,922 | 0 | 0 | -1 | 0 | 0 | … |
19 | 20385941 | 20473895 | 0,017 | 2,265 | 0,000 | 14,925 | 0,000 | 7,190 | 0 | 0 | -1 | 0 | 0 | … |
19 | 20423788 | 20520617 | 0,017 | 2,265 | 0,000 | 14,925 | 0,000 | 7,190 | 0 | 0 | -1 | 0 | 0 | … |
19 | 20439390 | 20522325 | 0,007 | 3,453 | 0,000 | 10,448 | 0,000 | 3,268 | 0 | 0 | -1 | 0 | 0 | … |
… | … | … | … | … | … | … | … | … | … | … | … | … | … | … |
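The association fields can be illustrated with a small worked example in R. The counts used here are an assumption: 40 deletions among 335 cases and 6 among 153 controls are back-calculated from the percentages 11,940 and 3,922 shown for the first chromosome 19 segment, and do not come from the CNstream data itself. The use of a Pearson chi-square test without continuity correction is likewise an assumption about which variant of the test is applied.

```r
# Carrier counts for one segment: rows = status, columns = deletion yes/no.
# 40/335 cases and 6/153 controls are assumed counts, back-calculated from
# the percentages shown for the first chromosome 19 segment.
tab <- matrix(c(40, 335 - 40,
                 6, 153 - 6),
              nrow = 2, byrow = TRUE,
              dimnames = list(status = c("case", "control"),
                              deletion = c("yes", "no")))

# Percentage of deletions in cases and in controls.
pct <- 100 * tab[, "yes"] / rowSums(tab)

# Odds ratio: odds of carrying the deletion in cases versus controls.
odds_ratio <- (tab["case", "yes"] / tab["case", "no"]) /
              (tab["control", "yes"] / tab["control", "no"])

# Pearson chi-square test without continuity correction (assumed variant).
p_value <- chisq.test(tab, correct = FALSE)$p.value

print(round(c(pct, OR = odds_ratio, P = p_value), 3))
```

Under these assumptions the rounded values agree with the corresponding table row (OR 3,322 and P-value 0,005), which suggests, but does not prove, that the reported OR compares carriers with non-carriers between cases and controls.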