Main
2-chi-2 is a statistical method to search for interactions in binary traits. From this web page you can download a free, open-source tool designed to perform these analyses. It is fast enough for genome-wide association studies (GWAS). In addition, several free data sets can be downloaded to test the power of 2-chi-2 and of other statistical methods.
The 2-chi-2 method, the software, and all the data were developed at the Grup de Recerca de Reumatologia (GRR), a research group of the Institut de Recerca de l'Hospital Universitari Vall d'Hebron (VHIR).
2-chi-2 method
2-chi-2 is based on the definition of two vectors. Suppose we have two contingency tables, one giving the genotype distribution for cases and the other for controls. Then, for each table we can define a nine-component vector as,
$$v_i = \sqrt{\frac{1}{1/n_{cases} + 1/n_{controls}}}\;\frac{p_i - ep_i}{\sqrt{ep_i}}$$
where $p_i$ and $ep_i$ are respectively the probability and expected probability of table cell $i$, and $n_{cases}$ and $n_{controls}$ are the number of individuals in each table. Then, the statistic is defined as the square of the length of the difference between these two vectors,
$$\sum_i \left(v_{1,i} - v_{2,i}\right)^2$$
In the last equation $v_{1,i}$ and $v_{2,i}$ represent the vector components of cell $i$ for the case and control tables respectively. The sum is over all cells. Since each vector measures the shape and significance of the interaction in its table, their difference measures how the interaction differs between the two tables. This statistic can be generalized. Thus, the same procedure can be used to compare the statistical independence between two contingency tables of any dimension.
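The definition above can be sketched in a few lines of Python. This is only an illustration, not the code of the distributed tool: the function name `two_chi_two` is hypothetical, the tables are taken to be 3x3 genotype counts, and the expected probability of each cell is assumed to be the product of the table's row and column marginal probabilities (independence of the two loci), which the text above does not state explicitly. No p-value is computed.

```python
import math

def two_chi_two(cases, controls):
    """Return the 2-chi-2 statistic for two 3x3 tables of genotype counts.

    Assumption: the expected probability of a cell is the product of the
    row and column marginal probabilities of its own table.
    """
    n_cases = sum(map(sum, cases))
    n_controls = sum(map(sum, controls))
    # Common prefactor sqrt(1 / (1/n_cases + 1/n_controls)).
    scale = math.sqrt(1.0 / (1.0 / n_cases + 1.0 / n_controls))

    def vector(table):
        n = sum(map(sum, table))
        row_p = [sum(row) / n for row in table]
        col_p = [sum(col) / n for col in zip(*table)]
        v = []
        for r, row in enumerate(table):
            for c, count in enumerate(row):
                p = count / n
                ep = row_p[r] * col_p[c]  # fails if a marginal is zero
                v.append(scale * (p - ep) / math.sqrt(ep))
        return v

    v1, v2 = vector(cases), vector(controls)
    return sum((a - b) ** 2 for a, b in zip(v1, v2))

# Made-up counts, for illustration only.
independent = [[10, 20, 10], [20, 40, 20], [10, 20, 10]]
interacting = [[20, 10, 10], [10, 40, 30], [10, 30, 20]]
statistic = two_chi_two(interacting, independent)
print(statistic)
```

When both tables have the same shape of interaction (for example, when they are identical), the two vectors coincide and the statistic is zero.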
2-chi-2 software
Software that applies our statistical method to a set of SNPs can be freely downloaded from this link, either as precompiled binaries or as source code. To compile the code you need a C++ compiler, the standard libraries, and the Boost C++ libraries properly installed. As input, the code uses the data file formats of the PLINK tool. In particular, the data can be given as PED files or binary PED files.
The basic usage is,
2chi2 --input <file> --output <file>
The available options are,
| --b | | Must be used when the input file is a binary PED |
| --threshold | | Sets the significance threshold. Only results more significant than this threshold are stored. (default value is 0.01) |
The output file has the following structure,
| SNP1 | SNP2 | P | EV<5 |
| rs3748597 | rs4970405 | 1.0e-3 | 0 |
| rs3748597 | rs4648764 | 4.2e-5 | 1 |
| rs3748597 | rs6603811 | 7.1e-6 | 0 |
| rs4970405 | rs7531583 | 1.1e-2 | 2 |
| ... | ... | ... | ... |
The first two columns identify the two interacting SNPs, and the third column gives the obtained p-value. The last column gives the number of cells with an expected number of individuals below 5. This is important to take into account, since sparsely populated cells can lead to overestimating the significance. Ideally this value should be 0. Otherwise, the result should be studied further, for instance by checking whether the high significance comes from the vector components computed on the underpopulated cells.
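As an illustration of the check described above, the following Python sketch separates results with no sparse cells from those that need a closer look. It assumes the output file is whitespace-delimited with a header line starting with `SNP1`; the actual delimiter is not documented here, and the function names are hypothetical.

```python
def read_results(lines):
    """Parse rows of the form 'SNP1 SNP2 P EV<5' into dictionaries."""
    rows = []
    for line in lines:
        fields = line.split()
        if len(fields) != 4 or fields[0] == "SNP1":
            continue  # skip blank or malformed lines and the header
        snp1, snp2, p, sparse = fields
        rows.append({"snp1": snp1, "snp2": snp2,
                     "p": float(p), "sparse_cells": int(sparse)})
    return rows

def split_by_sparsity(rows):
    """Return (clean, suspect): rows with EV<5 equal to 0, and the rest."""
    clean = [r for r in rows if r["sparse_cells"] == 0]
    suspect = [r for r in rows if r["sparse_cells"] > 0]
    return clean, suspect

# The example rows from the table above, assumed whitespace-delimited.
sample = """SNP1 SNP2 P EV<5
rs3748597 rs4970405 1.0e-3 0
rs3748597 rs4648764 4.2e-5 1
rs3748597 rs6603811 7.1e-6 0
rs4970405 rs7531583 1.1e-2 2
""".splitlines()

clean, suspect = split_by_sparsity(read_results(sample))
for row in suspect:
    print(row["snp1"], row["snp2"], row["p"], row["sparse_cells"])
```

Pairs in `suspect` are not necessarily false positives; they are the ones whose significance may be driven by underpopulated cells.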
Data for the 48 basic epistatic models
The data is divided into two files. The file models48.dat contains the definition of each epistatic model. The file data48.dat contains the data sets generated by applying different biological parameters to each epistatic model. These files can be downloaded here. Their format is as follows:
models48.dat
The file has the following structure,
| Model | G1 | G2 | G3 | G4 | G5 | G6 | G7 | G8 | G9 |
| M1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| M2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| M3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
The first column defines the model number. The other columns define each one of the 9 genotypic configurations of a two-locus genotype. These define the odds between cases and controls, and their values can be 0 or 1. 0 indicates that the odds is α, whereas 1 must be substituted by α(1+θ). For instance, the model M2 is defined by the odds table,
| | aa | Aa | AA |
| bb | α | α | α |
| Bb | α | α | α |
| BB | α | α(1+θ) | α |
The values of α and θ will depend on the biological parameters used by the model (see paper).
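The substitution rule can be written as a short Python sketch. The function name `odds_table` is hypothetical, and the cell ordering (G1 to G9 read row by row, with rows bb, Bb, BB and columns aa, Aa, AA) is inferred from the M2 example above rather than stated by the file format. The values of α and θ used here are arbitrary placeholders.

```python
def odds_table(model_row, alpha, theta):
    """Map the nine 0/1 flags G1..G9 of a model to a 3x3 table of odds.

    Assumed ordering: row-major, rows bb, Bb, BB and columns aa, Aa, AA.
    """
    if len(model_row) != 9:
        raise ValueError("a model row must have nine genotype flags")
    odds = []
    for g in model_row:
        if g == 0:
            odds.append(alpha)                # baseline odds
        elif g == 1:
            odds.append(alpha * (1 + theta))  # increased odds
        else:
            raise ValueError("flags must be 0 or 1")
    return [odds[0:3], odds[3:6], odds[6:9]]

# Model M2 from the table above, with placeholder parameter values.
m2 = odds_table([0, 0, 0, 0, 0, 0, 0, 1, 0], alpha=0.5, theta=1.0)
for row in m2:
    print(row)
```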
data48.dat
The file has the following structure,
| Parameters | Cases | Controls |
| Model | S1 | S2 | Prev | Odds | G1 | ... | G9 | G1 | ... | G9 |
| M1 | 0.1 | 0.1 | 0.01 | 2 | 0.13 | ... | 0.07 | 0.03 | ... | 0.21 |
| M1 | 0.1 | 0.1 | 0.01 | 3 | 0.27 | ... | 0.02 | 0.14 | ... | 0.02 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
The first five columns give the model number (Model), the two allele frequencies (S1 and S2), the prevalence (Prev), and the odds ratio (Odds) respectively. From column six onwards there are two groups of nine columns, which give the genotype probabilities for cases and controls. These values are computed from the epistatic model and the biological parameters in the first columns.
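A row with this layout could be read as in the Python sketch below. It assumes whitespace-delimited fields (23 per row: five parameters plus two groups of nine probabilities); the real delimiter may differ. The function name is hypothetical, and the probabilities in the sample line are placeholders, not values from the file.

```python
def parse_data48_row(line):
    """Split one data row into its parameters and two 3x3 probability tables."""
    fields = line.split()
    if len(fields) != 23:
        raise ValueError("expected 5 parameters and 18 probabilities")
    s1, s2, prev, odds = (float(x) for x in fields[1:5])
    cases = [float(x) for x in fields[5:14]]
    controls = [float(x) for x in fields[14:23]]

    def as_table(values):
        # Group the nine values G1..G9 into three rows of three.
        return [values[0:3], values[3:6], values[6:9]]

    return {"model": fields[0], "s1": s1, "s2": s2, "prev": prev,
            "odds": odds, "cases": as_table(cases),
            "controls": as_table(controls)}

# Placeholder probabilities, for illustration only.
placeholder = " ".join(["0.1"] * 18)
row = parse_data48_row("M1 0.1 0.1 0.01 2 " + placeholder)
print(row["model"], row["odds"], row["cases"][0])
```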
Data for the 2488 extended epistatic models
The data is divided into two files. The file models2488.dat contains the definition of each epistatic model. The file data2488.dat contains the data sets generated by applying different biological parameters to each epistatic model. These files can be downloaded here. Their format is as follows:
models2488.dat
The file has the following structure,
| Model | G1 | G2 | G3 | G4 | G5 | G6 | G7 | G8 | G9 |
| ME1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| ME2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -1 |
| ME3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| ME4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| ME5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | -1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
The first column defines the extended model number. The other columns define each one of the 9 genotypic configurations of a two-locus genotype. These define the odds between cases and controls. We suppose that a genotype configuration can be risk, protective, or neutral. Accordingly, their values can be 1, -1, or 0 respectively. 1 (risk genotype) must be substituted by α(1+θ), -1 (protective genotype) must be substituted by αγ/(1+θ), and 0 (neutral genotype) indicates that the odds is α. For instance, the model ME5 is defined by the odds table,
| | aa | Aa | AA |
| bb | α | α | α |
| Bb | α | α | α |
| BB | α | α(1+θ) | αγ/(1+θ) |
The values of α and θ will depend on the biological parameters used by the model (see paper).
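For the extended models, the three-valued substitution can be sketched in Python as follows. The function name `extended_odds_table` is hypothetical, the row-major cell ordering (rows bb, Bb, BB and columns aa, Aa, AA) is inferred from the ME5 example above, and the parameter values are arbitrary placeholders.

```python
def extended_odds_table(model_row, alpha, theta, gamma):
    """Map the nine flags (1, -1 or 0) of an extended model to a 3x3 odds table."""
    if len(model_row) != 9:
        raise ValueError("a model row must have nine genotype flags")

    def cell(flag):
        if flag == 1:       # risk genotype
            return alpha * (1 + theta)
        if flag == -1:      # protective genotype
            return alpha * gamma / (1 + theta)
        if flag == 0:       # neutral genotype
            return alpha
        raise ValueError("flags must be 1, -1 or 0")

    odds = [cell(g) for g in model_row]
    return [odds[0:3], odds[3:6], odds[6:9]]

# Model ME5 from the table above, with placeholder parameter values.
me5 = extended_odds_table([0, 0, 0, 0, 0, 0, 0, 1, -1],
                          alpha=1.0, theta=1.0, gamma=0.5)
for row in me5:
    print(row)
```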
data2488.dat
The file has the following structure,
| Parameters | Cases | Controls |
| Model | S1 | S2 | Prev | R | Odds | G1 | ... | G9 | G1 | ... | G9 |
| ME1 | 0.1 | 0.1 | 0.01 | 0.01 | 2 | 0.13 | ... | 0.07 | 0.03 | ... | 0.21 |
| ME1 | 0.1 | 0.1 | 0.01 | 0.01 | 3 | 0.27 | ... | 0.02 | 0.14 | ... | 0.02 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |
The first six columns give the model number (Model), the two allele frequencies (S1 and S2), the prevalence (Prev), the γ parameter (R), and the odds ratio (Odds) respectively. From column seven onwards there are two groups of nine columns, which give the genotype probabilities for cases and controls. These values are computed from the epistatic model and the biological parameters in the first columns.
Download
Note: To compile the source you need the Boost C++ libraries properly installed. Before compiling, please edit the Makefile and make the appropriate changes.
Note: To use the MPI-parallelized version, you must have the appropriate MPI libraries and you must compile from source.