v1.0

### Main

2-chi-2 is a statistical method for detecting interactions in binary traits. From this web page you can download a free, open-source tool designed to perform these analyses. It is fast enough for genome-wide association studies (GWAS). Several free data sets can also be downloaded to test the power of 2-chi-2 and of other statistical methods.

The 2-chi-2 method, the software, and all the data were developed at the Grup de Recerca de Reumatologia (GRR), a research group of the Institut de Recerca de l'Hospital Universitari Vall d'Hebron (VHIR).

### 2-chi-2 method

2-chi-2 is based on the definition of two vectors. Suppose we have two contingency tables, one defining the genotype distribution for cases and the other for controls. For each table we can define a nine-component vector as

$$v_i = \sqrt{\frac{1}{1/n_{\text{cases}} + 1/n_{\text{controls}}}}\;\frac{p_i - e_i}{\sqrt{e_i}}$$

where $p_i$ and $e_i$ are, respectively, the probability and the expected probability of table cell $i$, and $n_{\text{cases}}$ and $n_{\text{controls}}$ are the numbers of individuals in each table. The statistic is then defined as the squared length of the difference between these two vectors,

$$\sum_i \left(v_{1i} - v_{2i}\right)^2$$

Here $v_{1i}$ and $v_{2i}$ are the vector components of cell $i$ for the case and control tables respectively, and the sum runs over all cells. Since each vector measures the shape and significance of the interaction in its table, their difference measures how much the interaction differs between the two tables. The statistic can be generalized: the same procedure can be used to compare the statistical independence between two contingency tables of any dimension.
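
As a rough illustration of the definition above, the following Python sketch computes the statistic for two 3x3 tables of genotype counts. It assumes that the expected probability of a cell is the product of the table's row and column marginal probabilities (independence between the two loci), which the text does not state explicitly. The function names are hypothetical, and this is not the code of the 2-chi-2 software.

```python
import math


def interaction_vector(table, scale):
    """Return the vector components of one contingency table of counts.

    Assumption: the expected probability of a cell is the product of
    its row and column marginal probabilities. Every marginal must be
    non-zero, otherwise the division below fails.
    """
    total = sum(sum(row) for row in table)
    probs = [[count / total for count in row] for row in table]
    row_marg = [sum(row) for row in probs]
    col_marg = [sum(col) for col in zip(*probs)]
    vector = []
    for r, row in enumerate(probs):
        for c, p in enumerate(row):
            e = row_marg[r] * col_marg[c]
            vector.append(scale * (p - e) / math.sqrt(e))
    return vector


def two_chi_two(cases, controls):
    """Squared length of the difference between the two vectors."""
    n_cases = sum(sum(row) for row in cases)
    n_controls = sum(sum(row) for row in controls)
    scale = math.sqrt(1.0 / (1.0 / n_cases + 1.0 / n_controls))
    v1 = interaction_vector(cases, scale)
    v2 = interaction_vector(controls, scale)
    return sum((a - b) ** 2 for a, b in zip(v1, v2))
```

Because the tables are handled as generic lists of rows, the same sketch applies to tables of other dimensions, in line with the generalization mentioned above.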

### 2-chi-2 software

Software that applies our statistical method to a set of SNPs can be freely downloaded from this link, either as precompiled binaries or as source code. To compile the code you need a C++ compiler, the standard libraries, and the Boost C++ libraries properly installed. As input, the program uses the data file formats of the PLINK tool. In particular, the data can be given as PED files or binary PED files.
The basic usage is,

`2chi2 --input <file> --output <file>`

The available options are,

- `--b`: Must be used when the input file is a binary PED file.
- (option name not given): Sets the significance threshold. Results with higher significance will be stored (default value is 0.01).
The output file has the following structure,

| SNP1 | SNP2 | P | EV<5 |
|---|---|---|---|
| rs3748597 | rs4970405 | 1.0e-3 | 0 |
| rs3748597 | rs4648764 | 4.2e-5 | 1 |
| rs3748597 | rs6603811 | 7.1e-6 | 0 |
| rs4970405 | rs7531583 | 1.1e-2 | 2 |
| ... | ... | ... | ... |

The first two columns identify the two interacting SNPs, and the third column gives the obtained p-value. The last column gives the number of cells with an expected number of individuals below 5. This matters because sparsely populated cells can lead to an overestimated significance. Ideally this value should be 0. Otherwise, the result should be studied further, for instance by checking whether the high significance comes from the vector components computed on the underpopulated cells.
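
The check described above can be sketched in a few lines of Python. This example assumes that the output file is whitespace-separated and starts with a single header line, which the page does not confirm. The function names `read_results` and `needs_review` are hypothetical.

```python
import io


def read_results(handle):
    """Parse result rows of the form: SNP1 SNP2 P EV<5.

    Assumptions: whitespace-separated fields and one header line.
    Lines that do not have exactly four fields are skipped.
    """
    rows = []
    next(handle, None)  # skip the assumed header line
    for line in handle:
        fields = line.split()
        if len(fields) != 4:
            continue
        snp1, snp2, p_value, low_cells = fields
        rows.append({
            "snp1": snp1,
            "snp2": snp2,
            "p": float(p_value),
            "low_cells": int(low_cells),
        })
    return rows


def needs_review(rows):
    """Return the pairs with at least one cell expecting fewer than 5 individuals."""
    return [row for row in rows if row["low_cells"] > 0]


# Illustrative input, not real results.
example = io.StringIO(
    "SNP1 SNP2 P EV<5\n"
    "rsA rsB 1.0e-3 0\n"
    "rsA rsC 4.2e-5 1\n"
)
flagged = needs_review(read_results(example))
```

A real file could be passed the same way with `open(path)` in place of the `io.StringIO` object.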

### Data for the 48 basic epistatic models

The data is divided into two files. The file models48.dat contains the definition of each epistatic model. The file data48.dat contains the data sets generated by applying different biological parameters to each epistatic model. These files can be downloaded here. Their format is as follows:

#### models48.dat

The file has the following structure,

| Model | G1 | G2 | G3 | G4 | G5 | G6 | G7 | G8 | G9 |
|---|---|---|---|---|---|---|---|---|---|
| M1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| M2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| M3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |

The first column gives the model number. The other columns correspond to the 9 genotypic configurations of a two-locus genotype. They define the odds between cases and controls, and their values can be 0 or 1: 0 indicates that the odds is α, whereas 1 must be substituted by α(1+θ). For instance, the model M2 is defined by the odds table,

| | aa | Aa | AA |
|---|---|---|---|
| bb | α | α | α |
| Bb | α | α | α |
| BB | α | α(1+θ) | α |

The values of α and θ depend on the biological parameters used by the model (see paper).
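
The substitution rule can be illustrated with a short Python sketch. It assumes that G1 to G9 fill the odds table row by row (rows bb, Bb, BB and columns aa, Aa, AA), an ordering inferred from the M2 example rather than stated in the text. The function name `odds_table` and the numeric values of α and θ are hypothetical.

```python
def odds_table(model_row, alpha, theta):
    """Build the 3x3 odds table of a basic epistatic model.

    model_row holds the nine 0/1 entries G1..G9 of a models48.dat row.
    Assumption: G1..G9 are laid out row by row, rows bb, Bb, BB and
    columns aa, Aa, AA.
    """
    if len(model_row) != 9:
        raise ValueError("a model needs exactly 9 genotype entries")
    values = []
    for flag in model_row:
        if flag == 0:
            values.append(alpha)                # baseline odds
        elif flag == 1:
            values.append(alpha * (1 + theta))  # increased odds
        else:
            raise ValueError("basic models only use 0 or 1")
    return [values[0:3], values[3:6], values[6:9]]


# Model M2 with illustrative parameter values.
m2 = odds_table([0, 0, 0, 0, 0, 0, 0, 1, 0], alpha=0.5, theta=1.0)
```

With these inputs only the BB/Aa cell differs from α, which matches the M2 table.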

#### data48.dat

The file has the following structure,

| Model | S1 | S2 | Prev | Odds | Cases G1 | ... | Cases G9 | Controls G1 | ... | Controls G9 |
|---|---|---|---|---|---|---|---|---|---|---|
| M1 | 0.1 | 0.1 | 0.01 | 2 | 0.13 | ... | 0.07 | 0.03 | ... | 0.21 |
| M1 | 0.1 | 0.1 | 0.01 | 3 | 0.27 | ... | 0.02 | 0.14 | ... | 0.02 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |

The first five columns (the parameters) give the model number (Model), the two allele frequencies (S1 and S2), the prevalence (Prev), and the odds ratio (Odds), respectively. The remaining columns, from the sixth onwards, form two groups of nine, which define the genotype probabilities for cases and for controls. These values are computed from the epistatic model and the biological parameters in the first columns.
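
A row with this layout could be split into its parts as in the Python sketch below. It assumes whitespace-separated fields, which the page does not confirm. The function name `parse_data_row` and the numbers in the example line are hypothetical, not values taken from data48.dat.

```python
def parse_data_row(line, n_params=5):
    """Split one data row into model, parameters, and genotype probabilities.

    Assumption: fields are separated by whitespace. The first n_params
    fields are the model name followed by numeric parameters; the next
    18 fields are 9 case and 9 control genotype probabilities.
    """
    fields = line.split()
    if len(fields) != n_params + 18:
        raise ValueError("unexpected number of columns")
    model = fields[0]
    params = [float(x) for x in fields[1:n_params]]
    probs = [float(x) for x in fields[n_params:]]
    return model, params, probs[:9], probs[9:]


# Hypothetical row: Model, S1, S2, Prev, Odds, then 9 + 9 probabilities.
example_line = (
    "M1 0.1 0.1 0.01 2 "
    "0.13 0.1 0.1 0.1 0.1 0.1 0.1 0.2 0.07 "
    "0.03 0.1 0.1 0.1 0.1 0.1 0.1 0.16 0.21"
)
model, params, cases, controls = parse_data_row(example_line)
```

The `n_params` argument is there so that the same function can read a layout with a different number of leading parameter columns.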

### Data for the 2488 extended epistatic models

The data is divided into two files. The file models2488.dat contains the definition of each epistatic model. The file data2488.dat contains the data sets generated by applying different biological parameters to each epistatic model. These files can be downloaded here. Their format is as follows:

#### models2488.dat

The file has the following structure,

| Model | G1 | G2 | G3 | G4 | G5 | G6 | G7 | G8 | G9 |
|---|---|---|---|---|---|---|---|---|---|
| ME1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 |
| ME2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | -1 |
| ME3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| ME4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 1 |
| ME5 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | -1 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |

The first column gives the extended model number. The other columns correspond to the 9 genotypic configurations of a two-locus genotype. They define the odds between cases and controls. We suppose that a genotype configuration can be risk, protective, or neutral, with values 1, -1, and 0 respectively. A value of 1 (risk genotype) must be substituted by α(1+θ), a value of -1 (protective genotype) must be substituted by αγ/(1+θ), and a value of 0 (neutral genotype) indicates that the odds is α. For instance, the model ME5 is defined by the odds table,

| | aa | Aa | AA |
|---|---|---|---|
| bb | α | α | α |
| Bb | α | α | α |
| BB | α | α(1+θ) | αγ/(1+θ) |

The values of α and θ depend on the biological parameters used by the model (see paper).
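
The three-way substitution for the extended models can be sketched in Python as follows. As before, the row-by-row layout of G1 to G9 (rows bb, Bb, BB and columns aa, Aa, AA) is inferred from the ME5 example. The function name `extended_odds_table` and the numeric parameter values are hypothetical.

```python
def extended_odds_table(model_row, alpha, theta, gamma):
    """Build the 3x3 odds table of an extended epistatic model.

    model_row holds the nine entries G1..G9, each 1 (risk),
    -1 (protective) or 0 (neutral).
    Assumption: G1..G9 are laid out row by row, rows bb, Bb, BB and
    columns aa, Aa, AA.
    """
    if len(model_row) != 9:
        raise ValueError("a model needs exactly 9 genotype entries")
    substitution = {
        0: alpha,                         # neutral genotype
        1: alpha * (1 + theta),           # risk genotype
        -1: alpha * gamma / (1 + theta),  # protective genotype
    }
    try:
        values = [substitution[flag] for flag in model_row]
    except KeyError:
        raise ValueError("entries must be 1, -1 or 0")
    return [values[0:3], values[3:6], values[6:9]]


# Model ME5 with illustrative parameter values.
me5 = extended_odds_table(
    [0, 0, 0, 0, 0, 0, 0, 1, -1], alpha=1.0, theta=1.0, gamma=0.5
)
```

A lookup table keeps the three cases next to each other, so the code can be compared directly with the substitution rule in the text.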

#### data2488.dat

The file has the following structure,

| Model | S1 | S2 | Prev | R | Odds | Cases G1 | ... | Cases G9 | Controls G1 | ... | Controls G9 |
|---|---|---|---|---|---|---|---|---|---|---|---|
| ME1 | 0.1 | 0.1 | 0.01 | 0.01 | 2 | 0.13 | ... | 0.07 | 0.03 | ... | 0.21 |
| ME1 | 0.1 | 0.1 | 0.01 | 0.01 | 3 | 0.27 | ... | 0.02 | 0.14 | ... | 0.02 |
| ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... | ... |

The first six columns (the parameters) give the model number (Model), the two allele frequencies (S1 and S2), the prevalence (Prev), the γ parameter (R), and the odds ratio (Odds), respectively. The remaining columns, from the seventh onwards, form two groups of nine, which define the genotype probabilities for cases and for controls. These values are computed from the epistatic model and the biological parameters in the first columns.