PMA: Pathway Methylation Analysis

Determining differentially methylated pathways using Illumina 450K arrays

We have developed a pathway analysis approach for epigenome-wide methylation data. The method has been developed for data generated by the Illumina 450K array although it can be easily adapted to any other platform. Here you can download the source code of the method and the compiled binary package. We have included a small dataset to show the typical usage of the method.

Download

Package Source: PMA_0.1.tar.gz
Windows Binary: PMA_0.1.zip

Usage

After installing the package:

library("PMA")

We can upload the CpG probe mapping to genes:

data("Anot450k_flank2kb")

In this data frame, probes are mapped to the nearest gene (+/-2,000 flanking pairbases).

Anot450k_flank2kb[10000:10010,]
##            probe chr        bp      Gene
## 10000 cg14985987   1 222763457 TAF1A-AS1
## 10001 cg14988142   1 229643771    NUP133
## 10002 cg14988680   1 212688417      <NA>
## 10003 cg14989462   1  26798952     DHDDS
## 10004 cg14991984   1  99470129    PLPPR5
## 10005 cg14992232   1  20811153   CAMK2N1
## 10006 cg14993123   1   7831245     VAMP3
## 10007 cg14993530   1 245027115    HNRNPU
## 10008 cg14995036   1  94146705      <NA>
## 10009 cg14995716   1  32687918     EIF3I
## 10010 cg15007228   1   2210335       SKI

A customized probe mapping data set can be easily built. The four column names should be preserved, and the probes (rows) should be on the same order as the analysis dataset.

As a toy example, we can use the methylation data on isolated whole blood cell types available at the “FlowSorted.Blood.450k” package in Bioconductor https://www.bioconductor.org/ (1). We start by extracting the methylation Beta values.

library("FlowSorted.Blood.450k")
data(FlowSorted.Blood.450k)
B <- getBeta(FlowSorted.Blood.450k) 

We can now use PMA to look for differentially methylated pathways between different cell types (the methylation levels are large between cell types, so many pathways will appear significant in this case).
In this example we will test the differential methylation between CD19+ B lymphocytes against the remaining cell types. We will use a small pathway subset from the Biocarta database http://cgap.nci.nih.gov/Pathways/BioCarta_Pathways. This consists on a tab-delimited file, with the first column corresponding to the pathway or gene set name, and the remaining columns corresponding to the list of genes.

# Phenotype variable
pheno <- rep(0, ncol(B))
pheno[grep("CD19", colnames(B))] <- 1
# Output file name
pathout <- "CD19_DMPs.txt"
# Path for file with pathway gene listings
genesetpath <- "data/biocarta_small.txt"
# Run PMA 
PMA(B, pheno, covar=NULL, pathout=pathout, genesetpath=genesetpath,
       minNgenes = 10, maxNgenes = 300, annotation.data.frame = Anot450k_flank2kb,
       P.thresh=0.05, maxN=5, NPERM=100, P_THRESH=0.01, R_THRESH = 0.2,
       distCPG = 500,skipPermuting=TRUE, verbose=TRUE )

In this case, no covariates (covar argument) were included in the association testing. Covariables are included as a matrix, with columns corresponding to each of the relevant variables and rows corresponding to the value of each individual.

The analysis parameters in PMA are:

  1. Minimal and maximal number of genes per pathway (minNgenes and maxNgenes). Most pathways are in the 10 to 300 gene range so it is safe to use the default values. P.thresh is the P-value threshold for the selection of associated CpGs, and maxN corresponds to the maximal number of significant CpGs that are included to calculate the average statistic for each pathway.

  2. NPERM determines to the number of permutations that are used to calculate the empirical P-values. Precision will increase with the number of permutations but also will increase computation time. For large number of pathways, parallelization of this function might be convenient.

  3. R_THRESH and distCPG are the neighboring probe correlation parameters used in the probe filtering process; by default, probes at <500 basepairs and with a correlation higher than \(r^{2}\)=0.2 will be filtered out.

  4. skipPermuting and P_THRESH are used to avoid starting permutation calculations for a pathway with a non-significant original P-value.

Running the PMA function generates a tab-delimited file with the original association significance (“Pori”) and the empirical P-value generated after N permutations (“Pperm”) .

References

  1. Andrew E Jaffe. FlowSorted.Blood.450k: Illumina HumanMethylation data on sorted blood cell populations. R package version 1.6.0.