Simone Ecker
[email protected]
Medical Genomics, UCL Cancer Institute, London, UK
Last updated: 24 Jun 2020
The main R packages used for DNA methylation data preprocessing are:
Minfi
1ChAMP
2,3RnBeads
4,5
IDAT files obtained from Illumina’s iScan are directly imported into R as raw data using minfi
’s1,6 read function. The corresponding sample sheet and, if necessary, a specific array manifest file are also read in. Data are then converted into an RGChannelSet
containing the raw intensities in the green and red channels as well as the intensities of internal control probes.
For quality control in minfi
, the RGChannelSet
is converted into a MethylSet
to generate methylated and unmethylated signals from the red and green intensities according to the microarray’s probe design. This does not perform any normalization. Then, the following quality control plots are produced:
- Log median intensity in methylated and unmethylated channels via
getMeth()
andgetQC()
- Densities:
densityPlot()
- Density bean plots:
densitybeanPlot()
Minfi
can produce several control probes plots for quality assessment:
- The
plotQC()
function provides a simple quality control plot using the log median intensity in both the methylated and unmethylated channels where 'good' sampels should cluster together, while 'failed' samples tend to separate and have lower median intensities - Control probes plot for bisulfite conversion (for probe type I and II):
controlStripPlot()
- All these plots can be generated along with further informative control probes plots in one step using the function
qcReport()
In addition, quality checks as implemented in Illumina’s BeadArray Controls Reporter7 can be performed with an R package ewastools
published by Heiss et al.8 which evaluates 17 control metrics calculated from control probes. The following metrics are calculated and reported back in numbers and figures can also be generated (most of them are also shown in a different style in minfi
’s qcReport):
- Restoration
- Staining green
- Staining red
- Extension green
- Extension red
- Hybridization high/medium
- Hybridization medium/low
- Target removal 1
- Target removal 2
- Bisulfite conversion I green
- Bisulfite conversion I red
- Bisulfite conversion II
- Specificity I green
- Specificity I red
- Specificity II
- Non-polymorphic green
- Non-polymorphic red
HumMethQCReport
9 (currently not available for EPIC) generates the following figures:
- Barplots of Illumina internal controls
- Intensity at high and low Beta-values
- Percentage of non-detected CpGs
- Average detection p-value
Furthermore, control probes plots as provided by Illumina GenomeStudio10 can be obtained via the plotCtrl()
function of the R package ENmix
11:
- Bisulfite conversion I red and green
- Bisulfite conversion II red and green
- Extension red and green
- Hybridization red and green
- Negative red and green
- Non-polymorphic red and green
- Specificity I red and green
- Specificity II red and green
- Staining red and green
- Target removal red and green
- Norm A red and green
- Norm C red and green
- Norm G red and green
- Norm T red and green
- Norm ACGT red and green
An alternative version of these plots can be generated with shinyMethyl
12, an interactive tool providing a convinient way of identifying potentially problematic samples. The package also provides the possibility to visually inspect the distribution of covariates across slides (currently not working for EPIC).
The Minfi
, RnBeads
, ENmix
and ewastools
packages further provide the option to identify potential sample mix-ups using high-frequency SNP probes. For the same purpose of detecting possible mismatches, the sex of the sample donor can be determined based on the methylation signal at sex chromosomes. Sample contamination can be assessed with ewastools
.
A RatioSet
object can be generated to store actual methylation values instead of the methylated and unmethylated signals (a copy number matrix can also be obtained if required). The RatioSet
is generated with the function ratioConvert()
. Subsequently, both a Beta- and a M-value matrix are obtained from the RatioSet
via getBeta()
and getM()
. Alternatively, the methylation value matrices can also be obtained from an RGChannelset
or MethylSet
.
The following plots for quality control are produced from both the Beta- and M-value matrix:
- Sample-wise density plots colored by phenotype of interest
- Group-wise density plots (CpG-wise mean per group)
- Principal component analysis (PCA) using all CpGs, colored by phenotype of interest
- Multidimensional scaling (MDS) using all CpGs, colored by phenotype of interest
- Hierarchical clusterings (complete, average, median, wardD, centroid and single) performed on Euclidean distances, including a heatmap and dendrograms, colored by phenotype of interest
- PCA, MDS and clusterings as above, but colored by all additional covariates to be assessed, in particular, technical (or biological) variables that could potentially lead to batch effects
- Singular Value Decomposition (SVD) plots examining associations between principal components and all technical and biological covariates available
- Age distribution per phenotype of interest (overlayed histograms or density plots)
Probes meeting the following criteria are removed from all samples:
- Detection p-values > 0.01 in ≥ 1 sample or more stringent thresholds as recently suggested12,13
- Bead count < 3 in ≥ 5% of the samples
- Non-cg probes
- Ambiguous genomic locations (cross-hybridization) as provided by Zhou et al.14 (not to be removed for epigenetic clock analyses)
- SNPs of the corresponding population provided by Zhou et al.14 (not to be removed for epigenetic clock analyses)
- Sex chromosomes (unless they are needed for specific analyses, and also not to be removed for epigenetic clock analyses)
- Optionally: Non-variable probes (only if required for specific reasons)
Bisulfite conversion of DNA and other experimental steps introduce assay variability and batch effects. Technical variation should be reduced as much as possible. Illumina 450K and EPIC BeadChips use two fluorescent dyes (Cy3-green and Cy5-red) and two different probe types (Infinium I and Infinium II), leading to technical biases which should be corrected. Background noise also needs to be removed. To this aim, negative control probes and out-of-band (OOB) probes are present on Illumina’s arrays.
Many different normalization methods are available and the normalization method to be applied depends on each dataset, the research question and the downstream analyses to be performed. Usually, suitable normalization methods are performed (on either Beta- or M-values as required by the method), and their results are compared to each other. Then, the quality assessment steps listed in section 3.4 are repeated to help to choose the most appropriate method(s) for the specific dataset and research question. Individual steps from different normalization methods can also be combined and performed sequentially to obtain optimal results.
For example, normalization methods that perform (only) specific tasks such as dye bias, probe type bias or background correction may be followed by normalization methods that remove technical variance across arrays. Several normalization methods combine the correction of the above-mentioned technical biases and subsequent global normalization including both within- and across-sample normalization to remove unwanted technical variance. Across-sample normalization may not always be suitable15. A few methods that only apply within-sample normalization are available such as for example PBC and ssNoob. These normalization methods are particularly useful for datasets where the global distribution of DNA methylation levels is expected to differ significantly across samples15 and for projects where the normalization of single samples needs to be reproducible independently from the rest of the samples, or where new additional samples will be added to the dataset over time6.
- Illumina1 (available in
minfi
):
reverse engineered Illumina GenomeStudio10 control normalization with background subtraction - Lumi16 (available in
methylumi
):
background correction, variance stabilization and normalization - PBC17 (Peak-Based Correction, available in
ChAMP
):
intra-array probe type bias correction based on global profiles - Quantile1,18 (available in
minfi
):
global quantile normalization performing both sample normalization and probe type correction - Quantile stratified1,18 (available in
minfi
):
as above, but with quantile normalization performed within genomic region strata - SWAN19 (Subset-quantile Within Array Normalization, available in
ChAMP
):
reduces technical variation within and between arrays - BMIQ20 (Beta-MIxture Quantile normalization, available in
ChAMP
):
intra-array normalization which adjusts for probe type bias - Noob21(Normal-exponential Out-Of-Band, available in
minfi
):
background correction method with dye-bias normalization - Data-driven normalization22 (available in
wateRmelon
):
background correction, between-array quantile normalization of methylated and unmethylated signal intensities and the two probe types separately (for ‘dasen’ normalization, for example) - FunNorm23 (Functional Normalization, available in
ChAMP
):
between-array quantile normalization of methylated and unmethylated signal intensities and the two probe types separately, uses control probes to remove unwanted technical variation - ENmix11(Exponential-Normal mixture signal intensity background correction,
ENmix
package):**
background correction, probe type (RCP24, Regression on Correlated Probes) and dye bias correction (RELIC25, Regression on Logarithm of Internal Control probes), inter-array normalization - ssNoob6 (singe-sample Normal-exponential out-of-band, available in
minfi
):
intra-array normalization suitable for incremental preprocessing of individual samples and when integrating data from multiple generations of Illumina microarrays - SeSAMe26(SEnsible Step-wise Analysis of Methylation data,
SeSAMe
package):
probe quality masking withpOOBAH
(p-value with Out-Of-Band probes for Array Hybridization), bleed-through correction in background subtraction, non-linear dye bias correction, control for bisulfite conversion, reduction of inter- and intra-array technical variation
In our experience, both BMIQ and Funnorm together with ENmix's probe type correction often give good results with successful reduction of technical variation and correction for background and probe type bias. Dye-bias normalization can be achieved with Noob, for example.
The mentioned methods provide appropriate results for analyses of global DNA methylation and variability patterns, rank-based approaches, site- and region-based analysis of differential mean and variability of DNA methylation, and epigenetic clock analyses. Probe type bias correction is particularly important for the assessment of global DNA methylation patterns, clusterings, rank-based approaches and region-based analyses. The best possible removal of all technical variation is crucial under all circumstances, but even more critical for analyses of biological variation such as epigenetic drift and differential variability.
SWAN works well when no global shifts in DNA methylation are expected while BMIQ is better suited when significant global differences are present between the phenotypes of interest (see above). This can be the case when comparing normal samples with samples of a cancer type in which global hypomethylation occurs, for example. The quantro method15 can aid the decision by assessing such global differences.
Quantile and quantile stratified normalization are less well suited for DNA methylation analyses and can introduce additional bias. Illumina’s GenomeStudio normalization method does not reduce technical variation27,28. It can even introduce additional variability29. FunNorm and BMIQ can sometimes also lead to biased distributions, in particular for DNA methylation M-values. SWAN, BMIQ and PBC have been shown to outperform other methods27,29,30. PBC can lead to discontinuities in probe type II density distributions under certain circumstances30 and might thus not always be the best suitable approach. All principally suitable methods should be applied to the data, and their results should be assessed carefully (see above).
Useful methods to detect batch effects and biological confounders are Surrogate Variable Analysis (SVA) implemented in the sva
package31 and Singular Value Decomposition (SVD)32 available in ChAMP
, as well as the pcrplot()
function of ENMix
which uses a linear regression approach but might work less well than SVD for categorical variables. These methods assess the impact of significant components of variation present in a dataset. Unknown sources of variation can be detected, and importantly, the contribution of all available technical and biological covariates can and should be assessed. One 450K or EPIC BeadChip accommodates multiple samples (n=12 and n=8, respectively) which can cause a batch effect associated with array ID (‘Sentrix ID’). The arrays’ position on the BeadChip (row and column, ‘Sentrix Position’) also impacts the data. If any significant batch effects or unwanted confounders are observed, the data can be corrected for these effects by the application of ComBat33 as long as they are not directly confounded with the phenotype of interest.
ComBat is the most widely used method for batch effect correction. It is implemented in sva
, minfi
, ChAMP
and RnBeads
and usually works best on DNA methylation M-values. However, sometimes better results are achieved when applying ComBat to Beta-values instead. Thus, it can be useful to perform ComBat on both M-values and Beta-values and assess the results with the help of SVA/SVD and the quality assessment plots described in section 3.4 to compare them to each other.
When working with composite samples such as whole blood, peripheral blood mononuclear cells (PBMCs), or brain tissue, cell composition differences across samples may need to be corrected. The most common reference-based method to assess and correct for subpopulation differences is Houseman’s algorithm34 in combination with a blood cell reference dataset of the six most common white blood cell types35 available in minfi
, ChAMP
and RnBeads
. A newer blood cell reference was published in 201836. Here, blood cells were isolated from 31 male and six female healthy donors while for the previous reference dataset cells were obtained from six adult males.
Reference-free methods also exist37–41. The results of the correction should be assessed visually by plotting the cell type distributions per sample, mean cell type distributions per group, and by repeating the quality assessment plots that are listed in section 3.4.
Two interesting new methods to identify the cell types driving differential DNA methylation signal in composite samples such as whole blood have recently been developed: CellDMC (DMC = Differentially Methylated Cytosines)42 and TCA (Tensor Composition Analysis)43.
Based on the initial quality assessment, the control probe plots, and the visual inspection of all global plots (densities, PCA, MDS and hierarchical clusterings, see section 3.4) performed before and after each preprocessing step, samples that turn out to be outliers corresponding to multiple measurements and plots might be removed from the dataset for downstream analyses.
To further aid the detection of potential outlier samples, the following functions and algorithms can be used:
detectOutlier()
of the R packageLumi
outlyx()
of the R packagewateRmelon
- The locFDR outlier detection method as described in Hannum et al., Mol Cell, 201344
As mentioned, some methods require Beta-values as input. M-values can be converted into Beta-values and vice versa using the following functions implemented in RnBeads:
beta2mval <- function(betas, epsilon = 0.00001) {
if (!is.numeric(betas)) {
stop("invalid value for betas")
}
if (!(is.numeric(epsilon) && length(epsilon) == 1 && (!is.na(epsilon)))) {
stop("invalid value for epsilon")
}
if (epsilon < 0 || epsilon > 0.5) {
stop("invalid value for epsilon; expected 0 <= epsilon <= 0.5")
}
betas[betas < epsilon] <- epsilon
betas[betas > (1 - epsilon)] <- 1 - epsilon
return(log2(betas / (1 - betas)))
}
mval2beta <- function(mvals) {
if (!is.numeric(mvals)) {
stop("invalid value for mvals")
}
intmf<-2^(mvals)
return(intmf/(intmf+1))
}
M-values are used for all downstream statistical analyses such as differential mean methylation or analyses of variability and epigenetic drift. Beta-values are only used for the visualization of DNA methylation patterns.
An exception are epigenetic clocks to perform epigenetic age estimations45–48 and other epigenetic predictors. Most of these methods require Beta-values as input.
-
Aryee, M. J. et al. Minfi: a flexible and comprehensive Bioconductor package for the analysis of Infinium DNA methylation microarrays. Bioinformatics 30, 1363–9 (2014).
-
Morris, T. J. et al. ChAMP: 450k Chip Analysis Methylation Pipeline. Bioinformatics 30, 428–430 (2013).
-
Tian, Y. et al. ChAMP: updated methylation analysis pipeline for Illumina BeadChips. Bioinformatics 2014–6 (2017). doi:10.1093/bioinformatics/btx513
-
Assenov, Y. et al. Comprehensive analysis of DNA methylation data with RnBeads. Nat Methods 11, (2014).
-
Mueller, F. et al. RnBeads 2.0: comprehensive analysis of DNA methylation data. Genome Biol 20, 55 (2019).
-
Fortin, J. P., Triche, T. J. & Hansen, K. D. Preprocessing, normalization and integration of the Illumina HumanMethylationEPIC array with minfi. Bioinformatics 33, 558–60 (2017).
-
Illumina Inc. BeadArray Controls Reporter: Software Guide. (2015). doi:10.17485/IJST/2016/V9I47/94892
-
Heiss, J. A. & Just, A. C. Identifying mislabeled and contaminated DNA methylation microarray data: An extended quality control toolset with examples from GEO. Clin Epigenetics 10, (2018).
-
Mancuso, F. M., Montfort, M., Carreras, A., Alibés, A. & Roma, G. HumMeth27QCReport: An R package for quality control and primary analysis of Illumina Infinium methylation data. BMC Res Notes 4, 546 (2011).
-
Illumina Inc. GenomeStudio Methylation Module v1.8 User Guide. Illumina (2008).
-
Xu, Z., Niu, L., Li, L. & Taylor, J. A.: ENmix A novel background correction method for Illumina HumanMethylation450 BeadChip. Nucleic Acids Res 44, e20 (2016).
-
Lehne, B. et al. A coherent approach for analysis of the Illumina HumanMethylation450 BeadChip improves data quality and performance in epigenome-wide association studies. Genome Biol 16, 37 (2015).
-
Fortin J., Fertig E. J. & Hansen K. D. shinyMethyl: interactive quality control of Illumina 450k DNA methylation arrays in R. F1000Research, 3, 175 (2014).
-
Heiss, J. A. & Just, A. C. Improved filtering of DNA methylation microarray data by detection p values and its impact on downstream analyses. Clin Epigenetics 11, 15 (2019).
-
Zhou, W., Laird, P. W. & Shen, H. Comprehensive characterization, annotation and innovative use of Infinium DNA methylation BeadChip probes. Nucleic Acids Res 45, e22 (2017).
-
Hicks, S. & Irizarry, R. Quantro: a Data-Driven Approach To Guide the Choice of an Appropriate Normalization Method. Genome Biol 16, 117 (2015).
-
Du, P., Kibbe, W. A. & Lin, S. M. lumi: A pipeline for processing Illumina microarray. Bioinformatics 24, 1547–8 (2008).
-
Dedeurwaerder, S. et al. Evaluation of the Infinium Methylation 450K technology. Epigenomics 3, 771–784 (2011).
-
Touleimat, N. & Tost, J. Complete pipeline for Infinium® Human Methylation 450K BeadChip data processing using subset quantile normalization for accurate DNA methylation estimation. Epigenomics 4, 325–41 (2012).
-
Maksimovic, J., Gordon, L. & Oshlack, A. SWAN: Subset-quantile Within Array Normalization for Illumina Infinium HumanMethylation450 BeadChips. Genome Biol 13, R44 (2012).
-
Teschendorff, A. E. et al. A beta-mixture quantile normalization method for correcting probe design bias in Illumina Infinium 450 k DNA methylation data. Bioinformatics 29, 189–96 (2013).
-
Triche, T. J., Weisenberger, D. J., Van Den Berg, D., Laird, P. W. & Siegmund, K. D. Low-level processing of Illumina Infinium DNA Methylation BeadArrays. Nucleic Acids Res 41, e90 (2013).
-
Pidsley, R. et al. A data-driven approach to preprocessing Illumina 450K methylation array data. BMC Genomics 14, 293 (2013).
-
Fortin, J.-P. et al. Functional normalization of 450k methylation array data improves replication in large cancer studies. Genome Biol 15, 503 (2014).
-
Niu, L., Xu, Z. & Taylor, J. A. RCP: A novel probe design bias correction method for Illumina Methylation BeadChip. Bioinformatics 32, 2659–63 (2016).
-
Xu, Z., Langie, S. A. S., De Boever, P., Taylor, J. A. & Niu, L. RELIC: A novel dye-bias correction method for Illumina Methylation BeadChip. BMC Genomics 18, 4 (2017).
-
Zhou, W., Triche, T. J., Laird, P. W. & Shen, H. SeSAMe: reducing artifactual detection of DNA methylation by Infinium BeadChips in genomic deletions. Nucleic Acids Res 46, e123 (2018).
-
Marabita, F. et al. An evaluation of analysis pipelines for DNA methylation profiling using the illumina humanmethylation450 BeadChip platform. Epigenetics 8, 333–46 (2013).
-
Yousefi, P. et al. Considerations for normalization of DNA methylation data by Illumina 450K BeadChip assay in population studies. Epigenetics 8, 1141–52 (2013).
-
Wu, M. C. et al. A systematic assessment of normalization approaches for the Infinium 450K methylation platform. Epigenetics 9, 318–29 (2014).
-
Wang, T. et al. A systematic study of normalization methods for Infinium 450K methylation data using whole-genome bisulfite sequencing data. Epigenetics 10, 662–9 (2015).
-
Leek, J. T., Johnson, W. E., Parker, H. S., Jaffe, A. E. & Storey, J. D. The SVA package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics 28, 882–3 (2012).
-
Teschendorff, A. E. et al. An epigenetic signature in peripheral blood predicts active ovarian cancer. PLoS One 4, e8274 (2009).
-
Johnson, W. E., Li, C. & Rabinovic, A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8, 118–27 (2007).
-
Houseman, E. A. et al. DNA methylation arrays as surrogate measures of cell mixture distribution. BMC Bioinformatics 13, (2012).
-
Reinius, L. E. et al. Differential DNA methylation in purified human blood cells: implications for cell lineage and studies on disease susceptibility. PLoS One 7, (2012).
-
Salas, L. A. et al. An optimized library for reference-based deconvolution of whole-blood biospecimens assayed using the Illumina HumanMethylationEPIC BeadArray. Genome Biol 19, 64 (2018).
-
Houseman, E. A. et al. Reference-free deconvolution of DNA methylation data and mediation by cell composition effects. BMC Bioinformatics 17, 259 (2016).
-
Kaushal, A. et al. Comparison of different cell type correction methods for genome-scale epigenetics studies. BMC Bioinformatics 18, 216 (2017).
-
Zou, J., Lippert, C., Heckerman, D., Aryee, M. & Listgarten, J. Epigenome-wide association studies without the need for cell-type composition. Nat Methods 11, 309–11 (2014).
-
Rahmani, E. et al. Sparse PCA corrects for cell type heterogeneity in epigenome-wide association studies. Nat Methods 13, 443–445 (2016).
-
Rahmani, E. et al. Correcting for cell-type heterogeneity in DNA methylation: a comprehensive evaluation. Nat Methods 14, 218–219 (2017).
-
Zheng, S. C., Breeze, C. E., Beck, S. & Teschendorff, A. Identification of differentially methylated cell-types in Epigenome-Wide Association Studies. Nat Genet Methods 15, 1059–66 (2018).
-
Rahmani, E. et al. Cell-type-specific resolution epigenetics without the need for cell sorting or single-cell biology. Nat Commun 10, 3417 (2019).
-
Hannum, G. et al. Genome-wide Methylation Profiles Reveal Quantitative Views of Human Aging Rates. Mol Cell 49, 359–67 (2013).
-
Horvath, S. DNA methylation age of human tissues and cell types. Genome Biol 14, R115 (2013).
-
Levine, M. E. et al. An epigenetic biomarker of aging for lifespan and healthspan. Aging (Albany NY) 10, 573–91 (2018).
-
Horvath, S. et al. Epigenetic clock for skin and blood cells applied to Hutchinson Gilford Progeria Syndrome and ex vivo studies. Aging (Albany NY) 10, 1758–75 (2018).
-
Lu, A. T. et al. DNA methylation GrimAge strongly predicts lifespan and healthspan. Aging (Albany NY) 11, 303-327 (2019).