DNA Methylation Data Preprocessing Summary

Simone Ecker s.ecker@ucl.ac.uk
Medical Genomics, UCL Cancer Institute, London, UK

Last updated: 24 Jun 2020

1 Main R packages

The main R packages used for DNA methylation data preprocessing are:

Minfi¹
ChAMP^2,3
RnBeads^4,5

2 Data import

IDAT files obtained from Illumina’s iScan are directly imported into R as raw data using minfi’s^1,6 read function. The corresponding sample sheet and, if necessary, a specific array manifest file are also read in. Data are then converted into an RGChannelSet containing the raw intensities in the green and red channels as well as the intensities of internal control probes.

3 Quality control

3.1 Initial quality control

For quality control in minfi, the RGChannelSet is converted into a MethylSet to generate methylated and unmethylated signals from the red and green intensities according to the microarray’s probe design. This does not perform any normalization. Then, the following quality control plots are produced:

Log median intensity in methylated and unmethylated channels via getMeth() and getQC()
Densities: densityPlot()
Density bean plots: densitybeanPlot()

3.2 Quality assessment using control probes

Minfi can produce several control probes plots for quality assessment:

The plotQC()function provides a simple quality control plot using the log median intensity in both the methylated and unmethylated channels where 'good' sampels should cluster together, while 'failed' samples tend to separate and have lower median intensities
Control probes plot for bisulfite conversion (for probe type I and II): controlStripPlot()
All these plots can be generated along with further informative control probes plots in one step using the function qcReport()

In addition, quality checks as implemented in Illumina’s BeadArray Controls Reporter⁷ can be performed with an R package ewastools published by Heiss et al.⁸ which evaluates 17 control metrics calculated from control probes. The following metrics are calculated and reported back in numbers and figures can also be generated (most of them are also shown in a different style in minfi’s qcReport):

Restoration
Staining green
Staining red
Extension green
Extension red
Hybridization high/medium
Hybridization medium/low
Target removal 1
Target removal 2
Bisulfite conversion I green
Bisulfite conversion I red
Bisulfite conversion II
Specificity I green
Specificity I red
Specificity II
Non-polymorphic green
Non-polymorphic red

HumMethQCReport⁹ (currently not available for EPIC) generates the following figures:

Barplots of Illumina internal controls
Intensity at high and low Beta-values
Percentage of non-detected CpGs
Average detection p-value

Furthermore, control probes plots as provided by Illumina GenomeStudio¹⁰ can be obtained via the plotCtrl() function of the R package ENmix¹¹:

Bisulfite conversion I red and green
Bisulfite conversion II red and green
Extension red and green
Hybridization red and green
Negative red and green
Non-polymorphic red and green
Specificity I red and green
Specificity II red and green
Staining red and green
Target removal red and green
Norm A red and green
Norm C red and green
Norm G red and green
Norm T red and green
Norm ACGT red and green

An alternative version of these plots can be generated with shinyMethyl¹², an interactive tool providing a convinient way of identifying potentially problematic samples. The package also provides the possibility to visually inspect the distribution of covariates across slides (currently not working for EPIC).

3.3 Identification of sample mismatches

The Minfi, RnBeads, ENmix and ewastools packages further provide the option to identify potential sample mix-ups using high-frequency SNP probes. For the same purpose of detecting possible mismatches, the sex of the sample donor can be determined based on the methylation signal at sex chromosomes. Sample contamination can be assessed with ewastools.

3.4 Quality assessment based on all cg probes used for downstream analyses

A RatioSet object can be generated to store actual methylation values instead of the methylated and unmethylated signals (a copy number matrix can also be obtained if required). The RatioSet is generated with the function ratioConvert(). Subsequently, both a Beta- and a M-value matrix are obtained from the RatioSet via getBeta() and getM(). Alternatively, the methylation value matrices can also be obtained from an RGChannelset or MethylSet.

The following plots for quality control are produced from both the Beta- and M-value matrix:

Sample-wise density plots colored by phenotype of interest
Group-wise density plots (CpG-wise mean per group)
Principal component analysis (PCA) using all CpGs, colored by phenotype of interest
Multidimensional scaling (MDS) using all CpGs, colored by phenotype of interest
Hierarchical clusterings (complete, average, median, wardD, centroid and single) performed on Euclidean distances, including a heatmap and dendrograms, colored by phenotype of interest
PCA, MDS and clusterings as above, but colored by all additional covariates to be assessed, in particular, technical (or biological) variables that could potentially lead to batch effects
Singular Value Decomposition (SVD) plots examining associations between principal components and all technical and biological covariates available
Age distribution per phenotype of interest (overlayed histograms or density plots)

4 Probe Filtering

Probes meeting the following criteria are removed from all samples:

Detection p-values > 0.01 in ≥ 1 sample or more stringent thresholds as recently suggested^12,13
Bead count < 3 in ≥ 5% of the samples
Non-cg probes
Ambiguous genomic locations (cross-hybridization) as provided by Zhou et al.¹⁴ (not to be removed for epigenetic clock analyses)
SNPs of the corresponding population provided by Zhou et al.¹⁴ (not to be removed for epigenetic clock analyses)
Sex chromosomes (unless they are needed for specific analyses, and also not to be removed for epigenetic clock analyses)
Optionally: Non-variable probes (only if required for specific reasons)

5 Normalization

Bisulfite conversion of DNA and other experimental steps introduce assay variability and batch effects. Technical variation should be reduced as much as possible. Illumina 450K and EPIC BeadChips use two fluorescent dyes (Cy3-green and Cy5-red) and two different probe types (Infinium I and Infinium II), leading to technical biases which should be corrected. Background noise also needs to be removed. To this aim, negative control probes and out-of-band (OOB) probes are present on Illumina’s arrays.

Many different normalization methods are available and the normalization method to be applied depends on each dataset, the research question and the downstream analyses to be performed. Usually, suitable normalization methods are performed (on either Beta- or M-values as required by the method), and their results are compared to each other. Then, the quality assessment steps listed in section 3.4 are repeated to help to choose the most appropriate method(s) for the specific dataset and research question. Individual steps from different normalization methods can also be combined and performed sequentially to obtain optimal results.

For example, normalization methods that perform (only) specific tasks such as dye bias, probe type bias or background correction may be followed by normalization methods that remove technical variance across arrays. Several normalization methods combine the correction of the above-mentioned technical biases and subsequent global normalization including both within- and across-sample normalization to remove unwanted technical variance. Across-sample normalization may not always be suitable¹⁵. A few methods that only apply within-sample normalization are available such as for example PBC and ssNoob. These normalization methods are particularly useful for datasets where the global distribution of DNA methylation levels is expected to differ significantly across samples¹⁵ and for projects where the normalization of single samples needs to be reproducible independently from the rest of the samples, or where new additional samples will be added to the dataset over time⁶.

Illumina¹ (available in minfi):
reverse engineered Illumina GenomeStudio¹⁰ control normalization with background subtraction
Lumi¹⁶ (available in methylumi):
background correction, variance stabilization and normalization
PBC¹⁷ (Peak-Based Correction, available in ChAMP):
intra-array probe type bias correction based on global profiles
Quantile^1,18 (available in minfi):
global quantile normalization performing both sample normalization and probe type correction
Quantile stratified^1,18 (available in minfi):
as above, but with quantile normalization performed within genomic region strata
SWAN¹⁹ (Subset-quantile Within Array Normalization, available in ChAMP):
reduces technical variation within and between arrays
BMIQ²⁰ (Beta-MIxture Quantile normalization, available in ChAMP):
intra-array normalization which adjusts for probe type bias
Noob²¹(Normal-exponential Out-Of-Band, available in minfi):
background correction method with dye-bias normalization
Data-driven normalization²² (available in wateRmelon):
background correction, between-array quantile normalization of methylated and unmethylated signal intensities and the two probe types separately (for ‘dasen’ normalization, for example)
FunNorm²³ (Functional Normalization, available in ChAMP):
between-array quantile normalization of methylated and unmethylated signal intensities and the two probe types separately, uses control probes to remove unwanted technical variation
ENmix¹¹(Exponential-Normal mixture signal intensity background correction, ENmix package):**
background correction, probe type (RCP²⁴, Regression on Correlated Probes) and dye bias correction (RELIC²⁵, Regression on Logarithm of Internal Control probes), inter-array normalization
ssNoob⁶ (singe-sample Normal-exponential out-of-band, available in minfi):
intra-array normalization suitable for incremental preprocessing of individual samples and when integrating data from multiple generations of Illumina microarrays
SeSAMe²⁶(SEnsible Step-wise Analysis of Methylation data, SeSAMe package):
probe quality masking with pOOBAH (p-value with Out-Of-Band probes for Array Hybridization), bleed-through correction in background subtraction, non-linear dye bias correction, control for bisulfite conversion, reduction of inter- and intra-array technical variation

In our experience, both BMIQ and Funnorm together with ENmix's probe type correction often give good results with successful reduction of technical variation and correction for background and probe type bias. Dye-bias normalization can be achieved with Noob, for example.

The mentioned methods provide appropriate results for analyses of global DNA methylation and variability patterns, rank-based approaches, site- and region-based analysis of differential mean and variability of DNA methylation, and epigenetic clock analyses. Probe type bias correction is particularly important for the assessment of global DNA methylation patterns, clusterings, rank-based approaches and region-based analyses. The best possible removal of all technical variation is crucial under all circumstances, but even more critical for analyses of biological variation such as epigenetic drift and differential variability.

SWAN works well when no global shifts in DNA methylation are expected while BMIQ is better suited when significant global differences are present between the phenotypes of interest (see above). This can be the case when comparing normal samples with samples of a cancer type in which global hypomethylation occurs, for example. The quantro method¹⁵ can aid the decision by assessing such global differences.

Quantile and quantile stratified normalization are less well suited for DNA methylation analyses and can introduce additional bias. Illumina’s GenomeStudio normalization method does not reduce technical variation^27,28. It can even introduce additional variability²⁹. FunNorm and BMIQ can sometimes also lead to biased distributions, in particular for DNA methylation M-values. SWAN, BMIQ and PBC have been shown to outperform other methods^27,29,30. PBC can lead to discontinuities in probe type II density distributions under certain circumstances³⁰ and might thus not always be the best suitable approach. All principally suitable methods should be applied to the data, and their results should be assessed carefully (see above).

6 Batch-effect correction

Useful methods to detect batch effects and biological confounders are Surrogate Variable Analysis (SVA) implemented in the sva package³¹ and Singular Value Decomposition (SVD)³² available in ChAMP, as well as the pcrplot() function of ENMix which uses a linear regression approach but might work less well than SVD for categorical variables. These methods assess the impact of significant components of variation present in a dataset. Unknown sources of variation can be detected, and importantly, the contribution of all available technical and biological covariates can and should be assessed. One 450K or EPIC BeadChip accommodates multiple samples (n=12 and n=8, respectively) which can cause a batch effect associated with array ID (‘Sentrix ID’). The arrays’ position on the BeadChip (row and column, ‘Sentrix Position’) also impacts the data. If any significant batch effects or unwanted confounders are observed, the data can be corrected for these effects by the application of ComBat³³ as long as they are not directly confounded with the phenotype of interest.

ComBat is the most widely used method for batch effect correction. It is implemented in sva, minfi, ChAMP and RnBeads and usually works best on DNA methylation M-values. However, sometimes better results are achieved when applying ComBat to Beta-values instead. Thus, it can be useful to perform ComBat on both M-values and Beta-values and assess the results with the help of SVA/SVD and the quality assessment plots described in section 3.4 to compare them to each other.

7 Subpopulation correction

When working with composite samples such as whole blood, peripheral blood mononuclear cells (PBMCs), or brain tissue, cell composition differences across samples may need to be corrected. The most common reference-based method to assess and correct for subpopulation differences is Houseman’s algorithm³⁴ in combination with a blood cell reference dataset of the six most common white blood cell types³⁵ available in minfi, ChAMP and RnBeads. A newer blood cell reference was published in 2018³⁶. Here, blood cells were isolated from 31 male and six female healthy donors while for the previous reference dataset cells were obtained from six adult males.

Reference-free methods also exist^37–41. The results of the correction should be assessed visually by plotting the cell type distributions per sample, mean cell type distributions per group, and by repeating the quality assessment plots that are listed in section 3.4.

Two interesting new methods to identify the cell types driving differential DNA methylation signal in composite samples such as whole blood have recently been developed: CellDMC (DMC = Differentially Methylated Cytosines)⁴² and TCA (Tensor Composition Analysis)⁴³.

8 Sample filtering

Based on the initial quality assessment, the control probe plots, and the visual inspection of all global plots (densities, PCA, MDS and hierarchical clusterings, see section 3.4) performed before and after each preprocessing step, samples that turn out to be outliers corresponding to multiple measurements and plots might be removed from the dataset for downstream analyses.

To further aid the detection of potential outlier samples, the following functions and algorithms can be used:

detectOutlier() of the R package Lumi
outlyx() of the R package wateRmelon
The locFDR outlier detection method as described in Hannum et al., Mol Cell, 2013⁴⁴

9 Beta- and M-value conversion

As mentioned, some methods require Beta-values as input. M-values can be converted into Beta-values and vice versa using the following functions implemented in RnBeads:

  beta2mval <- function(betas, epsilon = 0.00001) {
    if (!is.numeric(betas)) {
      stop("invalid value for betas")
    }
    if (!(is.numeric(epsilon) && length(epsilon) == 1 && (!is.na(epsilon)))) {
      stop("invalid value for epsilon")
    }
    if (epsilon < 0 || epsilon > 0.5) {
      stop("invalid value for epsilon; expected 0 <= epsilon <= 0.5")
    }
    betas[betas < epsilon] <- epsilon
    betas[betas > (1 - epsilon)] <- 1 - epsilon
    return(log2(betas / (1 - betas)))
  }

  mval2beta <- function(mvals) {
    if (!is.numeric(mvals)) {
      stop("invalid value for mvals")
    }
    intmf<-2^(mvals)
    return(intmf/(intmf+1))
  }

10 Additional notes

M-values are used for all downstream statistical analyses such as differential mean methylation or analyses of variability and epigenetic drift. Beta-values are only used for the visualization of DNA methylation patterns.

An exception are epigenetic clocks to perform epigenetic age estimations^45–48 and other epigenetic predictors. Most of these methods require Beta-values as input.

References

Aryee, M. J. et al. Minfi: a flexible and comprehensive Bioconductor package for the analysis of Infinium DNA methylation microarrays. Bioinformatics 30, 1363–9 (2014).
Morris, T. J. et al. ChAMP: 450k Chip Analysis Methylation Pipeline. Bioinformatics 30, 428–430 (2013).
Tian, Y. et al. ChAMP: updated methylation analysis pipeline for Illumina BeadChips. Bioinformatics 2014–6 (2017). doi:10.1093/bioinformatics/btx513
Assenov, Y. et al. Comprehensive analysis of DNA methylation data with RnBeads. Nat Methods 11, (2014).
Mueller, F. et al. RnBeads 2.0: comprehensive analysis of DNA methylation data. Genome Biol 20, 55 (2019).
Fortin, J. P., Triche, T. J. & Hansen, K. D. Preprocessing, normalization and integration of the Illumina HumanMethylationEPIC array with minfi. Bioinformatics 33, 558–60 (2017).
Illumina Inc. BeadArray Controls Reporter: Software Guide. (2015). doi:10.17485/IJST/2016/V9I47/94892
Heiss, J. A. & Just, A. C. Identifying mislabeled and contaminated DNA methylation microarray data: An extended quality control toolset with examples from GEO. Clin Epigenetics 10, (2018).
Mancuso, F. M., Montfort, M., Carreras, A., Alibés, A. & Roma, G. HumMeth27QCReport: An R package for quality control and primary analysis of Illumina Infinium methylation data. BMC Res Notes 4, 546 (2011).
Illumina Inc. GenomeStudio Methylation Module v1.8 User Guide. Illumina (2008).
Xu, Z., Niu, L., Li, L. & Taylor, J. A.: ENmix A novel background correction method for Illumina HumanMethylation450 BeadChip. Nucleic Acids Res 44, e20 (2016).
Lehne, B. et al. A coherent approach for analysis of the Illumina HumanMethylation450 BeadChip improves data quality and performance in epigenome-wide association studies. Genome Biol 16, 37 (2015).
Fortin J., Fertig E. J. & Hansen K. D. shinyMethyl: interactive quality control of Illumina 450k DNA methylation arrays in R. F1000Research, 3, 175 (2014).
Heiss, J. A. & Just, A. C. Improved filtering of DNA methylation microarray data by detection p values and its impact on downstream analyses. Clin Epigenetics 11, 15 (2019).
Zhou, W., Laird, P. W. & Shen, H. Comprehensive characterization, annotation and innovative use of Infinium DNA methylation BeadChip probes. Nucleic Acids Res 45, e22 (2017).
Hicks, S. & Irizarry, R. Quantro: a Data-Driven Approach To Guide the Choice of an Appropriate Normalization Method. Genome Biol 16, 117 (2015).
Du, P., Kibbe, W. A. & Lin, S. M. lumi: A pipeline for processing Illumina microarray. Bioinformatics 24, 1547–8 (2008).
Dedeurwaerder, S. et al. Evaluation of the Infinium Methylation 450K technology. Epigenomics 3, 771–784 (2011).
Touleimat, N. & Tost, J. Complete pipeline for Infinium® Human Methylation 450K BeadChip data processing using subset quantile normalization for accurate DNA methylation estimation. Epigenomics 4, 325–41 (2012).
Maksimovic, J., Gordon, L. & Oshlack, A. SWAN: Subset-quantile Within Array Normalization for Illumina Infinium HumanMethylation450 BeadChips. Genome Biol 13, R44 (2012).
Teschendorff, A. E. et al. A beta-mixture quantile normalization method for correcting probe design bias in Illumina Infinium 450 k DNA methylation data. Bioinformatics 29, 189–96 (2013).
Triche, T. J., Weisenberger, D. J., Van Den Berg, D., Laird, P. W. & Siegmund, K. D. Low-level processing of Illumina Infinium DNA Methylation BeadArrays. Nucleic Acids Res 41, e90 (2013).
Pidsley, R. et al. A data-driven approach to preprocessing Illumina 450K methylation array data. BMC Genomics 14, 293 (2013).
Fortin, J.-P. et al. Functional normalization of 450k methylation array data improves replication in large cancer studies. Genome Biol 15, 503 (2014).
Niu, L., Xu, Z. & Taylor, J. A. RCP: A novel probe design bias correction method for Illumina Methylation BeadChip. Bioinformatics 32, 2659–63 (2016).
Xu, Z., Langie, S. A. S., De Boever, P., Taylor, J. A. & Niu, L. RELIC: A novel dye-bias correction method for Illumina Methylation BeadChip. BMC Genomics 18, 4 (2017).
Zhou, W., Triche, T. J., Laird, P. W. & Shen, H. SeSAMe: reducing artifactual detection of DNA methylation by Infinium BeadChips in genomic deletions. Nucleic Acids Res 46, e123 (2018).
Marabita, F. et al. An evaluation of analysis pipelines for DNA methylation profiling using the illumina humanmethylation450 BeadChip platform. Epigenetics 8, 333–46 (2013).
Yousefi, P. et al. Considerations for normalization of DNA methylation data by Illumina 450K BeadChip assay in population studies. Epigenetics 8, 1141–52 (2013).
Wu, M. C. et al. A systematic assessment of normalization approaches for the Infinium 450K methylation platform. Epigenetics 9, 318–29 (2014).
Wang, T. et al. A systematic study of normalization methods for Infinium 450K methylation data using whole-genome bisulfite sequencing data. Epigenetics 10, 662–9 (2015).
Leek, J. T., Johnson, W. E., Parker, H. S., Jaffe, A. E. & Storey, J. D. The SVA package for removing batch effects and other unwanted variation in high-throughput experiments. Bioinformatics 28, 882–3 (2012).
Teschendorff, A. E. et al. An epigenetic signature in peripheral blood predicts active ovarian cancer. PLoS One 4, e8274 (2009).
Johnson, W. E., Li, C. & Rabinovic, A. Adjusting batch effects in microarray expression data using empirical Bayes methods. Biostatistics 8, 118–27 (2007).
Houseman, E. A. et al. DNA methylation arrays as surrogate measures of cell mixture distribution. BMC Bioinformatics 13, (2012).
Reinius, L. E. et al. Differential DNA methylation in purified human blood cells: implications for cell lineage and studies on disease susceptibility. PLoS One 7, (2012).
Salas, L. A. et al. An optimized library for reference-based deconvolution of whole-blood biospecimens assayed using the Illumina HumanMethylationEPIC BeadArray. Genome Biol 19, 64 (2018).
Houseman, E. A. et al. Reference-free deconvolution of DNA methylation data and mediation by cell composition effects. BMC Bioinformatics 17, 259 (2016).
Kaushal, A. et al. Comparison of different cell type correction methods for genome-scale epigenetics studies. BMC Bioinformatics 18, 216 (2017).
Zou, J., Lippert, C., Heckerman, D., Aryee, M. & Listgarten, J. Epigenome-wide association studies without the need for cell-type composition. Nat Methods 11, 309–11 (2014).
Rahmani, E. et al. Sparse PCA corrects for cell type heterogeneity in epigenome-wide association studies. Nat Methods 13, 443–445 (2016).
Rahmani, E. et al. Correcting for cell-type heterogeneity in DNA methylation: a comprehensive evaluation. Nat Methods 14, 218–219 (2017).
Zheng, S. C., Breeze, C. E., Beck, S. & Teschendorff, A. Identification of differentially methylated cell-types in Epigenome-Wide Association Studies. Nat Genet Methods 15, 1059–66 (2018).
Rahmani, E. et al. Cell-type-specific resolution epigenetics without the need for cell sorting or single-cell biology. Nat Commun 10, 3417 (2019).
Hannum, G. et al. Genome-wide Methylation Profiles Reveal Quantitative Views of Human Aging Rates. Mol Cell 49, 359–67 (2013).
Horvath, S. DNA methylation age of human tissues and cell types. Genome Biol 14, R115 (2013).
Levine, M. E. et al. An epigenetic biomarker of aging for lifespan and healthspan. Aging (Albany NY) 10, 573–91 (2018).
Horvath, S. et al. Epigenetic clock for skin and blood cells applied to Hutchinson Gilford Progeria Syndrome and ex vivo studies. Aging (Albany NY) 10, 1758–75 (2018).
Lu, A. T. et al. DNA methylation GrimAge strongly predicts lifespan and healthspan. Aging (Albany NY) 11, 303-327 (2019).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

DNA_methylation_preprocessing_summary.md

DNA_methylation_preprocessing_summary.md

DNA Methylation Data Preprocessing Summary

1 Main R packages

2 Data import

3 Quality control

3.1 Initial quality control

3.2 Quality assessment using control probes

3.3 Identification of sample mismatches

3.4 Quality assessment based on all cg probes used for downstream analyses

4 Probe Filtering

5 Normalization

6 Batch-effect correction

7 Subpopulation correction

8 Sample filtering

9 Beta- and M-value conversion

10 Additional notes

References

Files

DNA_methylation_preprocessing_summary.md

Latest commit

History

DNA_methylation_preprocessing_summary.md

File metadata and controls

DNA Methylation Data Preprocessing Summary

1 Main R packages

2 Data import

3 Quality control

3.1 Initial quality control

3.2 Quality assessment using control probes

3.3 Identification of sample mismatches

3.4 Quality assessment based on all cg probes used for downstream analyses

4 Probe Filtering

5 Normalization

6 Batch-effect correction

7 Subpopulation correction

8 Sample filtering

9 Beta- and M-value conversion

10 Additional notes

References