Introduction

In this section, the gene lists and genomic regions from the splitting process are overlapped with various databases. Six standardized columns are made for each database:

  • tgt: the target against which the overlap is computed
  • tot_tgt: total number of target entries
  • tot_da: total number of entries in the DAS (Differential Analysis Subset)
  • ov_da: overlap of entries from the DAS and the target
  • tot_nda: total number of entries not in the DAS
  • ov_nda: overlap of entries not in the DAS with entries in the target

Note: Entries not in the DAS are all the genes or regions (macs2 peaks or promoters) detected in the assay that are not present in the DAS.
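As an illustration of how the standardized columns relate to each other, here is a minimal Python sketch that derives them from plain sets of identifiers. The function name `overlap_columns` and the toy gene ids are hypothetical and are not part of the pipeline.

```python
def overlap_columns(target_name, target, da, nda):
    """Return the six standardized columns for one target.

    target, da and nda are sets of identifiers (for example gene ids).
    """
    return {
        "tgt": target_name,
        "tot_tgt": len(target),
        "tot_da": len(da),
        "ov_da": len(da & target),
        "tot_nda": len(nda),
        "ov_nda": len(nda & target),
    }


# Toy data (hypothetical): six detected genes, three of them in the DAS.
detected = {"g1", "g2", "g3", "g4", "g5", "g6"}
da = {"g1", "g2", "g3"}
nda = detected - da  # entries detected in the assay but not in the DAS
target = {"g2", "g3", "g6", "g9"}

row = overlap_columns("example_target", target, da, nda)
print(row)
```

Note that a target entry that was never detected in the assay (`g9` here) counts towards tot_tgt but towards neither overlap.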

These standardized columns are then used in subsequent processes to compute p-values and to make figures and tables. The columns that are unique to a particular analysis are described in the corresponding process.

The key of each DAS is then augmented by adding the EC (Enrichment Category) variable. Thus the key becomes: ${ET}__${PA}__${FC}__${TV}__${COMP}__${EC}.
The variables are those defined in the splitting process, plus EC:

  • ET: Experiment Type
  • PA: DAR Peak Annotation
  • FC: Fold Change type
  • TV: Threshold Value(s)
  • COMP: Comparison
  • EC: Enrichment Category

And EC can be any of these:

  • func_anno_{BP,MF,CC,KEGG}: Ontology databases GO_BP, GO_MF, GO_CC and KEGG
  • CHIP: Transcription factor CHIP-Seq profiles
  • chrom_states: Chromatin states from the specified chromatin state file
  • motifs: Transcription factor motif sequences
  • peaks_self: Genomic regions DASs from the current experiment
  • genes_self: Gene list DASs from the current experiment.

e.g.: key = ATAC__all__down__1000__hmg4_vs_ctl__func_anno_BP__enrich.
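The key layout can be sketched in a few lines of Python. The helper `make_enrichment_key` is hypothetical and only joins the six variables with a double underscore. The example key above also carries an `__enrich` suffix, which matches the suffix of the output file names, so the sketch treats it as a separate, assumed step.

```python
def make_enrichment_key(et, pa, fc, tv, comp, ec):
    """Join the six key variables with the double-underscore separator."""
    return "__".join([et, pa, fc, str(tv), comp, ec])


key = make_enrichment_key("ATAC", "all", "down", 1000, "hmg4_vs_ctl", "func_anno_BP")
file_stem = key + "__enrich"  # assumption: suffix as used in the output file names
print(key)
print(file_stem)
```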

Note: Please see the References section for details on how the external databases were downloaded and preprocessed, as well as details on the labels of the targets used in the figures and tables.

For all genomic region enrichment analyses, the regions not in the DAS are used as the background for computing the significance of the overlaps. For gene enrichment analysis, an option is provided (params.use_nda_as_bg_for_func_anno) to use either the non-DAS genes or all genes in the database as the background.

Enrichment__computing_functional_annotations_overlaps

Description

Overlap of gene lists with functional annotation databases is performed using clusterProfiler. These columns are added to the exported table:

  • tgt_id: the id of the ontology
  • genes_id: the list of enriched genes collapsed with a "/".

Parameters

  • params.do_func_anno_enrichment: enable or disable this process. Default: true.
  • params.use_nda_as_bg_for_func_anno: use non-differentially expressed genes as the background for the functional annotation enrichment analysis. If FALSE, all genes in the database are used. Default: 'FALSE'.
  • params.func_anno_databases: which database(s) to query for functional annotation enrichment analysis (KEGG, GO BP, GO CC or GO MF). Options: 'KEGG', 'CC', 'MF', 'BP'. Default: ['BP', 'KEGG'].
  • params.simplify_cutoff: Similarity cutoff used to remove redundant GO terms. Default: 0.8.

Enrichment__computing_genes_self_overlaps

Description

In this process, all gene sets from the DASs of the splitting process are overlapped with each other.
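A minimal sketch of such an all-against-all overlap, assuming each DAS gene set is available as a Python set indexed by its key. The function name `self_overlaps`, the shortened keys and the gene ids are hypothetical.

```python
from itertools import product


def self_overlaps(das_gene_sets):
    """Overlap every DAS gene set with every DAS gene set (itself included)."""
    rows = []
    for (key_a, genes_a), (key_b, genes_b) in product(das_gene_sets.items(), repeat=2):
        rows.append({
            "das": key_a,               # the DAS being tested
            "tgt": key_b,               # the DAS used as target
            "tot_tgt": len(genes_b),
            "tot_da": len(genes_a),
            "ov_da": len(genes_a & genes_b),
        })
    return rows


# Toy DASs (hypothetical keys, shortened for readability).
das_gene_sets = {
    "up": {"g1", "g2", "g3"},
    "down": {"g3", "g4"},
}
rows = self_overlaps(das_gene_sets)
for r in rows:
    print(r)
```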

Enrichment__computing_peaks_overlaps

Description

This process takes as input genomic regions (bed files) from various sources and overlaps them with the genomic regions (bed files) of DASs from the splitting process.
The input genomic regions are:

  • CHIP
  • Chromatin states (hiHMM or ChromHMM)
  • genomic regions of DASs from the splitting process -> for computing self overlap of genomic regions DASs within the experiment.
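The tool used to intersect the bed files is not stated here, so the following is only a plain-Python illustration of what an overlap count between two sets of regions means. It assumes BED-style half-open intervals `(chrom, start, end)`; the function name and coordinates are hypothetical.

```python
def count_overlapping(regions, targets):
    """Count regions that overlap at least one target interval.

    Intervals are (chrom, start, end) tuples, half-open as in BED files.
    """
    targets_by_chrom = {}
    for chrom, start, end in targets:
        targets_by_chrom.setdefault(chrom, []).append((start, end))

    n = 0
    for chrom, start, end in regions:
        candidates = targets_by_chrom.get(chrom, [])
        # Two half-open intervals overlap if each starts before the other ends.
        if any(start < t_end and t_start < end for t_start, t_end in candidates):
            n += 1
    return n


# Toy regions (hypothetical coordinates).
da_peaks = [("chrI", 100, 200), ("chrI", 500, 600), ("chrII", 100, 200)]
chip_peaks = [("chrI", 150, 300), ("chrII", 200, 250)]

ov_da = count_overlapping(da_peaks, chip_peaks)
print(ov_da)
```

With half-open coordinates, the chrII peak ending at 200 does not overlap the target starting at 200. The linear scan is quadratic in the worst case; a real implementation would use sorted intervals or a dedicated tool.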

Parameters

  • params.chromatin_state_1: Chromatin state to use. Options are listed in the references/${specie}/encode_chromatin_states_metadata.csv file. Mandatory. No default.
  • params.chip_ontology: CHIP ontology to use to filter the ENCODE CHIP files. Options are listed in the references/${specie}/available_chip_ontology_groups.txt file, and details on the groups can be found in the references/${specie}/encode_chip_metadata.csv file. Default: 'all'.

Enrichment__computing_motifs_overlaps

Description

This process uses HOMER to compute the overlap of genomic regions of DASs with CIS-BP motifs.

Parameters

  • params.do_motif_enrichment: enable or disable this process. Default: true.
  • params.homer__nb_threads: number of threads used by HOMER. Default: 6.

Outputs

  • Homer output folder: Processed_Data/3_Enrichment_Analysis/motifs__raw/${key}

Enrichment__reformatting_motifs_results

Description

HOMER results tables are formatted in R to add the standardized columns necessary for computing p-values.

Enrichment__computing_enrichment_pvalues

Description

This process takes the results of all overlap processes, estimates significance and formats the tables.

Hypergeometric minimum-likelihood two-sided p-values (pval) are obtained with a two-sided Fisher's Exact Test in R. Two-sided tests are recommended for GO enrichment analysis since in most cases both enrichment and depletion can be biologically meaningful (see reference).
The log2 odds ratio (L2OR) is the log2 of the test's estimate. P-values are then adjusted (padj) using Benjamini and Hochberg's False Discovery Rate.

The pt_da and pt_nda columns are added to indicate the percentage of overlap of the target with the Differential Analysis subset (DA) (pt_da) or non-DA (pt_nda) entries.
A gene enrichment type column is added for functional annotation enrichment, to specify the gene database used.
Results are sorted by adjusted pvalues (padj, descending order) and overlap of DA results (ov_da, ascending order).
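For the pt_da and pt_nda columns described above, one plausible reading is the share of DA (or non-DA) entries that overlap the target. The sketch below assumes exactly that, with pt_da = 100 * ov_da / tot_da; the pipeline's exact formula is not given here, and the helper name and numbers are hypothetical.

```python
def percent_overlap(ov, tot):
    """Percentage of entries overlapping the target (0.0 if there are none)."""
    return 100.0 * ov / tot if tot else 0.0


# Toy row (hypothetical counts).
row = {"tot_da": 200, "ov_da": 50, "tot_nda": 1800, "ov_nda": 90}
row["pt_da"] = percent_overlap(row["ov_da"], row["tot_da"])
row["pt_nda"] = percent_overlap(row["ov_nda"], row["tot_nda"])
print(row["pt_da"], row["pt_nda"])
```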

Finally, each element of the key (ET, PA, FC, TV, COMP) is split into a separate column in the table, as is the target (tgt).
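Because the key elements are joined with a double underscore, splitting them back into columns is straightforward. This is a sketch only: the helper `split_key` is hypothetical, it also returns the EC element, and it silently ignores anything after the sixth element (such as a trailing `enrich`).

```python
KEY_FIELDS = ("ET", "PA", "FC", "TV", "COMP", "EC")


def split_key(key):
    """Split a key on the double-underscore separator into named fields."""
    # zip stops at the shorter sequence, so extra trailing parts are dropped.
    return dict(zip(KEY_FIELDS, key.split("__")))


columns = split_key("ATAC__all__down__1000__hmg4_vs_ctl__func_anno_BP")
print(columns)
```

Single underscores inside an element, as in `hmg4_vs_ctl` or `func_anno_BP`, are preserved because only the double underscore acts as a separator.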

Parameters

  • params.motifs_test_type: The test to use for motif inputs. If 'binomial', a two-sided binomial test is performed instead of the two-sided Fisher's Exact Test. Options: 'binomial' or 'fischer' (any value other than 'binomial' selects Fisher's Exact Test). Default: 'binomial'.
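To show what a two-sided binomial test computes, here is a minimal Python sketch. It assumes the number of trials is tot_da, the number of successes is ov_da, and the expected success probability is the background rate ov_nda / tot_nda; this mapping is an assumption and not taken from the pipeline. The p-value sums the probabilities of all outcomes at most as likely as the observed one. The function name is hypothetical.

```python
from math import comb


def binomial_two_sided(k, n, p):
    """Two-sided binomial test p-value for k successes in n trials."""
    def pmf(i):
        return comb(n, i) * p**i * (1 - p) ** (n - i)

    p_obs = pmf(k)
    # Sum every outcome at most as likely as the observed one (small tolerance).
    pval = sum(q for q in map(pmf, range(n + 1)) if q <= p_obs * (1 + 1e-7))
    return min(1.0, pval)


# Toy counts (hypothetical): 8 of 10 DA regions carry the motif,
# against a background rate of 50 of 100 non-DA regions.
ov_da, tot_da, ov_nda, tot_nda = 8, 10, 50, 100
pval = binomial_two_sided(ov_da, tot_da, ov_nda / tot_nda)
print(pval)
```

This direct summation is only suitable for small counts, since the binomial coefficient is converted to a float; a statistics library would be preferable for real data.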

Outputs

  • Overlap tables:
    • Tables_Individual/3_Enrichment_Analysis/${EC}/${key}__enrich.{csv,xlsx}
    • Tables_Merged/3_Enrichment_Analysis/${EC}.{csv,xlsx}