Skip to content
mette.bentsen edited this page Feb 17, 2023 · 4 revisions

Configfile

To run the snakemake pipeline, we need to set up a configfile containing information on the input data and parameters for the analysis in yaml-format. Motifs can be downloaded from JASPAR first. The required keys/value pairs are:

data:
  Bcell: [data/Bcell_chr4.bam]                                   #list of .bam-files
  Tcell: [data/Tcell_chr4_1.bam, data/Tcell_chr4_2.bam]  #list of .bam-files
  
run_info:
  organism: human                        #mouse/human
  fasta: data/genome_chr4.fa.gz          #.fasta-file containing organism genome
  blacklist: data/blacklist_chr4.bed     #.bed-file containing blacklisted regions
  gtf: data/genes_chr4.gtf               #.gtf-file for annotation of peaks
  motifs: data/individual_motifs         #directory containing motifs (single files in MEME/JASPAR/PFM format)
  output: test_output                    #output directory 

Additionally, it is possible to set parameters for single methods, which will be appended to the command-line call for the given tool. Examples are:

macs: "--nomodel --shift -100 --extsize 200 --broad"
atacorrect: ""
footprinting: ""
bindetect: "--prefix immune"

If the additional key/values for tools are not set, the default is always "" (empty string).

Running the pipeline

After setting all necessary paths in the configfile, Snakemake can be run using:

$ conda activate TOBIAS_ENV
$ snakemake --configfile TOBIAS_example.config --cores [number of cores] --use-conda --keep-going

Pipeline workflow

The pipeline extracts all needed information from its config file. In the preprocessing part, peaks are called for each sample. To ensure that a condition comparison is possible, these peaks are merged across conditions. Afterwards, peaks are annotated using UROPA. Additionally, the condition .bam files are merged to be used in the subsequent analysis. In the analysis part, the condition .bam file(s) are be corrected by ATACorrect and investigated for binding affinity by FootprintScores. Using the PWM motifs as specified in the config, motif occurrences are determined and assigned with scores related to the FootprintScores by BINDetect. Beside textual results, the pipeline creates various visualizations such as aggregated footprint plots for comparison within and between conditions.

Output

The pipeline run will create a folder-structure containing results from each part of the analysis, with the upper folder being the output_folder given in the configfile:

<Output_folder>/

  • mapping/
    • Joined bamfile (per condition)
  • motifs/
    • all_motifs.txt: All input motifs joined to pfm format
  • peak_calling/
    • Peak calling results for each replicate + the joined peak sets
  • peak_annotation/
    • Annotation of merged peaks using UROPA.
  • bias_correction/
    • output from TOBIAS ATACorrect (per condition)
  • footprinting/
    • Output .bigwigs from TOBIAS FootprintScores (per condition)
  • TFBS/
    • Output from TOBIAS BINDetect
  • overview/
    • Overview files (such as joined pdf files and bindetect overview)
  • logs/
    • Individual logfiles for each tool

Detailed output description

peak_annotation

This directory includes results of the annotation with UROPA. Besides the original output files of the annotation tool, there is the all_merged_annotated.bed file. This is the processed output of UROPAs finalhits, including only a subset of the original columns.

TFBS directory

This directory includes results of TOBIAS BINDetect, PlotAggregate and PlotHeatmap. An overview in .txt and .xlsx format, e.g. NFATC2_MA0152.1_overview.txt, is located in the top directory of each TF. All overview tables are build up like this:

TFBS_chr TFBS_start TFBS_end TFBS_name TFBS_score TFBS_strand peak_chr peak_start peak_end peak_id feature feat_start feat_end feat_strand feat_anchor distance gene_biotype gene_id gene_name Tcell_score Bcell_score Tcell_bound Bcell_bound Tcell_Bcell_log2fc
chr4 49511177 49511184 NFATC2_MA0152.1 8,51223 - chr4 49510270 49511264 Tcell,Bcell gene 49507764 49508806 - start 1961 processed_pseudogene ENSG00000250769 AC119751.5 18,09185 1,89763 1 1 3,16718
chr4 99507937 99507944 NFATC2_MA0152.1 8,51223 - chr4 99507736 99508445 Tcell,Bcell gene 99511004 99542303 + start 2914 protein_coding ENSG00000138813 C4orf17 11,99971 1,27645 1 1 3,10753
chr4 40114339 40114346 NFATC2_MA0152.1 8,51223 + chr4 40114197 40114397 Tcell NA NA NA NA NA NA NA NA NA 1,75915 0,08972 1 0 3,09833
chr4 120549506 120549513 NFATC2_MA0152.1 8,51223 - chr4 120549407 120549732 Tcell NA NA NA NA NA NA NA NA NA 1,77557 0,10178 1 0 3,03405
chr4 122456529 122456536 NFATC2_MA0152.1 8,51223 - chr4 122456311 122456554 Tcell gene 122451470 122456725 - start 293 protein_coding ENSG00000109471 IL2 3,31647 0,29002 1 0 3,03364

The first six columns represent TFBScan results including the genomic location information, the name consisting of the TF name and id, the sequence match score and the strand on which the match was found. The middle columns represent the peak annotation information including their genomic location and annotation information, e.g. gene name, id and biotype. Sites without annotation are included but denoted with "NA". The last columns (dependent on the number of conditions) represent footprinting information. For the example data set, there were two conditions. Thus, there are five columns including one footprint score and one bound information for each condition, and one log2 fold change column. In the bound columns, a 1 indicates that this binding site was identified as bound for this condition. The log2 fold change is calculated in the direction as the conditions are specified in the config. In the example config, the order of conditions was [Tcell, Bcell], and therefore, a positive fold change indicates an increase in activity towards Tcell and vice versa.

Further textual and visual results are partitioned into a beds and plots sub directories. The beds directory includes the individual transcription factor binding sites separated for each condition to bound and unbound sites. As an example, given the example config file, the /bed/-directory for the transcription factor NFATC2 has the following files:

  • NFATC2_MA0152.1_all.bed - all binding sites within peaks
  • NFATC2_MA0152.1_Bcell_bound.bed - bound NFATC2 sites in Bcell condition
  • NFATC2_MA0152.1_Bcell_unbound.bed - unbound NFATC2 sites in Bcell condition
  • NFATC2_MA0152.1_Tcell_bound.bed - bound NFATC2 sites in Tcell condition
  • NFATC2_MA0152.1_Tcell_unbound.bed - bound NFATC2 sites in Tcell condition

These files are all in standard .bed-format and can be visualized with any genome browser, like IGV. But the format is similar to the overview files. The _all.bed file represents the TFBScan results plus the peak annotation details. The bound/unbound files show additionally the binding score of the represented condition.

The plots directory includes the PlotAggregate and PlotHeatmap results. There are two major visualizations per TF, a heatmap and an aggregated signal plot. The heatmap plot shows the signal of all TFBS and the aggregated plot summarizes these observations. If there are multiple conditions, there are additionally comparison plots for both, the heatmaps and the aggregated plots. Example files from this folder are:

  • NFATC2_MA0152.1_aggregate_comparison_all.pdf - Aggregated signal across all sites for all conditions
  • NFATC2_MA0152.1_aggregate_comparison_bound.pdf - Aggregated signal across bound sites for all conditions
  • NFATC2_MA0152.1_heatmap_comparison.pdf - Heatmaps of signals across bound/unbound sites for all conditions
  • NFATC2_MA0152.1_Tcell_aggregate.pdf - Aggregated signals across all/bound/unbound for uncorrected/expected/corrected signals within the T cell condition
  • NFATC2_MA0152.1_Bcell_aggregate.pdf - Aggregated signals across all/bound/unbound for uncorrected/expected/corrected signals within the B cell condition