eCLIP is a pipeline designed to identify genomic locations of RNA-bound proteins.
- (paired-end only) Demultiplexes paired-end reads using inline barcodes
- Trims adapters & adapter-dimers with cutadapt
- Maps to repeat elements with STAR and filter
- Maps filtered reads to genome with STAR
- Removes PCR-duplicates with umi_tools (single-end) or with a custom python script (barcodecollapsepe.py)
- (paired-end only) Merges multiple inline barcodes and filters R1 (uses only R2 for peak calling)
- Calls enriched peak regions (peak clusters) with CLIPPER
- Uses size-matched input sample to normalize and calculate fold-change enrichment within enriched peak regions with custom perl scripts (overlap_peakfi_with_bam_PE.pl, compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl)
For a full description (including commandline args), you may refer to the Standard Operating Procedure
Explore the pipeline definition here:
For human datasets, we recommend at least 8 cores (for Clipper) and 32G memory (for STAR). Conservatively, you should expect to have at least 200G in free disk space (this requirement including all inputs, indices, intermediates, and outputs).
- bedtools=2.27.1
- clipper=5d865bb17b2bc6787b4c382bc857119ae917ad59
- cutadapt=1.14
- eclipdemux=0.0.1
- fastqc=0.11.8
- fastq-tools=0.8
- perl=5.10.1
- Statistics::Basic 1.6611
- Statistics::Distributions 1.02
- Statistics::R 0.34
- R=3.3.2
- python=2.7.16
- pysam=0.15.4
- numpy=1.16.5
- seaborn
- samtools=1.6
- star=2.7.6a
- ucsc-tools=377
- umi_tools=1.0.0
Alternatively, you may refer to the dockerRequirements within each CWL document to pull the proper environments for each step.
- cwltool=1.0.20180306140409
- cwltest=1.0.20180413145017
- galaxy-lib=17.9.3
- toil=3.15.0a1 (or higher, this is the minimum version required for Torque/PBS-based clusters)
(make sure to place this in a location with plenty of space!):
- Sequencing data (in fastq format). You may download our reference RBFOX2 HepG2 raw data here: RBFOX2
- Genome STAR index directory (fasta files can be downloaded from UCSC; hg19)
- Repeat element STAR index directory (We now recommend using the most up-to-date RepBase files corresponding to your species of interest, otherwise you may email us for an example human-based reference).
- FASTA file containing barcodes for demultiplexing reads
- For paired-end data, use yeolabbarcodes_20170101.fasta
- For single-end data, use either a_adapters.fasta or the
InvRNA*_adapters.fasta
files, described below.
- chrom.sizes file (tabbed file containing chromosome name and length, can be downloaded from UCSC; hg19)
- Manifest YAML or JSON file describing paths of the above data
- For paired-end data, use this template
- For single-end data, use this template
- Blacklist file containing potential artifact regions. These have been manually curated using ENCODE3 datasets and can be found here:
STAR indices:
speciesGenomeDir:
class: Directory
path: /path/to/stargenome
repeatElementGenomeDir:
class: Directory
path: /path/to/repeatelement
CLIPPER params:
species: hg19 # for supported species, see clipper docs
UMI & barcode params:
randomer_length: "5" # (Paired-end only) length of the UMI assigned to each read. This may differ depending on the size of your randomer sequence.
barcodesfasta: # (Paired-end only) This is a FASTA formatted file containing the barcodes we will use to demultiplex our FASTQ's:
class: File
path: /path/to/barcodes
Blacklist file:
blacklist_file: # (Single-end only) This is a BED6 file containing regions that will be excluded from the final peak outputs. Typically comprised of artifact regions such as tRNA/snoRNA/etc.)
class: File
path: /path/to/blacklist
The following YAML block describes the location paths of the forward (read1), reverse (read2) reads, and the barcodes required to demultiplex these reads for each sample.
Barcode names must match those described in the above barcodes.fasta file!
(For example, if you are using our standard paired-end barcodes here, make sure the barcodeids are one of: A01, A03, A04, B06, C01, D8f, F05, G07, X1A, X1B, X2A, X2B, or NIL for "inputs". Single-end protocols may not have inline barcodes. If this is the case, you will use the a_adapters.fasta. Else, SE protocols with inline barcodes will need the fasta file corresponding to the barcode in question:
- InvRNA1_adapters.fasta
- InvRNA2_adapters.fasta
- InvRNA3_adapters.fasta
- InvRNA4_adapters.fasta
- InvRNA5_adapters.fasta
- InvRNA6_adapters.fasta
- InvRNA7_adapters.fasta
- InvRNA8_adapters.fasta
We're showing two samples (2 replicates each) for a paired-end experiment described in this space.
Each sample will be defined as indicated below each name:
field.
Make sure these names are unique per sample! They (and dataset name above)
are used to determine the filename prefixes and non-unique IDs will override each other.
PE Data:
samples:
-
- ip_read:
name: rep1_clip
barcodeids: [A01, B06]
read1:
class: File
path: /path/to/clip.fastq.gz
read2:
class: File
path: /path/to/clip.fastq.gz
- input_read:
name: rep1_input
barcodeids: [NIL, NIL]
read1:
class: File
path: /path/to/clip.fastq.gz
read2:
class: File
path: /path/to/clip.fastq.gz
-
- ip_read:
name: rep2_clip
barcodeids: [C01, D8f]
read1:
class: File
path: /path/to/clip.fastq.gz
read2:
class: File
path: /path/to/clip.fastq.gz
- input_read:
name: rep2_input
barcodeids: [NIL, NIL]
read1:
class: File
path: /path/to/clip.fastq.gz
read2:
class: File
path: /path/to/clip.fastq.gz
SE Data:
samples:
-
- ip_read:
name: rep1_clip
read1:
class: File
path: /path/to/clip.fastq.gz
adapters:
class: File
path: inputs/InvRNA1_adapters.fasta
- input_read:
name: 4020_INPUT1
read1:
class: File
path: /path/to/clip.fastq.gz
adapters:
class: File
path: inputs/InvRNA5_adapters.fasta
-
- ip_read:
name: 4020_CLIP1
read1:
class: File
path: /path/to/clip.fastq.gz
adapters:
class: File
path: inputs/InvRNA1_adapters.fasta
- input_read:
name: 4020_INPUT1
read1:
class: File
path: /path/to/clip.fastq.gz
adapters:
class: File
path: inputs/InvRNA5_adapters.fasta
Assuming you have:
- Downloaded and generated the relevant STAR indices
- Installed CWL
- Installed Docker (or alternatively, verified relevant binaries are located in your $PATH)
- Ensured the relevant files are locatable in your $PATH (eclip/bin:eclip/cwl:eclip/wf)
You can run the pipeline using one of our wrappers in (wf/):
./paired_end_clip.yaml
./single_end_clip.yaml
Or, run the workflow using cwl in its native context:
cwltool wf_get_peaks_pe_scatter.cwl paired_end_clip.yaml
cwltool wf_get_peaks_se_scatter.cwl single_end_clip.yaml
Running on a complete dataset takes about a day for human ENCODE data (24 hours), so sit back and relax by reading the rest of this README.
Input-normalized peaks will contain candidate binding regions.
For Single-end eCLIP, you can expect outputs to follow this filestructure:
Dataset: "myRBP" name: "IP" | eCLIP-0.2.2 | eCLIP-0.3.0+ |
---|---|---|
Cutadapt x1 metrics | myRBP.IP.umi.r1Tr.metrics |
myRBP.IP.umi.r1.fqTr.metrics |
Cutadapt x2 metrics | myRBP.IP.umi.r1TrTr.metrics |
myRBP.IP.umi.r1.fqTrTr.metrics |
Demuxed + adapter trimmed reads | myRBP.IP.umi.r1TrTr.fq |
myRBP.IP.umi.r1TrTr.fq |
Repetitive element filtered reads | myRBP.IP.umi.r1TrTr.sorted.STARUnmapped.out.sorted.fq |
myRBP.IP.umi.r1.fq.repeat-unmapped.sorted.fq.gz |
STAR metrics (repeat aligned) | myRBP.IP.umi.r1TrTr.sorted.STARLog.final.out |
myRBP.IP.umi.r1.fqTrTr.sorted.STARLog.final.out |
Unique genome aligned reads (sorted) | myRBP.IP.umi.r1TrTr.sorted.STARUnmapped.out.sorted.STARAligned.outSo.bam |
myRBP.IP.umi.r1.fq.genome-mappedSoSo.bam |
STAR metrics (genome aligned) | myRBP.IP.umi.r1TrTr.sorted.STARUnmapped.out.sorted.STARLog.final.out |
myRBP.IP.umi.r1.fq.repeat-unmapped.sorted.STARLog.final.out |
PCR duplicate removed aligned reads | myRBP.IP.umi.r1TrTr.sorted.STARUnmapped.out.sorted.STARAligned.outSo.rmDupSo.bam |
myRBP.IP.umi.r1.fq.genome-mappedSoSo.rmDupSo.bam |
CLIPper peaks | myRBP.IP.umi.r1TrTr.sorted.STARUnmapped.out.sorted.STARAligned.outSo.rmDupSo.peakClusters.bed |
myRBP.IP.umi.r1.fq.genome-mappedSoSo.rmDupSo.peakClusters.bed |
Input-normalized peaks | myRBP.IP.umi.r1TrTr.sorted.STARUnmapped.out.sorted.STARAligned.outSo.rmDupSo.peakClusters.normed.compressed.bed |
myRBP.IP.umi.r1.fq.genome-mappedSoSo.rmDupSo.peakClusters.normed.compressed.bed |
RPM-normalized BigWig files | myRBP.IP.umi.r1TrTr.sorted.STARUnmapped.out.sorted.STARAligned.outSo.rmDupSo.norm.*.bw |
myRBP.IP.umi.r1.fq.genome-mappedSoSo.rmDupSo.norm.*.bw |
Blacklist-filtered peaks | myRBP.IP.umi.r1.fq.genome-mappedSoSo.rmDupSo.peakClusters.normed.compressed.sorted.blacklist-removed.bed |
|
Blacklist-filtered bigBeds | myRBP.IP.umi.r1.fq.genome-mappedSoSo.rmDupSo.peakClusters.normed.compressed.sorted.blacklist-removed.fx.bb |
|
Blacklist-filtered narrowPeaks | myRBP.IP.umi.r1.fq.genome-mappedSoSo.rmDupSo.peakClusters.normed.compressed.sorted.blacklist-removed.narrowPeak |
|
made with: https://www.tablesgenerator.com/markdown_tables |
For Paired-end eCLIP:
eCLIP 0.2.x | eCLIP GATK | eCLIP 0.3+ | |
---|---|---|---|
Demuxed + adapter trimmed reads | *.CLIP.barcode.r1TrTr.fq |
RBFOX2-204-CLIP_S1_R*.A01_204_01_RBFOX2.adapterTrim.round2.fastq.gz |
204.01_RBFOX2.A01.r*.fqTrTr.fqgz |
Repetitive element filtered reads | *.CLIP.barcode.r1.fqTrTr.sorted.STARUnmapped.out.sorted.fq |
RBFOX2-204-CLIP_S1_R1.A01_204_01_RBFOX2.adapterTrim.round2.rep.bamUnmapped.out.mate* |
204.01_RBFOX2.A01.r*.fqTrTr.repeat-unmapped.sorted.fq.gz |
Unique genome aligned reads | *.CLIP.barcode.r1TrTr.sorted.STARUnmapped.out.sorted.STARAligned.outSo.bam |
RBFOX2-204-CLIP_S1_R1.A01_204_01_RBFOX2.adapterTrim.round2.rmRep.bam |
204.01_RBFOX2.A01.r1.fq.genome-mappedSo.bam |
PCR duplicate removed aligned reads | *.CLIP.barcode.r1TrTr.sorted.STARUnmapped.out.sorted.STARAligned.outSo.rmDupSo.bam |
RBFOX2-204-CLIP_S1_R1.A01_204_01_RBFOX2.adapterTrim.round2.rmRep.rmDup.sorted.bam |
204.01_RBFOX2.A01.r1.fq.genome-mappedSo.rmDupSo.bam |
Barcode merged alignments | *.CLIP.barcode.r1.fqTrTr.sorted.STARUnmapped.out.sorted.STARAligned.outSo.rmDupSo.merged.r2.bam |
204_01_RBFOX2.merged.r2.bam |
204.01_RBFOX2.A01.r1.fq.genome-mappedSo.rmDupSo.merged.r2.bam |
CLIPper peaks | *.CLIP.barcode.r1TrTr.sorted.STARUnmapped.out.sorted.STARAligned.outSo.rmDupSo.peakClusters.bed |
204_01_RBFOX2.merged.r2.peaks.bed |
204.01_RBFOX2.A01.r1.fq.genome-mappedSo.rmDupSo.merged.r2.peakClusters.bed |
Input-normalized peaks | *.CLIP.barcode.r1TrTr.sorted.STARUnmapped.out.sorted.STARAligned.outSo.rmDupSo.peakClusters.normed.compressed.bed |
204_01.basedon_204_01.peaks.l2inputnormnew.bed.compressed.bed |
204.01_RBFOX2.A01.r1.fq.genome-mappedSo.rmDupSo.merged.r2.peakClusters.normed.compressed.bed |
- When going through the merged BAM file results, I can only find files with only one of the paired barcodes (e.g. A01 of A01/B06). Is this normal? Yes,
*.merged*.bam
indicates that both barcodes have been merged, I just use the first as a prefix namespace for the next step. - Regarding the narrowPeak files - The scores in the 5th column are almost always equal to 1000 or 200, where do these values come from and what do they represent? 200/1000 flags each peak as significant (1000) or not (200) visually on the genome browser
- Regarding
Van Nostrand, Eric L., et al. "Robust, Cost-Effective Profiling of RNA Binding Protein Targets with Single-end Enhanced Crosslinking and Immunoprecipitation (seCLIP)." mRNA Processing. Methods Mol Biol. 2017;1648:177-200.
Van Nostrand, E.L., Pratt, G.A., Shishkin, A.A., Gelboin-Burkhart, C., Fang, M.Y., Sundararaman, B., Blue, S.M., Nguyen, T.B., Surka, C., Elkins, K. and Stanton, R. "Robust transcriptome-wide discovery of RNA-binding protein binding sites with enhanced CLIP (eCLIP)." Nature methods 13.6 (2016): 508-514.
Amstutz, Peter; Crusoe, Michael R.; Tijanić, Nebojša; Chapman, Brad; Chilton, John; Heuer, Michael; Kartashov, Andrey; Leehr, Dan; Ménager, Hervé; Nedeljkovich, Maya; Scales, Matt; Soiland-Reyes, Stian; Stojanovic, Luka (2016): Common Workflow Language, v1.0. figshare. https://doi.org/10.6084/m9.figshare.3115156.v2 Retrieved: 22 13, May 11, 2017 (GMT)
Kurtzer GM, Sochat V, Bauer MW (2017): Singularity: Scientific containers for mobility of compute. PLoS ONE 12(5): e0177459. https://doi.org/10.1371/journal.pone.0177459