eCLIP

eCLIP is a pipeline designed to identify genomic locations of RNA-bound proteins.

Description/methods:

(paired-end only) Demultiplexes paired-end reads using inline barcodes
Trims adapters & adapter-dimers with cutadapt
Maps to repeat elements with STAR and filter
Maps filtered reads to genome with STAR
Removes PCR-duplicates with umi_tools (single-end) or with a custom python script (barcodecollapsepe.py)
(paired-end only) Merges multiple inline barcodes and filters R1 (uses only R2 for peak calling)
Calls enriched peak regions (peak clusters) with CLIPPER
Uses size-matched input sample to normalize and calculate fold-change enrichment within enriched peak regions with custom perl scripts (overlap_peakfi_with_bam_PE.pl, compress_l2foldenrpeakfi_for_replicate_overlapping_bedformat.pl)

For a full description (including commandline args), you may refer to the Standard Operating Procedure

Explore the pipeline definition here:

Installation:

Hardware recommendations:

For human datasets, we recommend at least 8 cores (for Clipper) and 32G memory (for STAR). Conservatively, you should expect to have at least 200G in free disk space (this requirement including all inputs, indices, intermediates, and outputs).

The pipeline has been tested using the following softwares and their versions:

bedtools=2.27.1
clipper=5d865bb17b2bc6787b4c382bc857119ae917ad59
cutadapt=1.14
eclipdemux=0.0.1
fastqc=0.11.8
fastq-tools=0.8
perl=5.10.1
- Statistics::Basic 1.6611
- Statistics::Distributions 1.02
- Statistics::R 0.34
R=3.3.2
python=2.7.16
- pysam=0.15.4
- numpy=1.16.5
- seaborn
samtools=1.6
star=2.7.6a
ucsc-tools=377
umi_tools=1.0.0

Alternatively, you may refer to the dockerRequirements within each CWL document to pull the proper environments for each step.

Additional pipeline-specific requirements (minimal, one node w/ one or more cores):

cwlref-runner=1.0

Additional pipeline-specific requirements (for running in parallel on Torque/PBS-based clusters):

cwltool=1.0.20180306140409
cwltest=1.0.20180413145017
galaxy-lib=17.9.3
toil=3.15.0a1 (or higher, this is the minimum version required for Torque/PBS-based clusters)

Prerequisite files:

(make sure to place this in a location with plenty of space!):

Sequencing data (in fastq format). You may download our reference RBFOX2 HepG2 raw data here: RBFOX2
Genome STAR index directory (fasta files can be downloaded from UCSC; hg19)
Repeat element STAR index directory (We now recommend using the most up-to-date RepBase files corresponding to your species of interest, otherwise you may email us for an example human-based reference).
FASTA file containing barcodes for demultiplexing reads
- For paired-end data, use yeolabbarcodes_20170101.fasta
- For single-end data, use either a_adapters.fasta or the InvRNA*_adapters.fasta files, described below.
chrom.sizes file (tabbed file containing chromosome name and length, can be downloaded from UCSC; hg19)
Manifest YAML or JSON file describing paths of the above data
- For paired-end data, use this template
- For single-end data, use this template
Blacklist file containing potential artifact regions. These have been manually curated using ENCODE3 datasets and can be found here:
- hg19
- hg38

Description of the manifest

STAR indices:

speciesGenomeDir:
  class: Directory
  path: /path/to/stargenome

repeatElementGenomeDir:
  class: Directory
  path: /path/to/repeatelement

CLIPPER params:

species: hg19  # for supported species, see clipper docs

UMI & barcode params:

randomer_length: "5"  # (Paired-end only) length of the UMI assigned to each read. This may differ depending on the size of your randomer sequence.

barcodesfasta:  # (Paired-end only) This is a FASTA formatted file containing the barcodes we will use to demultiplex our FASTQ's:
  class: File
  path: /path/to/barcodes

Blacklist file:

blacklist_file: # (Single-end only) This is a BED6 file containing regions that will be excluded from the final peak outputs. Typically comprised of artifact regions such as tRNA/snoRNA/etc.)
  class: File
  path: /path/to/blacklist

The following YAML block describes the location paths of the forward (read1), reverse (read2) reads, and the barcodes required to demultiplex these reads for each sample.

Barcode names must match those described in the above barcodes.fasta file!

(For example, if you are using our standard paired-end barcodes here, make sure the barcodeids are one of: A01, A03, A04, B06, C01, D8f, F05, G07, X1A, X1B, X2A, X2B, or NIL for "inputs". Single-end protocols may not have inline barcodes. If this is the case, you will use the a_adapters.fasta. Else, SE protocols with inline barcodes will need the fasta file corresponding to the barcode in question:

We're showing two samples (2 replicates each) for a paired-end experiment described in this space. Each sample will be defined as indicated below each name: field. Make sure these names are unique per sample! They (and dataset name above) are used to determine the filename prefixes and non-unique IDs will override each other.

PE Data:

samples:
  -
    - ip_read:
      name: rep1_clip
      barcodeids: [A01, B06]
      read1:
        class: File
        path: /path/to/clip.fastq.gz
      read2:
        class: File
        path: /path/to/clip.fastq.gz
        
    - input_read:
      name: rep1_input
      barcodeids: [NIL, NIL]
      read1:
        class: File
        path: /path/to/clip.fastq.gz
      read2:
        class: File
        path: /path/to/clip.fastq.gz
  -
    - ip_read:
      name: rep2_clip
      barcodeids: [C01, D8f]
      read1:
        class: File
        path: /path/to/clip.fastq.gz
      read2:
        class: File
        path: /path/to/clip.fastq.gz

    - input_read:
      name: rep2_input
      barcodeids: [NIL, NIL]
      read1:
        class: File
        path: /path/to/clip.fastq.gz
      read2:
        class: File
        path: /path/to/clip.fastq.gz

SE Data:

samples:
  - 
    - ip_read:
      name: rep1_clip
      read1:
        class: File
        path: /path/to/clip.fastq.gz
      adapters:
        class: File
        path: inputs/InvRNA1_adapters.fasta

    - input_read:
      name: 4020_INPUT1
      read1:
        class: File
        path: /path/to/clip.fastq.gz
      adapters:
        class: File
        path: inputs/InvRNA5_adapters.fasta
  - 
    - ip_read:
      name: 4020_CLIP1
      read1:
        class: File
        path: /path/to/clip.fastq.gz
      adapters:
        class: File
        path: inputs/InvRNA1_adapters.fasta

    - input_read:
      name: 4020_INPUT1
      read1:
        class: File
        path: /path/to/clip.fastq.gz
      adapters:
        class: File
        path: inputs/InvRNA5_adapters.fasta

Running the data with required arguments:

Assuming you have:

Downloaded and generated the relevant STAR indices
Installed CWL
Installed Docker (or alternatively, verified relevant binaries are located in your $PATH)
Ensured the relevant files are locatable in your $PATH (eclip/bin:eclip/cwl:eclip/wf)

You can run the pipeline using one of our wrappers in (wf/):

./paired_end_clip.yaml
./single_end_clip.yaml

Or, run the workflow using cwl in its native context:

cwltool wf_get_peaks_pe_scatter.cwl paired_end_clip.yaml
cwltool wf_get_peaks_se_scatter.cwl single_end_clip.yaml

Running on a complete dataset takes about a day for human ENCODE data (24 hours), so sit back and relax by reading the rest of this README.

Outputs:

Input-normalized peaks will contain candidate binding regions.

For Single-end eCLIP, you can expect outputs to follow this filestructure:

Dataset: "myRBP" name: "IP"	eCLIP-0.2.2	eCLIP-0.3.0+
Cutadapt x1 metrics	`myRBP.IP.umi.r1Tr.metrics`	`myRBP.IP.umi.r1.fqTr.metrics`
Cutadapt x2 metrics	`myRBP.IP.umi.r1TrTr.metrics`	`myRBP.IP.umi.r1.fqTrTr.metrics`
Demuxed + adapter trimmed reads	`myRBP.IP.umi.r1TrTr.fq`	`myRBP.IP.umi.r1TrTr.fq`
Repetitive element filtered reads	`myRBP.IP.umi.r1TrTr.sorted.STARUnmapped.out.sorted.fq`	`myRBP.IP.umi.r1.fq.repeat-unmapped.sorted.fq.gz`
STAR metrics (repeat aligned)	`myRBP.IP.umi.r1TrTr.sorted.STARLog.final.out`	`myRBP.IP.umi.r1.fqTrTr.sorted.STARLog.final.out`
Unique genome aligned reads (sorted)	`myRBP.IP.umi.r1TrTr.sorted.STARUnmapped.out.sorted.STARAligned.outSo.bam`	`myRBP.IP.umi.r1.fq.genome-mappedSoSo.bam`
STAR metrics (genome aligned)	`myRBP.IP.umi.r1TrTr.sorted.STARUnmapped.out.sorted.STARLog.final.out`	`myRBP.IP.umi.r1.fq.repeat-unmapped.sorted.STARLog.final.out`
PCR duplicate removed aligned reads	`myRBP.IP.umi.r1TrTr.sorted.STARUnmapped.out.sorted.STARAligned.outSo.rmDupSo.bam`	`myRBP.IP.umi.r1.fq.genome-mappedSoSo.rmDupSo.bam`
CLIPper peaks	`myRBP.IP.umi.r1TrTr.sorted.STARUnmapped.out.sorted.STARAligned.outSo.rmDupSo.peakClusters.bed`	`myRBP.IP.umi.r1.fq.genome-mappedSoSo.rmDupSo.peakClusters.bed`
Input-normalized peaks	`myRBP.IP.umi.r1TrTr.sorted.STARUnmapped.out.sorted.STARAligned.outSo.rmDupSo.peakClusters.normed.compressed.bed`	`myRBP.IP.umi.r1.fq.genome-mappedSoSo.rmDupSo.peakClusters.normed.compressed.bed`
RPM-normalized BigWig files	`myRBP.IP.umi.r1TrTr.sorted.STARUnmapped.out.sorted.STARAligned.outSo.rmDupSo.norm.*.bw`	`myRBP.IP.umi.r1.fq.genome-mappedSoSo.rmDupSo.norm.*.bw`
Blacklist-filtered peaks		`myRBP.IP.umi.r1.fq.genome-mappedSoSo.rmDupSo.peakClusters.normed.compressed.sorted.blacklist-removed.bed`
Blacklist-filtered bigBeds		`myRBP.IP.umi.r1.fq.genome-mappedSoSo.rmDupSo.peakClusters.normed.compressed.sorted.blacklist-removed.fx.bb`
Blacklist-filtered narrowPeaks		`myRBP.IP.umi.r1.fq.genome-mappedSoSo.rmDupSo.peakClusters.normed.compressed.sorted.blacklist-removed.narrowPeak`
made with: https://www.tablesgenerator.com/markdown_tables

For Paired-end eCLIP:

	eCLIP 0.2.x	eCLIP GATK	eCLIP 0.3+
Demuxed + adapter trimmed reads	`*.CLIP.barcode.r1TrTr.fq`	`RBFOX2-204-CLIP_S1_R*.A01_204_01_RBFOX2.adapterTrim.round2.fastq.gz`	`204.01_RBFOX2.A01.r*.fqTrTr.fqgz`
Repetitive element filtered reads	`*.CLIP.barcode.r1.fqTrTr.sorted.STARUnmapped.out.sorted.fq`	`RBFOX2-204-CLIP_S1_R1.A01_204_01_RBFOX2.adapterTrim.round2.rep.bamUnmapped.out.mate*`	`204.01_RBFOX2.A01.r*.fqTrTr.repeat-unmapped.sorted.fq.gz`
Unique genome aligned reads	`*.CLIP.barcode.r1TrTr.sorted.STARUnmapped.out.sorted.STARAligned.outSo.bam`	`RBFOX2-204-CLIP_S1_R1.A01_204_01_RBFOX2.adapterTrim.round2.rmRep.bam`	`204.01_RBFOX2.A01.r1.fq.genome-mappedSo.bam`
PCR duplicate removed aligned reads	`*.CLIP.barcode.r1TrTr.sorted.STARUnmapped.out.sorted.STARAligned.outSo.rmDupSo.bam`	`RBFOX2-204-CLIP_S1_R1.A01_204_01_RBFOX2.adapterTrim.round2.rmRep.rmDup.sorted.bam`	`204.01_RBFOX2.A01.r1.fq.genome-mappedSo.rmDupSo.bam`
Barcode merged alignments	`*.CLIP.barcode.r1.fqTrTr.sorted.STARUnmapped.out.sorted.STARAligned.outSo.rmDupSo.merged.r2.bam`	`204_01_RBFOX2.merged.r2.bam`	`204.01_RBFOX2.A01.r1.fq.genome-mappedSo.rmDupSo.merged.r2.bam`
CLIPper peaks	`*.CLIP.barcode.r1TrTr.sorted.STARUnmapped.out.sorted.STARAligned.outSo.rmDupSo.peakClusters.bed`	`204_01_RBFOX2.merged.r2.peaks.bed`	`204.01_RBFOX2.A01.r1.fq.genome-mappedSo.rmDupSo.merged.r2.peakClusters.bed`
Input-normalized peaks	`*.CLIP.barcode.r1TrTr.sorted.STARUnmapped.out.sorted.STARAligned.outSo.rmDupSo.peakClusters.normed.compressed.bed`	`204_01.basedon_204_01.peaks.l2inputnormnew.bed.compressed.bed`	`204.01_RBFOX2.A01.r1.fq.genome-mappedSo.rmDupSo.merged.r2.peakClusters.normed.compressed.bed`

Notes regarding outputs (FAQ):

When going through the merged BAM file results, I can only find files with only one of the paired barcodes (e.g. A01 of A01/B06). Is this normal? Yes, *.merged*.bam indicates that both barcodes have been merged, I just use the first as a prefix namespace for the next step.
Regarding the narrowPeak files - The scores in the 5th column are almost always equal to 1000 or 200, where do these values come from and what do they represent? 200/1000 flags each peak as significant (1000) or not (200) visually on the genome browser
Regarding

References:

Van Nostrand, Eric L., et al. "Robust, Cost-Effective Profiling of RNA Binding Protein Targets with Single-end Enhanced Crosslinking and Immunoprecipitation (seCLIP)." mRNA Processing. Methods Mol Biol. 2017;1648:177-200.

Van Nostrand, E.L., Pratt, G.A., Shishkin, A.A., Gelboin-Burkhart, C., Fang, M.Y., Sundararaman, B., Blue, S.M., Nguyen, T.B., Surka, C., Elkins, K. and Stanton, R. "Robust transcriptome-wide discovery of RNA-binding protein binding sites with enhanced CLIP (eCLIP)." Nature methods 13.6 (2016): 508-514.

Amstutz, Peter; Crusoe, Michael R.; Tijanić, Nebojša; Chapman, Brad; Chilton, John; Heuer, Michael; Kartashov, Andrey; Leehr, Dan; Ménager, Hervé; Nedeljkovich, Maya; Scales, Matt; Soiland-Reyes, Stian; Stojanovic, Luka (2016): Common Workflow Language, v1.0. figshare. https://doi.org/10.6084/m9.figshare.3115156.v2 Retrieved: 22 13, May 11, 2017 (GMT)

Kurtzer GM, Sochat V, Bauer MW (2017): Singularity: Scientific containers for mobility of compute. PLoS ONE 12(5): e0177459. https://doi.org/10.1371/journal.pone.0177459

Name		Name	Last commit message	Last commit date
Latest commit History 106 Commits
bin		bin
cwl		cwl
documentation		documentation
example		example
tests/eCLIP-0.7.0		tests/eCLIP-0.7.0
wf		wf
wf_debug		wf_debug
.gitignore		.gitignore
CHANGELOG.md		CHANGELOG.md
ECLIP-VERSION-0.7.0		ECLIP-VERSION-0.7.0
LICENSE		LICENSE
README.md		README.md
eCLIP-flowchart.png		eCLIP-flowchart.png

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Repository files navigation

eCLIP

Description/methods:

Installation:

Hardware recommendations:

The pipeline has been tested using the following softwares and their versions:

Additional pipeline-specific requirements (minimal, one node w/ one or more cores):

Additional pipeline-specific requirements (for running in parallel on Torque/PBS-based clusters):

Prerequisite files:

Description of the manifest

Running the data with required arguments:

Outputs:

Notes regarding outputs (FAQ):

References:

About

Releases 1

Packages

Contributors 3

Languages

License

YeoLab/eCLIP

Folders and files

Latest commit

History

Repository files navigation

eCLIP

Description/methods:

Installation:

Hardware recommendations:

The pipeline has been tested using the following softwares and their versions:

Additional pipeline-specific requirements (minimal, one node w/ one or more cores):

Additional pipeline-specific requirements (for running in parallel on Torque/PBS-based clusters):

Prerequisite files:

Description of the manifest

Running the data with required arguments:

Outputs:

Notes regarding outputs (FAQ):

References:

About

Resources

License

Stars

Watchers

Forks

Releases 1

Packages 0

Contributors 3

Languages

Packages