nf-core-hic is a bioinformatics best-practice analysis pipeline for Hi-C and Capture-C data. It is optimized for large-scale analysis on High Performance Computing clusters (HPCs) and in cloud computing environments (e.g. AWS), and it can also be executed in hybrid environments (e.g. an LSF/AWS hybrid run).
The pipeline is built using Nextflow, a workflow tool that runs tasks across multiple compute infrastructures in a very portable manner. It uses Docker/Singularity containers, making installation trivial and results highly reproducible. This workflow is based on the nf-core template.
- fastq2pair (per library):
  - Preprocessing (`fastp`) >> `library.html` & `library.json`
  - Alignment (`bwa`) >> `library.cram` / `library.cram.crai`
  - Extract ligation junctions (`pairtools`)
  - Remove PCR/optical duplicates (`pairtools`) >> `library.pairs.gz` & `library.dedup.stats.txt`
  - Make pairs cram file (`pairtools` & `samtools`) >> `library.pairs.cram` / `library.pairs.cram.crai`
- These steps are based on this Dovetail tutorial. Check the link for more details (a hedged sketch of how such commands chain together is shown below).
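For orientation, here is a minimal sketch of how these per-library steps can be chained on the command line, loosely following the Dovetail tutorial approach. The file names and exact flags are illustrative assumptions (the option values shown match the parameter defaults documented below), not the pipeline's actual commands:

```bash
# Illustrative only -- the pipeline's real commands and flags may differ.
fastp -q 15 -i lib_R1.fastq.gz -I lib_R2.fastq.gz --stdout \
      --html library.html --json library.json \
  | bwa mem -5SP -p genome.fa - \
  | pairtools parse --min-mapq 1 --walks-policy 5unique \
      --max-inter-align-gap 30 --chroms-path genome.chromsizes \
  | pairtools sort \
  | pairtools dedup --max-mismatch 1 --output-stats library.dedup.stats.txt \
  | pairtools split --output-pairs library.pairs.gz --output-sam - \
  | samtools sort -O cram --reference genome.fa -o library.pairs.cram
samtools index library.pairs.cram   # -> library.pairs.cram.crai
```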
- Merge all `library.pairs.gz` & `library.pairs.cram` files for the libraries of each individual sample >> `sample.pairs.gz` & `sample.pairs.cram`
- Make `.mcool` file (`cooler`) >> `sample.mcool` (a hedged `cooler` sketch follows)
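The `.mcool` step can be pictured as binning the merged pairs at the finest resolution and then aggregating to coarser zoom levels. A hedged sketch with `cooler` (paths, bin size, and column indices are illustrative assumptions based on the standard `.pairs` column layout):

```bash
# Illustrative only: bin pairs at the minimum resolution, then build coarser zoom levels.
cooler cload pairs -c1 2 -p1 3 -c2 4 -p2 5 \
    genome.chromsizes:5000 sample.pairs.gz sample.5000.cool
cooler zoomify --resolutions 1000000,500000,250000,100000,50000,20000,10000,5000 \
    -o sample.mcool sample.5000.cool
```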
- Initial steps similar to the Hi-C workflow (steps 1-3)
- QC for capture (coverage of bait regions)
- Make a bam file compatible with the CHiCAGO algorithm (`samtools`); see the sketch below
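As a rough illustration of the last two steps, bait-coverage QC and a CHiCAGO-compatible bam could look like the following. Both commands are assumptions for illustration, not the pipeline's exact invocations:

```bash
# Illustrative only: coverage over bait regions (requires an indexed cram),
# then a name-sorted bam, which CHiCAGO-style tools expect.
samtools bedcov --reference genome.fa baits.bed library.cram > baits.coverage.txt
samtools sort -n -O bam --reference genome.fa \
    -o sample.chicago.bam sample.pairs.cram
```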
- This workflow is intended to check library complexity from shallow-depth sequencing as a QC step before deep sequencing. It is based on this Dovetail tutorial.
- fastq2pair (per library): same steps as in the Hi-C and Capture-C workflows.
- Estimate library complexity (`preseq`) >> `sample.preseq.txt`. For interpretation of these results, refer to the Dovetail tutorial. (An assumed example invocation is sketched below.)
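A hedged example of what the `preseq` step could look like; the input file name is an assumption, and `lc_extrap` expects a sorted alignment file:

```bash
# Illustrative only: extrapolate library complexity from a shallow-depth alignment.
preseq lc_extrap -bam -pe -output sample.preseq.txt sample.sorted.bam
```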
- Install Nextflow (>=22.10.1)
- Install any of Docker, Singularity (you can follow this tutorial), Podman, Shifter or Charliecloud for full pipeline reproducibility (this pipeline can NOT be run with Conda). This requirement does not apply when running the pipeline on the WashU RIS cluster. The pipeline has also been successfully tested on Amazon Web Services (AWS); for details on how to run Nextflow pipelines on AWS, refer to the Nextflow documentation and to this excellent tutorial.
- Download the pipeline and test it on a minimal dataset with a single command:
```bash
nextflow run dhslab/nf-core-hic -profile test,YOURPROFILE(S) --outdir <OUTDIR>
```
- Start running your own analysis!
- Hi-C workflow (default):

```bash
nextflow run dhslab/nf-core-hic -r dev -latest \
    -profile YOURPROFILE(S) \
    --input <SAMPLESHEET> \
    --fasta <FASTA> \
    --bwa_index <INDEX_PREFIX> \
    --chromsizes <CHROMSIZES> \
    --genome <GENOME_NAME> \
    --outdir <OUTDIR>
```

- Capture-C workflow:

```bash
nextflow run dhslab/nf-core-hic -r dev -latest \
    -entry capture \
    -profile YOURPROFILE(S) \
    --input <SAMPLESHEET> \
    --fasta <FASTA> \
    --bwa_index <INDEX_PREFIX> \
    --chromsizes <CHROMSIZES> \
    --genome <GENOME_NAME> \
    --baits_bed <BAITS_BED> \
    --outdir <OUTDIR>
```

- QC workflow:

```bash
nextflow run dhslab/nf-core-hic -r dev -latest \
    -entry qc \
    -profile YOURPROFILE(S) \
    --input <SAMPLESHEET> \
    --fasta <FASTA> \
    --bwa_index <INDEX_PREFIX> \
    --chromsizes <CHROMSIZES> \
    --genome <GENOME_NAME> \
    --outdir <OUTDIR>
```
- Any number of profiles/config files can be used; just consider how configuration priorities are resolved in Nextflow, as documented here.
- Input `samplesheet.csv`, which provides paths for `fastq1`, `fastq2` raw reads and their metadata (`id`, `sample`, `library`, `flowcell`). This can be provided either in a configuration file or as the `--input path/to/samplesheet.csv` command-line parameter. An example sheet is located in `assets/samplesheet.csv` (a hypothetical example is sketched after this list).
- Genome fasta, either in a configuration file or as the `--fasta path/to/genome.fasta` command-line parameter.
- BWA index, either in a configuration file or as the `--bwa_index path/to/bwa_index/with_prefix` command-line parameter. It is important to provide the full path, including the index prefix.
- Chromosome sizes file, either in a configuration file or as the `--chromsizes path/to/chromsizes` command-line parameter.
- Genome name (e.g. hg38), either in a configuration file or as the `--genome <GENOME_NAME>` command-line parameter.
- Capture-C baits bed file (only in the Capture-C workflow), either in a configuration file or as the `--baits_bed path/to/baits_bed` command-line parameter.
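For illustration, a hypothetical samplesheet might look like the following; the column order and values are assumptions, and `assets/samplesheet.csv` in the repository is the authoritative example:

```bash
# Hypothetical samplesheet for illustration; see assets/samplesheet.csv for the real format.
cat > samplesheet.csv <<'EOF'
id,sample,library,flowcell,fastq1,fastq2
TESTA_FC1,TEST,TESTA,FC1,/data/TESTA_R1.fastq.gz,/data/TESTA_R2.fastq.gz
TESTB_FC1,TEST,TESTB,FC1,/data/TESTB_R1.fastq.gz,/data/TESTB_R2.fastq.gz
EOF
```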
The following parameters are set to the defaults shown below, but they can be modified when required on the command line or in user-provided config files:
| Parameter | Description | Type | Default |
|---|---|---|---|
| `trim_qual` | `fastp -q` option: the minimum quality value for a base to be qualified | integer | 15 |
| Parameter | Description | Type | Default |
|---|---|---|---|
| `parsemq` | `pairtools parse --min-mapq` option: the minimal MAPQ score to consider a read as uniquely mapped | integer | 1 |
| `parse_walks_policy` | `pairtools parse --walks-policy` option; see the pairtools documentation for details | string | 5unique |
| `parse_max_gap` | `pairtools parse --max-inter-align-gap` option; see the pairtools documentation for details | integer | 30 |
| `max_mismatch` | `pairtools dedup --max-mismatch` option: pairs with both sides mapped within this distance (bp) of each other are considered duplicates | integer | 1 |
| Parameter | Description | Type | Default |
|---|---|---|---|
| `resolutions` | `cooler zoomify --resolutions` option: comma-separated list of target resolutions | string | 1000000,500000,250000,100000,50000,20000,10000,5000 |
| `min_res` | Minimum resolution for the mcool file (from the resolutions list provided) | integer | 5000 |
| `mcool_mapq_threshold` | MAPQ threshold(s) used when generating the mcool file(s); one mcool is produced per threshold (e.g. `sample.mapq_1.mcool`, `sample.mapq_30.mcool`) | string | 1 30 |
| Parameter | Description | Type | Default |
|---|---|---|---|
| `baits_bed` | Bed file for regions targeted by capture baits; required only for the Capture-C workflow | string | None |
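Any of these defaults can be overridden per run, for example (parameter names from the tables above, values illustrative):

```bash
# Illustrative only: override selected defaults on the command line.
# Required genome inputs (--fasta, --bwa_index, --chromsizes, --genome) omitted for brevity.
nextflow run dhslab/nf-core-hic -r dev -latest \
    -profile YOURPROFILE(S) \
    --input samplesheet.csv \
    --trim_qual 20 \
    --parsemq 30 \
    --resolutions 1000000,100000,10000 \
    --min_res 10000 \
    --outdir results
```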
```
.
├── pipeline_info
│   ├── execution_report_2023-02-17_23-57-34.html
│   ├── execution_timeline_2023-02-17_23-57-34.html
│   ├── execution_trace_2023-02-17_23-57-34.txt
│   ├── pipeline_dag_2023-02-17_23-57-34.html
│   ├── samplesheet.valid.csv
│   └── software_versions.yml
└── samples
    └── TEST
        ├── fastq2pairs
        │   ├── TESTA
        │   │   ├── TESTA.cram
        │   │   ├── TESTA.cram.crai
        │   │   ├── TESTA.dedup.stats.txt
        │   │   ├── TESTA.fastp.html
        │   │   ├── TESTA.fastp.json
        │   │   ├── TESTA.pairs.cram
        │   │   ├── TESTA.pairs.cram.crai
        │   │   └── TESTA.pairs.gz
        │   ├── TESTB
        │   │   ├── TESTB.cram
        │   │   ├── TESTB.cram.crai
        │   │   ├── TESTB.dedup.stats.txt
        │   │   ├── TESTB.fastp.html
        │   │   ├── TESTB.fastp.json
        │   │   ├── TESTB.pairs.cram
        │   │   ├── TESTB.pairs.cram.crai
        │   │   └── TESTB.pairs.gz
        │   ├── TESTC
        │   │   ├── TESTC.cram
        │   │   ├── TESTC.cram.crai
        │   │   ├── TESTC.dedup.stats.txt
        │   │   ├── TESTC.fastp.html
        │   │   ├── TESTC.fastp.json
        │   │   ├── TESTC.pairs.cram
        │   │   ├── TESTC.pairs.cram.crai
        │   │   └── TESTC.pairs.gz
        │   └── merged
        │       ├── TEST.pairs.cram
        │       ├── TEST.pairs.cram.crai
        │       └── TEST.pairs.gz
        └── mcool
            ├── TEST.mapq_1.mcool
            └── TEST.mapq_30.mcool
```
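To sanity-check a produced mcool, `cooler` can list the per-resolution coolers stored inside it (assuming `cooler` is available in your environment):

```bash
# Each line of output is one resolution stored in the multi-resolution file.
cooler ls samples/TEST/mcool/TEST.mapq_1.mcool
```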
- The pipeline is developed and optimized to run on the WashU RIS (LSF) HPC, but it can be deployed in any HPC environment supported by Nextflow.
- The pipeline does NOT support Conda.
- The test workflow can be run on a personal computer, but this is not advised; it is recommended to test in an environment with at least 16 GB of memory. If the test workflow fails (especially at the fastq2pair step), try re-running with more allocated resources (see the sketch below). Such errors are usually caused by broken pipes when memory is maxed out, since the pipeline chains several piped steps to avoid writing large intermediate files.
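If the test run does hit memory limits, one hedged way to raise resources is a small custom config passed with `-c`; the values here are illustrative assumptions:

```bash
# Illustrative only: raise the default process memory, then re-run the test profile.
cat > more_mem.config <<'EOF'
process {
    memory = '16 GB'
}
EOF
nextflow run dhslab/nf-core-hic -profile test,YOURPROFILE(S) -c more_mem.config --outdir results
```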