- Create a references/hg38 subfolder
- Download the FASTA file from the ENCODE project into the references/hg38 folder and gunzip it (https://www.encodeproject.org/files/GRCh38_no_alt_analysis_set_GCA_000001405.15/@@download/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.gz)
- Within the hg38 subfolder create the bowtie2 index:
bowtie2-build GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta GRCh38_no_alt_analysis_set_GCA_000001405.15
- Within the references subfolder, download and gunzip the GENCODE annotations: https://ftp.ebi.ac.uk/pub/databases/gencode/Gencode_human/release_31/gencode.v31.basic.annotation.gtf.gz
- In the references folder, create a fai index using
samtools faidx hg38/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta
- Extract the chromosome sizes
cut -f1,2 hg38/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta.fai > hg38.chrom.sizes
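The two commands above can be sanity-checked on a mock index. `samtools faidx` writes five tab-separated columns (name, length, byte offset, bases per line, bytes per line), and the chrom.sizes format keeps only the first two:

```shell
# Sketch on a mock .fai (not the real index) to show what the cut produces;
# the chr1/chr2 lengths below are the actual GRCh38 chromosome lengths.
printf 'chr1\t248956422\t112\t70\t71\nchr2\t242193529\t252513167\t70\t71\n' > mock.fa.fai

# Keep columns 1 and 2: <name><TAB><length> per line.
cut -f1,2 mock.fa.fai > mock.chrom.sizes
cat mock.chrom.sizes
# chr1	248956422
# chr2	242193529
```

Running the same `cut -f1,2` on the real `.fai` yields the hg38.chrom.sizes file used by later steps.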
- In the references folder, download the Ensembl Regulatory Build GFF (ftp://ftp.ensembl.org/pub/release-98/regulation/homo_sapiens/homo_sapiens.GRCh38.Regulatory_Build.regulatory_features.20190329.gff.gz)
- Parse the regulatory build file
python pipeline/parse_reg_build_file.py references/homo_sapiens.GRCh38.Regulatory_Build.regulatory_features.20190329.gff.gz references/hg38.chrom.sizes
- In the references folder, download and gunzip the hg38_gencode_tss_unique.bed file from the official ENCODE repository: https://storage.googleapis.com/encode-pipeline-genome-data/hg38/ataqc/hg38_gencode_tss_unique.bed.gz
- In the references folder, download and gunzip the hg38.blacklist.bed file from the official ENCODE repository: https://storage.googleapis.com/encode-pipeline-genome-data/hg38/hg38.blacklist.bed.gz
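After the downloads, a quick format check can catch a truncated or mis-decompressed file. This is a hedged sketch run on a mock BED; in practice, point it at references/hg38_gencode_tss_unique.bed and references/hg38.blacklist.bed:

```shell
# Mock BED stands in for the downloaded files in this sketch.
printf 'chr1\t10000\t10600\nchrX\t500\t1200\n' > mock.bed

# A valid BED line has at least 3 tab-separated fields with numeric
# start/end coordinates; exit non-zero (no "OK") if any line fails.
awk -F'\t' 'NF < 3 || $2 !~ /^[0-9]+$/ || $3 !~ /^[0-9]+$/ { bad = 1 }
            END { exit bad }' mock.bed && echo "BED format OK"
```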
- Edit the paths in the pipeline/atac/atacseq.yaml file to point to the newly created reference files and to the location of the SPP script
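As a guide, the edited file should end up pointing at the files created above. The key names below are hypothetical (the actual keys are defined in pipeline/atac/atacseq.yaml itself); only the paths come from the previous steps:

```yaml
# Hypothetical excerpt — key names may differ in the real atacseq.yaml;
# the point is that each reference path matches a file created above.
resources:
  genome_fasta: references/hg38/GRCh38_no_alt_analysis_set_GCA_000001405.15.fasta
  bowtie2_index: references/hg38/GRCh38_no_alt_analysis_set_GCA_000001405.15
  chrom_sizes: references/hg38.chrom.sizes
  tss_bed: references/hg38_gencode_tss_unique.bed
  blacklist_bed: references/hg38.blacklist.bed
```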
- Create the conda environments
conda env create -f ./pipeline/env_config/pipeline_env.yml
conda env create -f ./notebooks/notebooks_env.yml
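Note that `conda env create` takes its package list (including the Python 2.7 pin) from the environment file, not from the command line. A hypothetical excerpt of pipeline/env_config/pipeline_env.yml, sketched only to show where the pin lives (the real dependency list is in the repository):

```yaml
# Hypothetical excerpt of pipeline/env_config/pipeline_env.yml;
# only the structure is sketched here, the env name matches the
# "conda activate bcg_pipeline" step below.
name: bcg_pipeline
channels:
  - bioconda
  - conda-forge
dependencies:
  - python=2.7
```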
- On the LUSTRE cluster, load the relevant modules and activate the environment
source ./pipeline/env_config/activate_env.sh
conda activate bcg_notebooks
- Start Jupyter Lab and check the connection string in the jupyterlab.err logfile
sbatch notebooks/jupyter_lab.sh
- Run the notebooks/0000.01-Prepare_pipeline_input.ipynb notebook to generate the annotations needed to run the pipeline
- Activate the pipeline environment
conda activate bcg_pipeline
- Run the pipeline for all samples
looper run ./pipeline/bcg_pipeline.yaml
- Summarize the results for all samples
looper summarize ./pipeline/bcg_pipeline.yaml
The notebooks must be run within Jupyter Lab launched from the "bcg_notebooks" environment.
- Create the complete_metadata file using the "0001.01-Create_Annotations" notebook
- Run QC to set the QC flag using the "0001.02-QC.stats" notebook
- Run Quantification (count matrix), Binary Quantification (binary matrix) and median signal tracks (bigWig) using the "0001.03-Quantification" notebook
- To create the configuration files for the peak annotation software UROPA, use the "0001.04.a-Features_analysis" notebook
- Run the peak annotation software jobs:
ls data/quantification/characterization_ALL_V4/*sub | while read script; do sbatch "$script"; done
- To combine the results of peak annotation, use the "0001.04.b-Features_analysis" notebook