This is a repository for the Snakemake version of the bash RNASeq pipeline, compatible with Clemson University's Center for Human Genetics (CUCHG) High Performance Computing (HPC) cluster.
- slurm/config.yaml: config file for HPC architecture and slurm compatibility
- snakemake_submitter.sh: activates the conda environment and submits the snakemake job to slurm
- initiator.sh: sets up the directory and launches snakemake_submitter.sh
- Snakefile: the pipeline
- RNASeq.yaml: environment variables for the pipeline
If you use this pipeline, please cite the following:
- COBRE Grant (P20 GM139767) for support for use of Clemson University Center for Human Genetics Research Core facilities
- Clemson University Center for Human Genetics Bioinformatics and Statistics Core
Only install the following if you are not running the pipeline on CUCHG's HPC:
- Anaconda3/miniconda3
- snakemake
- fastp
- java_jdk/>=1.8
- bbmap/>=38.73
- gmap_gsnap
- SalmonTE/>=0.4
- samtools/>=1.10
- subread/>=1.6.4
- slurm
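If you are installing the dependencies yourself, most of these tools are commonly distributed through the conda-forge and bioconda channels. The sketch below is one hypothetical way to set up an environment; the exact package names (e.g. gmap for gmap_gsnap) and channel availability are assumptions, and SalmonTE is typically installed separately from its GitHub repository:

```shell
# Hypothetical environment setup -- package and channel names are assumptions,
# not CUCHG-specific instructions
conda create -n rnaseq -c conda-forge -c bioconda \
    snakemake fastp "bbmap>=38.73" gmap "samtools>=1.10" "subread>=1.6.4"
conda activate rnaseq
```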
In general, fill in the information enclosed by "<>" in the files below:
- slurm/config.yaml: fill in all "<>" placeholders
- RNASeq.yaml: fill out all information except EXT
- snakemake_submitter.sh:
- sbatch parameters:
- add job name (string)
- partition name (string)
- time (in Hr:Min:Sec format)
- output and error (add path to working directory, same as DEST from RNASeq.yaml, but leave the /log... parts unchanged)
- mail-user (add user email address)
- cd line: add path to working directory (same as DEST from RNASeq.yaml)
- source line: add path to the conda initiation script (conda.sh) to select the correct conda
- conda activate line: add the name of the environment with a working snakemake installation (on Secretariat it is "snakemake")
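As an illustration, the top of snakemake_submitter.sh might look like the following once filled in. All concrete values here are hypothetical placeholders, not CUCHG-specific settings (%j is slurm's substitution for the job ID):

```shell
#!/bin/bash
#SBATCH --job-name=rnaseq_run               # job name (string; hypothetical)
#SBATCH --partition=<partition name>        # partition name (string)
#SBATCH --time=72:00:00                     # time in Hr:Min:Sec (hypothetical)
#SBATCH --output=<DEST>/log/output_%j.txt   # working directory + /log..., unchanged
#SBATCH --error=<DEST>/log/error_%j.txt
#SBATCH --mail-user=<user email address>

cd <DEST>                      # working directory (same as DEST from RNASeq.yaml)
source <path to conda.sh>      # conda initiation script
conda activate snakemake       # environment with a working snakemake installation
```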
- Open an ssh shell (using MobaXterm or PuTTY) on the head/master/login node
- Make a working directory for the analysis and git clone this repository:
git clone https://github.com/chg-bsl/snakemake_rnaseq.git
- Copy Snakefile, snakemake_submitter.sh, RNASeq.yaml, slurm/config.yaml and initiator.sh to working directory
- Make sure the variables encompassed by "<>" in slurm/config.yaml, RNASeq.yaml and snakemake_submitter.sh have been modified to reflect info specific to your run (eg: working directory, raw data location, etc)
- Open an ssh shell and run:
Generate the DAG figure:

##Initialize the correct conda and bring conda into the bash environment
source <path to conda initialization script>
##Activate the conda environment containing the snakemake installation
conda activate <snakemake conda environment>
cd <working directory containing analysis pipeline files>
snakemake -n -p -s Snakefile --configfile RNASeq.yaml --profile slurm --dag | dot | display

Generate the workflow:

snakemake -n -p -s Snakefile --configfile RNASeq.yaml --profile slurm
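If no X11 display is available in your ssh session, the DAG can be written to an image file instead and copied to your local machine. This is a sketch that assumes graphviz's dot is on your PATH:

```shell
# Render the DAG to an SVG file instead of displaying it interactively
snakemake -n -p -s Snakefile --configfile RNASeq.yaml --profile slurm --dag | dot -Tsvg > dag.svg
```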
If step 5 in the test run (the Generate the DAG figure and Generate the workflow commands) did not produce any errors (red text), run:
./initiator.sh
There are three places to check for progress:
- squeue (e.g. squeue -u <username>): lists your queued and running jobs on the cluster
- This pipeline (when run successfully) will create log and logs_slurm directories within the working directory. In the log directory, look for output_<job_ID>.txt and error_<job_ID>.txt for the current status of the run. When the run is successful, the last line should read x of x steps (100%) done.
- In the logs_slurm directory, the most recent log files, whose names include specific rule names, show the current status of each rule.
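The completion check described above can be scripted. This sketch demonstrates the pattern on a sample last line (in a real run you would read the last line of log/output_<job_ID>.txt; the step count here is hypothetical):

```shell
# Sample last line in the form the pipeline writes on success
last_line="42 of 42 steps (100%) done"

# grep -q exits 0 only if the pattern is found
if printf '%s\n' "$last_line" | grep -q "steps (100%) done"; then
    echo "run complete"
else
    echo "still running or failed"
fi
```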
- To do: add a link to an image showing what the directory should look like
- To do: create a rule that greps "module add" lines to summarize session info