New workflow to annotate DRAGEN output and identify reads mapping to transgenic sequences #1

dhspence · 2022-07-19T12:19:58Z

We need a workflow to do the following, starting with output from the DRAGEN tumor/normal analysis. Here is an example location with the files that will be used for this analysis:

/storage1/fs1/dspencer/Active/spencerlab/dhs/projects/cs1cart_wgs/WSCS1CART_Validation1May2022

The workflow should do this:

-Filter all passing variants from *.hard-filtered.vcf.gz, *.sv.vcf.gz, and *.cnv.vcf.gz into a new combined VCF file. Note that in the CNV file, records should be gain or loss (not DRAGEN:REF) and they can have the 'lowModelConfidence' filter flag, but no others.

-Annotate the above file with VEP

-Generate a text file from both VCF files so we can open them in Excel, etc.; see: /storage1/fs1/dspencer/Active/spencerlab/dhs/scripts/vep2txt.py

-Extract and count hits to transgene sequences, if provided as input. Here is a command line way to do this (running in the above directory):

samtools view -F 0x400 -T ../hg38cs1car.fa WSCS1CART_Validation1May2022.cram cs1car | awk -v WIN=$WINDOWSIZE '$7!="=" && $5>0 { print $7,sprintf("%d",$8/WIN)*WIN; }' | sort | uniq -c | awk '{ print $2,$3,$3+1,$1; }' | awk '{ print $1,$2,$3,"INS","+","sv"c++"-"$4; }' > carhits.svformat.bed

(where $WINDOWSIZE is the size in bp over which to collapse multiple hits into 1)
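For reference, the binning done by the awk/sort/uniq pipeline can be sketched in Python. This is only an illustration of the logic, assuming tab-delimited SAM records as printed by samtools view; the function name bin_mate_hits is made up for this sketch, and the row order follows Python's tuple sort, which can differ from the lexical order of shell sort.

```python
from collections import Counter


def bin_mate_hits(sam_lines, window):
    """Collapse mate positions of transgene-mapped reads into window-sized bins.

    Hypothetical helper: mirrors the awk/sort/uniq pipeline, not an existing tool.
    """
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip SAM header lines, if any were included
        fields = line.rstrip("\n").split("\t")
        mapq, rnext, pnext = int(fields[4]), fields[6], int(fields[7])
        # Same test as the awk step: mate on a different contig ($7 != "=")
        # and mapping quality above zero ($5 > 0).
        if rnext != "=" and mapq > 0:
            # Round the mate position down to the start of its window.
            counts[(rnext, pnext // window * window)] += 1

    rows = []
    for index, ((chrom, start), n_reads) in enumerate(sorted(counts.items())):
        # Columns: chrom, start, end, type, strand, id ("sv<index>-<read count>").
        rows.append((chrom, start, start + 1, "INS", "+", f"sv{index}-{n_reads}"))
    return rows
```

Each returned row can be written out space-separated, e.g. print(*row), to get the same column layout as carhits.svformat.bed.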

Then annotate the carhits.svformat.bed file with VEP:

/usr/bin/perl -I /opt/vep/lib/perl/VEP/Plugins /opt/vep/src/ensembl-vep/vep --plugin Downstream --fasta $HG38 --hgvs --symbol --term SO --flag_pick -i carhits.svformat.bed --offline --cache --max_af --dir /storage1/fs1/gtac-mgi/Active/CLE/reference/VEP_cache -o carhits.svformat.output.bed
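The record-selection rule in the first step above (passing variants only, with the CNV exception) could look roughly like the sketch below. It assumes plain tab-delimited VCF data lines and that reference-state CNV segments have an ID beginning with DRAGEN:REF; keep_record and CNV_ALLOWED_FILTERS are names invented for this example.

```python
# Filters tolerated on CNV records in addition to PASS (per the description above).
CNV_ALLOWED_FILTERS = {"lowModelConfidence"}


def keep_record(line, is_cnv=False):
    """Return True if a VCF data line belongs in the combined VCF.

    Hypothetical helper for illustration only.
    """
    fields = line.rstrip("\n").split("\t")
    record_id, filter_field = fields[2], fields[6]
    # Only an explicit PASS counts as "no filters"; "." (unfiltered) is rejected.
    filters = set() if filter_field == "PASS" else set(filter_field.split(";"))

    if not is_cnv:
        return not filters

    # Assumption: reference-state segments carry an ID starting with DRAGEN:REF.
    if record_id.startswith("DRAGEN:REF"):
        return False
    # Gain/loss records may carry lowModelConfidence, but no other filter.
    return filters <= CNV_ALLOWED_FILTERS
```

The small-variant and SV files would be passed through with is_cnv=False, and the CNV file with is_cnv=True, before concatenating and sorting the kept records.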

After this analysis, the data should be used to make the figures/tables in the attached document:

CS1-Validation-1-analysis.docx

It would be good to make the read hit analysis optional in the workflow.

dhspence assigned miller-alexander Jul 19, 2022
