New workflow to annotate DRAGEN output and identify reads mapping to transgenic sequences #1

Open
dhspence opened this issue Jul 19, 2022 · 0 comments
We need a workflow to do the following, starting with output from the DRAGEN tumor/normal analysis. Here is an example location with the files that will be used for this analysis:

/storage1/fs1/dspencer/Active/spencerlab/dhs/projects/cs1cart_wgs/WSCS1CART_Validation1May2022

The workflow should do this:

-Filter all passing variants from *.hard-filtered.vcf.gz, *.sv.vcf.gz, and *.cnv.vcf.gz into a new combined VCF file. In the CNV file, keep only gain or loss records (not DRAGEN:REF); these records may carry the 'lowModelConfidence' filter flag, but no other filter flags.

-Annotate the above file with VEP

-Generate a text file from both VCF files so we can open it in Excel, etc.; see: /storage1/fs1/dspencer/Active/spencerlab/dhs/scripts/vep2txt.py

-Extract and count hits to transgene sequences, if provided as input. Here is a command line way to do this (running in the above directory):

samtools view -F 0x400 -T ../hg38cs1car.fa WSCS1CART_Validation1May2022.cram cs1car | awk -v WIN=$WINDOWSIZE '$7!="=" && $5>0 { print $7,sprintf("%d",$8/WIN)*WIN; }' | sort | uniq -c | awk '{ print $2,$3,$3+1,$1; }' | awk '{ print $1,$2,$3,"INS","+","sv"c++"-"$4; }' > carhits.svformat.bed

(where $WINDOWSIZE is the size in bp over which to collapse multiple hits into 1)

Then annotate the carhits.svformat.bed file with VEP:

/usr/bin/perl -I /opt/vep/lib/perl/VEP/Plugins /opt/vep/src/ensembl-vep/vep --plugin Downstream --fasta $HG38 --hgvs --symbol --term SO --flag_pick -i carhits.svformat.bed --offline --cache --max_af --dir /storage1/fs1/gtac-mgi/Active/CLE/reference/VEP_cache -o carhits.svformat.output.bed
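For anyone porting the hit-counting step into the workflow, here is a rough Python sketch of what the samtools/awk pipeline above does to each SAM record: keep reads with MAPQ > 0 whose mate maps to a different contig, round the mate position down to a multiple of the window size, count reads per bin, and emit one line per bin. The function names `bin_mate_hits` and `to_bed_rows` are made up for illustration. The sketch assumes IDs are numbered from 0 and sorts bins by contig name and numeric position, which will not always match the text ordering of `sort` in the shell version.

```python
from collections import Counter


def bin_mate_hits(sam_lines, window):
    """Count reads per (mate contig, window start) bin.

    Mirrors the awk filter: RNEXT (field 7) must not be "=" and
    MAPQ (field 5) must be greater than 0.
    """
    counts = Counter()
    for line in sam_lines:
        if line.startswith("@"):
            continue  # skip SAM header lines
        fields = line.rstrip("\n").split("\t")
        mapq = int(fields[4])
        mate_chrom = fields[6]
        mate_pos = int(fields[7])
        if mate_chrom != "=" and mapq > 0:
            # Round the mate position down to the start of its window.
            window_start = mate_pos // window * window
            counts[(mate_chrom, window_start)] += 1
    return counts


def to_bed_rows(counts):
    """Format bins as: contig start start+1 INS + sv<index>-<read count>."""
    rows = []
    for i, ((chrom, start), n) in enumerate(sorted(counts.items())):
        rows.append(f"{chrom} {start} {start + 1} INS + sv{i}-{n}")
    return rows
```

In the workflow, `sam_lines` would come from the text output of `samtools view` on the transgene contig, and the rows would be written to carhits.svformat.bed before the VEP step.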

After this analysis, the data should be used to make the figures/tables in the attached document:

CS1-Validation-1-analysis.docx

It would be good to make the read hit analysis optional in the workflow.
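As a starting point for the filtering step described at the top of the list, the keep/drop rule could look something like the sketch below. It is only an illustration: `keep_record` and `ALLOWED_CNV_FLAGS` are hypothetical names, the `source` labels are arbitrary, and it assumes the DRAGEN:REF marker appears at the start of the VCF ID column for CNV records. A FILTER value of "." is treated as not passing here, which may or may not be what we want.

```python
# Filter flags tolerated on CNV records, per the rule in this issue.
ALLOWED_CNV_FLAGS = {"lowModelConfidence"}


def keep_record(source, record_id, filter_field):
    """Decide whether a VCF record goes into the combined file.

    source       -- "small", "sv", or "cnv" (hypothetical labels for the
                    three input VCFs)
    record_id    -- the VCF ID column
    filter_field -- the VCF FILTER column, e.g. "PASS" or "a;b"
    """
    if filter_field == ".":
        return False  # assumption: unfiltered records are not "passing"
    flags = set(filter_field.split(";")) - {"PASS"}
    if source == "cnv":
        # Assumption: reference-state segments have IDs starting with DRAGEN:REF.
        if record_id.startswith("DRAGEN:REF"):
            return False
        # Gain/loss records may carry lowModelConfidence, but nothing else.
        return flags <= ALLOWED_CNV_FLAGS
    # Small variants and SVs must have no filter flags at all.
    return not flags
```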
