Skip to content

Commit

Permalink
ExAC minus somatic; travis CI tests
Browse files Browse the repository at this point in the history
  • Loading branch information
ckandoth committed Nov 1, 2015
1 parent ab47fcc commit 697177c
Show file tree
Hide file tree
Showing 8 changed files with 2,587 additions and 90,521 deletions.
30 changes: 27 additions & 3 deletions .travis.yml
Original file line number Diff line number Diff line change
@@ -1,6 +1,30 @@
# Reference: http://docs.travis-ci.com/user/languages/perl/
language: perl
perl:
- "5.18"
- "5.14"
- "5.10"
- "5.22"

branches:
only:
- master

env:
- VEP_PATH=~/vep
- VEP_DATA=~/.vep
- PERL5LIB=$VEP_PATH:$PERL5LIB
- PATH=$VEP_PATH/htslib:$PATH

before_install:
- sudo apt-get install -y curl rsync tar make perl perl-base
- cpanm --notest LWP::Simple LWP::Protocol::https Archive::Extract Archive::Tar Archive::Zip CGI DBI Time::HiRes

install:
- mkdir $VEP_PATH $VEP_DATA
- cd $VEP_PATH
- curl -LO https://github.com/Ensembl/ensembl-tools/archive/release/82.tar.gz
- tar -zxf 82.tar.gz --starting-file variant_effect_predictor --transform='s|.*/|./|g'
- perl INSTALL.pl --AUTO acf --SPECIES homo_sapiens --ASSEMBLY GRCh37 --DESTDIR $VEP_PATH --CACHEDIR $VEP_DATA
- perl convert_cache.pl --species homo_sapiens --version 82_GRCh37 --dir $VEP_DATA
- cd $TRAVIS_BUILD_DIR

script:
- perl maf2maf.pl --input-maf data/test.maf --output-maf test.vep.maf --tmp-dir anno_files --custom-enst data/isoform_overrides_at_mskcc --vep-forks 1
18 changes: 4 additions & 14 deletions README.md
Original file line number Diff line number Diff line change
Expand Up @@ -91,24 +91,14 @@ Convert the offline cache for use with tabix, that significantly speeds up the l

perl convert_cache.pl --species homo_sapiens,mus_musculus --version 82_GRCh37,82_GRCh38,82_GRCm38 --dir $VEP_DATA

Download and tabix-index the VCF needed by VEP's ExAC plugin:
Download and index a custom ExAC r0.3 VCF, that skips variants overlapping known somatic hotspots:

cd $VEP_DATA
curl -LO ftp://ftp.broadinstitute.org/pub/ExAC_release/release0.3/ExAC.r0.3.sites.vep.vcf.gz
tabix -p vcf ExAC.r0.3.sites.vep.vcf.gz

Download and tabix-index the datafile needed by VEP's dbNSFP plugin:

cd $VEP_DATA
curl -L ftp://[email protected]/dbNSFPv3.0c.zip > dbNSFP/dbNSFPv3.0c.zip
unzip dbNSFPv3.0c.zip
cat dbNSFP*chr* | bgzip -c > dbNSFP.gz
tabix -s 1 -b 2 -e 2 dbNSFP.gz
curl -L https://googledrive.com/host/0B6o74flPT8FAYnBJTk9aTF9WVnM > $VEP_DATA/ExAC.r0.3.sites.minus_somatic.vcf.gz
tabix -p vcf $VEP_DATA/ExAC.r0.3.sites.minus_somatic.vcf.gz

Test running VEP in offline mode with the ExAC plugin, on the provided sample GRCh37 VCF:

cd $VEP_PATH
perl variant_effect_predictor.pl --species homo_sapiens --assembly GRCh37 --offline --no_progress --everything --shift_hgvs 1 --check_existing --check_alleles --total_length --allele_number --no_escape --xref_refseq --dir $VEP_DATA --fasta $VEP_DATA/homo_sapiens/82_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa --plugin ExAC,$VEP_DATA/ExAC.r0.3.sites.vep.vcf.gz --input_file example_GRCh37.vcf --output_file example_GRCh37.vep.txt
perl variant_effect_predictor.pl --species homo_sapiens --assembly GRCh37 --offline --no_progress --everything --shift_hgvs 1 --check_existing --check_alleles --total_length --allele_number --no_escape --xref_refseq --dir $VEP_DATA --fasta $VEP_DATA/homo_sapiens/82_GRCh37/Homo_sapiens.GRCh37.75.dna.primary_assembly.fa --plugin ExAC,$VEP_DATA/ExAC.r0.3.sites.minus_somatic.vcf.gz --input_file example_GRCh37.vcf --output_file example_GRCh37.vep.txt

Authors
-------
Expand Down
90,492 changes: 0 additions & 90,492 deletions data/tcga_brca.maf

This file was deleted.

2,546 changes: 2,546 additions & 0 deletions data/tcga_laml.maf

Large diffs are not rendered by default.

1 change: 1 addition & 0 deletions docs/vep_maf_readme.txt
Original file line number Diff line number Diff line change
Expand Up @@ -86,3 +86,4 @@ http://useast.ensembl.org/info/docs/tools/vep/vep_formats.html#output
105. ExAC_AF_NFE - Non-Finnish European Allele Frequency from ExAC
106. ExAC_AF_OTH - Other Allele Frequency from ExAC
107. ExAC_AF_SAS - South Asian Allele Frequency from ExAC
108. GENE_PHENO - Indicates if gene that the variant maps to is associated with a phenotype, disease or trait
6 changes: 3 additions & 3 deletions maf2maf.pl
Original file line number Diff line number Diff line change
Expand Up @@ -39,7 +39,7 @@
AMR_MAF ASN_MAF EAS_MAF EUR_MAF SAS_MAF AA_MAF EA_MAF CLIN_SIG SOMATIC PUBMED MOTIF_NAME
MOTIF_POS HIGH_INF_POS MOTIF_SCORE_CHANGE IMPACT PICK VARIANT_CLASS TSL HGVS_OFFSET PHENO
MINIMISED ExAC_AF ExAC_AF_AFR ExAC_AF_AMR ExAC_AF_EAS ExAC_AF_FIN ExAC_AF_NFE ExAC_AF_OTH
ExAC_AF_SAS );
ExAC_AF_SAS GENE_PHENO );

# Check for missing or crappy arguments
unless( @ARGV and $ARGV[0]=~m/^-/ ) {
Expand Down Expand Up @@ -111,12 +111,12 @@
( -s $ref_fasta ) or die "ERROR: Reference FASTA not found: $ref_fasta\n";

# Contruct VEP command using some default options and run it
my $vep_cmd = "$perl_bin $vep_path/variant_effect_predictor.pl --species $species --assembly $ncbi_build --offline --no_progress --no_stats --sift b --ccds --uniprot --hgvs --symbol --numbers --domains --regulatory --canonical --protein --biotype --uniprot --tsl --pubmed --variant_class --shift_hgvs 1 --check_existing --check_alleles --check_ref --total_length --allele_number --no_escape --xref_refseq --failed 1 --vcf --minimal --flag_pick_allele --pick_order canonical,tsl,biotype,rank,ccds,length --dir $vep_data --fasta $ref_fasta --input_file $vcf_file --output_file $vep_anno";
my $vep_cmd = "$perl_bin $vep_path/variant_effect_predictor.pl --species $species --assembly $ncbi_build --offline --no_progress --no_stats --sift b --ccds --uniprot --hgvs --symbol --numbers --domains --gene_phenotype --regulatory --canonical --protein --biotype --uniprot --tsl --pubmed --variant_class --shift_hgvs 1 --check_existing --check_alleles --check_ref --total_length --allele_number --no_escape --xref_refseq --failed 1 --vcf --minimal --flag_pick_allele --pick_order canonical,tsl,biotype,rank,ccds,length --dir $vep_data --fasta $ref_fasta --input_file $vcf_file --output_file $vep_anno";
$vep_cmd .= " --fork $vep_forks" if( $vep_forks > 1 ); # VEP barks if it's set to 1
# Add options that only work on human variants
$vep_cmd .= " --polyphen b --gmaf --maf_1kg --maf_esp" if( $species eq "homo_sapiens" );
# Add options that only work on human variants mapped to the GRCh37 reference genome
$vep_cmd .= " --plugin ExAC,$vep_data/ExAC.r0.3.sites.vep.vcf.gz" if( $species eq "homo_sapiens" and $ncbi_build eq "GRCh37" );
$vep_cmd .= " --plugin ExAC,$vep_data/ExAC.r0.3.sites.minus_somatic.vcf.gz" if( $species eq "homo_sapiens" and $ncbi_build eq "GRCh37" );

# Make sure it ran without error codes
system( $vep_cmd ) == 0 or die "\nERROR: Failed to run the VEP annotator!\nCommand: $vep_cmd\n";
Expand Down
4 changes: 2 additions & 2 deletions maf2vcf.pl
Original file line number Diff line number Diff line change
Expand Up @@ -147,7 +147,7 @@
my $prefix_bp = `$samtools faidx $ref_fasta $chr:$pos-$pos | grep -v ^\\>`;
chomp( $prefix_bp );
$prefix_bp = uc( $prefix_bp );
( $prefix_bp =~ m/^[ACGTN]$/ ) or die "ERROR: Cannot retreive bp at $chr:$pos! Please specify --ref-fasta appropriately\n";
( $prefix_bp =~ m/^[ACGTN]$/ ) or die "ERROR: Cannot retreive bp at $chr:$pos! Please check that you chose the right reference genome with --ref-fasta:\n$ref_fasta\n";
# Blank out the dashes (or other weird chars) used with indels, and prefix the fetched bp
( $ref, $al1, $al2, $n_al1, $n_al2 ) = map{s/^(\?|-|0)+$//; $_=$prefix_bp.$_} ( $ref, $al1, $al2, $n_al1, $n_al2 );
}
Expand Down Expand Up @@ -263,7 +263,7 @@ =head1 OPTIONS
=head1 DESCRIPTION
This script breaks down variants in a MAF into VCFs for each tumor-normal pair, in preparation for annotation with vcf2maf
This script breaks down variants in a MAF into VCFs for each tumor-normal pair, in preparation for annotation with VEP. Creates a multi-sample VCF in addition to per-TN pair.
=head2 Relevant links:
Expand Down
11 changes: 4 additions & 7 deletions vcf2maf.pl
Original file line number Diff line number Diff line change
Expand Up @@ -214,22 +214,19 @@ sub GetBiotypePriority {
$output_vcf =~ s/(\.vcf)*$/.vep.vcf/;

# Skip running VEP if an annotated VCF already exists
if( -s $output_vcf ) {
warn "WARNING: Annotated VCF already exists ($output_vcf). Skipping re-annotation.\n";
}
else {
unless( -s $output_vcf ) {
warn "STATUS: Running VEP and writing to: $output_vcf\n";
# Make sure we can find the VEP script and the reference FASTA
( -s "$vep_path/variant_effect_predictor.pl" ) or die "ERROR: Cannot find VEP script variant_effect_predictor.pl in path: $vep_path\n";
( -s $ref_fasta ) or die "ERROR: Reference FASTA not found: $ref_fasta\n";

# Contruct VEP command using some default options and run it
my $vep_cmd = "$perl_bin $vep_path/variant_effect_predictor.pl --species $species --assembly $ncbi_build --offline --no_progress --no_stats --sift b --ccds --uniprot --hgvs --symbol --numbers --domains --regulatory --canonical --protein --biotype --uniprot --tsl --pubmed --variant_class --shift_hgvs 1 --check_existing --check_alleles --check_ref --total_length --allele_number --no_escape --xref_refseq --failed 1 --vcf --minimal --flag_pick_allele --pick_order canonical,tsl,biotype,rank,ccds,length --dir $vep_data --fasta $ref_fasta --input_file $input_vcf --output_file $output_vcf";
my $vep_cmd = "$perl_bin $vep_path/variant_effect_predictor.pl --species $species --assembly $ncbi_build --offline --no_progress --no_stats --sift b --ccds --uniprot --hgvs --symbol --numbers --domains --gene_phenotype --regulatory --canonical --protein --biotype --uniprot --tsl --pubmed --variant_class --shift_hgvs 1 --check_existing --check_alleles --check_ref --total_length --allele_number --no_escape --xref_refseq --failed 1 --vcf --minimal --flag_pick_allele --pick_order canonical,tsl,biotype,rank,ccds,length --dir $vep_data --fasta $ref_fasta --input_file $input_vcf --output_file $output_vcf";
$vep_cmd .= " --fork $vep_forks" if( $vep_forks > 1 ); # VEP barks if it's set to 1
# Add options that only work on human variants
$vep_cmd .= " --polyphen b --gmaf --maf_1kg --maf_esp" if( $species eq "homo_sapiens" );
# Add options that only work on human variants mapped to the GRCh37 reference genome
$vep_cmd .= " --plugin ExAC,$vep_data/ExAC.r0.3.sites.vep.vcf.gz" if( $species eq "homo_sapiens" and $ncbi_build eq "GRCh37" );
$vep_cmd .= " --plugin ExAC,$vep_data/ExAC.r0.3.sites.minus_somatic.vcf.gz" if( $species eq "homo_sapiens" and $ncbi_build eq "GRCh37" );

# Make sure it ran without error codes
system( $vep_cmd ) == 0 or die "\nERROR: Failed to run the VEP annotator!\nCommand: $vep_cmd\n";
Expand Down Expand Up @@ -259,7 +256,7 @@ sub GetBiotypePriority {
EXON INTRON DOMAINS GMAF AFR_MAF AMR_MAF ASN_MAF EAS_MAF EUR_MAF SAS_MAF AA_MAF EA_MAF CLIN_SIG
SOMATIC PUBMED MOTIF_NAME MOTIF_POS HIGH_INF_POS MOTIF_SCORE_CHANGE IMPACT PICK VARIANT_CLASS
TSL HGVS_OFFSET PHENO MINIMISED ExAC_AF ExAC_AF_AFR ExAC_AF_AMR ExAC_AF_EAS ExAC_AF_FIN
ExAC_AF_NFE ExAC_AF_OTH ExAC_AF_SAS );
ExAC_AF_NFE ExAC_AF_OTH ExAC_AF_SAS GENE_PHENO );
my @ann_cols_format; # To store the actual order of VEP data, that may differ between runs
push( @maf_header, @ann_cols );

Expand Down

0 comments on commit 697177c

Please sign in to comment.