Reproduces analyses and figures from "Putatively cancer-specific exon–exon junctions are shared across patients and present in developmental and other non-cancer cells" by David et al.
We ran Python scripts with Python 3.6.6, which we installed as part of Anaconda 5.2.0. We made use of the third-party modules intervaltree
2.1.0 and seaborn
0.9.0.
We ran R scripts with R 3.6.1 and made use of Data.table
1.12.2, ggplot2
3.2.1, gridExtra
2.3, RColorBrewer
1.1-2, dplyr
0.8.3, survminer
0.4.6, survival
2.44-1.1, ggpubr
0.2.4, and magrittr
1.5.
We further used Snaptron via its qs
utility. To obtain it, run
git clone https://github.com/ChristopherWilks/snaptron-experiments
We downloaded exon-exon junction BEDs for GTEx and TCGA and accompanying metadata from recount2:
- GTEx junction IDs (
GTEX_JUNCTION_BED
) - GTEx samples with junctions (
GTEX_JUNCTION_COVERAGE
) - GTEx phenotype file (
GTEX_PHEN
) - TCGA junction IDs (
TCGA_JUNCTION_BED
) - TCGA samples with junctions (
TCGA_JUNCTION_COVERAGE
) - TCGA phenotype file (
TCGA_PHEN
) - recount sample IDs (
RECOUNT_SAMPLE_IDS
)
We also used:
- GENCODE v28 (
GENCODE_GTF
) - TCGA tumor mutation burden calls (
TCGA_TMB
) - COSMIC's cancer gene census (after login, this should download as a CSV;
CANCER_GENE_CENSUS
) - OncoKB cancer gene list (
ONCOKB_GENES
) - Patient somatic mutation call files: “Mutation_Packager_Oncotated_Calls” tar.gz for each cancer type from http://gdac.broadinstitute.org/runs/stddata__2016_01_28/data/ (For full list of the direct file paths for mutation call file downloads see the file mutation_call_file_list.txt in the splicing_factor_mutation_analysis directory.)
- UniProt splicing-associated gene file (TSV): keyword:"mRNA splicing [KW-0508]" & Reviewed:yes & organism:"Homo sapiens (Human) [9606]" from https://www.uniprot.org/uniprot/
Note we also made use of metaSRA's ontology terms obtained via its API and available in https://github.com/JulianneDavid/shared-cancer-splicing/tree/master/SRA_junction_download/MetaSRA_run_files .
-
Create a new directory to store your database (mkdir
DB_DIR
) and then create TCGA and GTEx junction index by runningjx_indexer.py
inindex
mode:python3 jx_indexer.py -d DB_DIR index -c GTEx_JUNCTION_COVERAGE -C TCGA_JUNCTION_COVERAGE -b GTEx_JUNCTION_BED -B TCGA_JUNCTION_BED -p GTEx_PHEN -P TCGA_PHEN -s RECOUNT_SAMPLE_IDS -g GENCODE_ANNOTATION_GTF
Note: steps 1 and 2 require significant temporary file space to execute. Please ensure that your system has ~20GB available space in the directory where temporary files are stored, or assign temp file storage to a new directory. If your temporary directory does not have enough available space, you will be directed to re-run step 1 in "index-only" mode, the same command as above but with the addition of the "-i" flag. This is NOT neccessary if your first index mode (step 1) run was successful.
-
Run
jx_indexer.py
inexperiment
mode on junction index to collect junction files for analysis:python3 jx_indexer.py -d DB_DIR experiment -o OUTPUT_DIR
OUTPUT_DIR
will now contain:
all_jxs
(ALL_JX_DIR
) for use in set membership annotation;non-core-normal_jxs_per_sample
(NON_CORE_NORMAL_DIR
) andnon-tissue-matched_jxs_per_sample
(NON_TISSUE_MATCHED_NORMAL_DIR
) for use in set membership annotation;non-core-normal_counts_per_sample
(COUNTS_PER_SAMPLE_DIR
) (junctions not found in core normals) to generate Figure 1B;non-core-normal_jx_prevalences
(PREVALENCE_FILE_DIR
) (TCGA cancer junctions not found in core normals with cancer-type prevalence, per cancer type) to generate Figures 2A, 2B, 2C, and S2A;FigS1B_non-core-normal_counts_per_sample
(FILTERED_NCN_JX_PER_SAMPLE_DIR
) (coverage- and annotation-filtered junctions not found in GTEx or TCGA tissue-matched normal samples) andFigS1B_non-tissue-matched_counts_per_sample
(FILTERED_NTM_JX_PER_SAMPLE_DIR
) (coverage- and annotation-filtered junctions not found in core normals) to generate Figure S1B;jx_non-core-normal_jxs_per_sample
(NON_CORE_NORMAL_JX_COORD_DIR
) for use in Figures S1F, S1G, S1H, and S1I;- (
all_jxs_per_sample_paired_normals
)ALL_JXS_PER_SAMPLE_DIR
containing all junctions for TCGA SKCM normal samples, TCGA SKCM tumor samples, and GTEx bulk skin normal samples to generate Figure S2B.
-
Call this directory
METASRA_QUERY_FILES
. Run:python3 create_snaptron_queries.py -s METASRA_QUERY_FILES -i RECOUNT_SAMPLE_IDS -o LOGGING_OUTPUT_DIR -p SRA_JUNCTION_OUTPUT_DIR -u UTILITIES_DIR
where UTILITIES_DIRECTORY
is where snaptron-experiments
was previously cloned (see Requirements). SRA_JUNCTION_OUTPUT_DIR
is where the results will appear.
- Run each
SAMPLE_TYPE_indiv_chrom_batch_query_script.sh
inSRA_JUNCTION_OUTPUT_DIR
to collect SRA junctions.
SRA_JUNCTION_OUTPUT_DIR
will now contain:
SRA_noncancer_rawresults
(SNAPTRON_NONCANCER_DIR
) for non-cancer junction resultsSRA_noncancer_exptlists
(SNAPTRON_NONCANCER_EXPTLIST_DIR
) for non-cancer experiment listsSRA_cancer_rawresults
(SNAPTRON_CANCER_DIR
) for cancer junction resultsSRA_cancer_exptlists
(SNAPTRON_CANCER_EXPTLIST_DIR
) for cancer experiment lists
-
Run
collect_1-read_TCGA_jxs.py
to collect single-read TCGA junction collection:python3 collect_1-read_TCGA_jxs.py -C TCGA_JUNCTION_COVERAGE -B TCGA_JUNCTION_BED -o JSON_DIR
JSON_DIR
will now contain:
-
single-read-tcga-jxs_json.txt
containing single-read junctions for use in set membership annotation runs below, usingset_membership_annotation.py
, as follows:python3 set_membership_annotation.py --db-path DB_DIR --snaptron-results SNAPTRON_NONCANCER_DIR -d ALL_JX_DIR -g NON_CORE_NORMAL_DIR -p NON_TISSUE_MATCHED_NORMAL_DIR --gtf-file GENCODE_ANNOTATION_GTF --single-read-jx-json JSON_DIR/single-read-tcga-jxs_json.txt --cancer-sra-directory SNAPTRON_CANCER_DIR --cancer-gene-census CANCER_GENE_CENSUS --oncokb-cancer-genes ONCOKB_GENES --min-overall-set-count 1 -o FULL_PIECHART_DIR
FULL_PIECHART_DIR
will now contain:
unexplained
(UNEXPLAINED_DIR
) containing unexplained subset of all single-read junctionsdevelopmental
(DEVELOPMENTAL_DIR
) containing developmentally-occurring subset of all single-read junctions
To generate two-sample minimum junction sets (used in Figures S3E and S3F), run:
python3 set_membership_annotation.py --db-path DB_DIR --snaptron-results SNAPTRON_NONCANCER_DIR -d ALL_JX_DIR -g NON_CORE_NORMAL_DIR -p NON_TISSUE_MATCHED_NORMAL_DIR --gtf-file GENCODE_ANNOTATION_GTF --single-read-jx-json 1_READ_TCGA_JX_JSON --cancer-sra-directory SNAPTRON_CANCER_DIR --cancer-gene-census CANCER_GENE_CENSUS --oncokb-cancer-genes ONCOKB_GENES --min-overall-set-count 2 -o 2-SAMPLE_PIECHART_DIR
2-SAMPLE_PIECHART_DIR
will now contain:
unexplained
(UNEXPLAINED_2-SAMPLE_DIR
) containing unexplained subset of all two-sample junctionsdevelopmental
(DEVELOPMENTAL_2-SAMPLE_DIR
) containing developmentally-occurring subset of all two-sample junctions
-
To perform additional set membership analyses using
set_membership_analysis.py
(for use in Figures S1C and S1D), run:python3 set_membership_analysis.py -o OUTPUT_DIR -s FULL_PIECHART_DIR
FULL_PIECHART_DIR
will now contain:
unexplained
(UNEXPLAINED_DIR
) containing unexplained subset of all single-read junctionsdevelopmental
(DEVELOPMENTAL_DIR
) containing developmentally-occurring subset of all single-read junctionstrue_TCGA_prevalence_files
(TCGA_PREVALENCE_DIR
) containing non-core-normal junctions
To tally SRA experiments using count_unique_SRA_expts.py
, run:
python3 count_unique_SRA_expts.py -c SNAPTRON_CANCER_EXPTLIST_DIR -n SNAPTRON_NONCANCER_EXPTLIST_DIR -o FIGURE_OUTPUT_DIR
Fig 1A:
python3 fig1A_overall_set_barplots.py -s FULL_PIECHART_DIR -o FIGURE_OUTPUT_DIR
Fig 1B:
python3 fig1B_ncn_jx_counts_per_sample.py -j COUNTS_PER_SAMPLE_DIRECTORY -o FIGURE_OUTPUT_DIRECTORY
Fig 1C:
python3 fig1C_overall_set_prevalence_boxplot.py --set-memberships FULL_PIECHART_DIR -o FIGURE_OUTPUT_DIR
Fig 2A:
python3 fig2A_TCGA_heatmap.py -d PREVALENCE_FILE_DIR -o FIGURE_OUTPUT_DIR
Fig 2B:
python3 fig2B_TCGA_subtypes_heatmap.py -d PREVALENCE_FILE_DIR -o FIGURE_OUTPUT_DIR
Fig 2C:
python3 fig2C_cell_of_origin_heatmap.py -d PREVALENCE_FILE_DIR -o FIGURE_OUTPUT_DIR --snaptron-results SNAPTRON_NONCANCER_DIR -e SNAPTRON_NONCANCER_EXPTLIST_DIR
Figs 3A, S3A, S3E, and S3F:
(Contact on instructions for regenerating this figure: https://github.com/metamaden)
Instantiate the functions in upset_functions.R
, and make a new subdirectory (default name: cxdat
), and run get.datlist()
, specifying the name of the subdirectory containing the junction data tables. Once this has run, the working directory should contain a large list object containing the junction data tables, as well as a newly populated subdirectory (default: cxdat
). Depending on your session memory limit, you may need to remove the object cxdl
from your environment before proceeding.
Follow the remaining steps in upset_data.R
to: Define the new developmental
junction category; Run jx.calcsets()
for the supplemental data, then for the main figure data; Replace the unexplained
category in the main figure data with the supplemental figure data; Generate the supplemental data tables. To save memory, tables are read processively and processed.
Next, follow the steps in upset_plots_original.R
to: Specify the data table containing the whole/total group sets and subset group sets, respectively; Exclude the pre-filtered set columns from visualization; Properly order whole set columns by mean value; Automatically generate the axis limits and labels; Specify the group labels (should match whole set column labels after ordering on means); Generate the plot data objects; Generate the final plot images.
Following these steps should generate both a main and a supplemental upset plot with annotation set and subset abundances. It may be necessary to experiment with different plot and image dimensions in upset_plots_original.R
and make.ggset.final()
to refine the figure properties. For additional information, refer to the function docstrings and script comments.)
Fig 3B:
python3 fig3B_antisense_boxplot.py -o FIGURE_OUTPUT_DIR -s FULL_PIECHART_DIR
Fig S1A:
python3 figS1A_junction_burden_vs_TMB.py -o FIGURE_OUTPUT_DIR -j COUNTS_PER_SAMPLE_DIR -s RECOUNT_SAMPLE_IDS -p TCGA_PHEN -t TCGA_TMB
Fig S1B:
python3 fig1B_S1B_ncn_jx_counts_per_sample.py -d -j FILTERED_NCN_JX_PER_SAMPLE_DIR -p FILTERED_NTM_JX_PER_SAMPLE_DIR -d -g THYM CESC UVM DLBC --prepared-sort-order -o FIGURE_OUTPUT_DIR
Figs S1C and S1D:
python3 figS1CD_S3CD_junction_sharedness.py -d TCGA_PREVALENCE_DIR -o FIGURE_OUTPUT_DIR
Note: requires contents of TCGA_PREVALENCE_DIR
created by running set_membership_analysis.py in step 6 above.
Fig S1E:
python3 figS1E_set_prevalences_per_cancer.py -s FULL_PIECHART_DIR -o FIGURE_OUTPUT_DIR
Figs S1F, S1G, S1H, and S1I:
(Contact for instructions on regenerating this figure: https://github.com/weederb23)
-
Download the TCGA mutation files from http://gdac.broadinstitute.org/runs/stddata__2016_01_28/data/ for each cancer type and move into the sub-directory ./TCGA_mut_files/, where ./ is the location of "SF_mutation_script.R" and "SF_sharedness_plots.R". For full list of the direct file paths for mutation call file downloads see "mutation_call_file_list.txt"
-
Copy files from NON_CORE_NORMAL_JX_COORD_DIR to a new directory within ./ called "non_core_jxns/". Rename files from full cancer names to match TCGA cancer codes.
-
Copy files from [PIECHART DIRECTORY] to a new directory within ./ called "non_core_jxn_annotations/". Rename files from full cancer names to match TCGA cancer codes.
-
Run
Rscript SF_mutation_script.R
.
This generates fig_s1f.jpg
, fig_s1g.jpg
and shared_jxn_df.txt
. shared_jxn_df.txt
is used downstream for SF_sharedness_plots.R
.
If run interactively, also presents statistics for junction comparisons between patients with and without splicing factor mutations across cancers.
- Run
SF_sharedness_plots.R
.
This generates fig_s1h.jpg
and fig_s1i.jpg
. If run interactively, also presents statistics for jxn comparisons between patients with and without splicing factor mutations across cancers.
Fig S1J:
python3 figS1J_S3B_SRA_cancer_shared_prevalence.py --snaptron-results SNAPTRON_CANCER_DIR -e SNAPTRON_CANCER_EXPTLIST_DIR -o FIGURE_OUTPUT_DIR -d PREVALENCE_FILE_DIR
Fig S2A:
python3 figS2A_full_TCGA_SRA_heatmap.py -d PREVALENCE_FILE_DIR -o FIGURE_OUTPUT_DIR --snaptron-results SNAPTRON_NONCANCER_DIR -e SNAPTRON_NONCANCER_EXPTLIST_DIR
Fig S2B:
python3 figS2B_SKCM_jx_similarity.py --snaptron-results SNAPTRON_NONCANCER_DIR -d ALL_JXS_PER_SAMPLE_DIR -o FIGURE_OUTPUT_DIR
Fig S3B:
python3 figS1J_S3B_SRA_cancer_shared_prevalence.py --snaptron-results SNAPTRON_CANCER_DIR -e SNAPTRON_CANCER_EXPTLIST_DIR -o FIGURE_OUTPUT_DIR -d UNEXPLAINED_PIECHART_DIR --unexplained-junctions
Fig S3C and S3D:
python3 figS1CD_S3CD_junction_sharedness.py -d UNEXPLAINED_PIECHART_DIR -o FIGURE_OUTPUT_DIR
Fig S3G:
python3 figS3G_prepare_survival_data.py -d DB_DIR -o FIGURE_OUTPUT_DIR -P TCGA_PHEN
Rscript figS3G_survival_curve.R FIGURE_OUTPUT_DIR