RNAsum data processing workflow

This document describes the main workflow components involved in (1) collection of read counts and gene fusion data, (2) read counts data processing, (3) integration with WGS-based data (processed using the umccrise pipeline), (4) results annotation and (5) presentation in the Patient Transcriptome Summary report.

1. Data collection

Read count data from the patient sample are collected from the bcbio-nextgen RNA-seq or DRAGEN RNA pipeline.

2. Data processing

Counts processing

The read count data (see Input data section in the main page) in abundance.tsv or quant.sf quantification files from kallisto or salmon, respectively, are processed following the steps illustrated in Figure 1 and described below.

Figure 1

Counts processing scheme.

Data collection

(Figure 1A)

Load read count files from the following three sets of data:
1. patient sample (see Input data section in the main page)
2. external reference cohort (TCGA, available cancer types are listed in TCGA projects summary table) corresponding to the patient cancer sample
3. UMCCR internal reference set of in-house pancreatic cancer samples (regardless of the patient sample origin; see Input data section in the main page)

Transformation

(Figure 1B)

Subset datasets to include common genes
Combine patient sample and internal reference dataset
Convert counts to CPM (Counts Per Million; default) or TPM (Transcripts Per Kilobase Million) values in:
1. sample + internal reference set
2. external reference set
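
The transformation steps above can be sketched as follows. This is a minimal illustration in plain Python, not the RNAsum implementation: the function names and the toy counts are hypothetical, and the library size is assumed to be the sum of counts over the genes kept after subsetting. TPM is not shown because it additionally needs transcript lengths.

```python
def combine_on_common_genes(a, b):
    """Merge two {gene: {sample: count}} tables, keeping only shared genes."""
    shared = sorted(set(a) & set(b))
    return {gene: {**a[gene], **b[gene]} for gene in shared}


def counts_to_cpm(counts):
    """Scale each sample's counts so that they sum to one million."""
    samples = sorted({s for per_gene in counts.values() for s in per_gene})
    # Library size: total counts per sample over the genes in the table
    lib_size = {s: sum(per_gene[s] for per_gene in counts.values()) for s in samples}
    return {
        gene: {s: per_gene[s] / lib_size[s] * 1e6 for s in samples}
        for gene, per_gene in counts.items()
    }


# Hypothetical toy counts, for illustration only
patient = {"GENE_A": {"patient": 150}, "GENE_B": {"patient": 50}, "GENE_C": {"patient": 10}}
internal = {"GENE_A": {"ref1": 100, "ref2": 300}, "GENE_B": {"ref1": 100, "ref2": 100}}

combined = combine_on_common_genes(patient, internal)  # GENE_C is dropped
cpm = counts_to_cpm(combined)
```

The same conversion would be applied separately to the external reference set, as listed above.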

Filtering (optional)

(Figure 1C)

Filter out genes with low counts (CPM or TPM < 1 in more than 90% of samples) in:
1. sample + internal reference set
2. external reference set
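
As a rough sketch of the filtering rule above (values below 1 in more than 90% of samples), assuming expression values are held as a `{gene: {sample: value}}` mapping. The function name, parameter names and toy values are hypothetical.

```python
def filter_low_expression(expr, min_value=1.0, max_low_fraction=0.9):
    """Drop genes whose value is below min_value in more than max_low_fraction of samples."""
    kept = {}
    for gene, values in expr.items():
        n_low = sum(1 for v in values.values() if v < min_value)
        if n_low / len(values) <= max_low_fraction:
            kept[gene] = values
    return kept


# Hypothetical CPM values across 10 samples
expr = {
    "GENE_A": {f"s{i}": 5.0 for i in range(10)},                       # expressed everywhere
    "GENE_B": {f"s{i}": (0.2 if i < 8 else 3.0) for i in range(10)},  # low in 80% of samples
    "GENE_C": {f"s{i}": 0.1 for i in range(10)},                       # low in 100% of samples
}

kept = filter_low_expression(expr)  # GENE_C is removed
```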

Normalisation (optional)

(Figure 1D)

Normalise data (see Arguments section in the main page for available options) for sample-specific effects in:
1. sample + internal reference set
2. external reference set

Combination

(Figure 1E)

Subset datasets to include common genes
Combine sample + internal reference set with external reference set

Batch-effects correction (optional)

(Figure 1F)

Treat the patient sample + internal reference set (regardless of the patient sample origin) as one batch, since both sets are processed with the same pipeline, and the corresponding TCGA dataset as another batch. The objective is to remove variation in the data that is due to technical factors.
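
The correction method itself is not specified here. To illustrate the two-batch idea only, the sketch below applies a naive per-gene location adjustment: each batch is shifted so that its mean matches the overall mean. This is a deliberately simplified stand-in, not the method used by RNAsum, and it would also remove genuine biological differences between batches. All names and values are hypothetical.

```python
from statistics import mean


def center_batches(values, batches):
    """Shift each batch of one gene's values so that batch means equal the overall mean.

    values:  {sample: expression value} for a single gene
    batches: {sample: batch label}
    """
    overall = mean(values.values())
    batch_means = {
        label: mean(v for s, v in values.items() if batches[s] == label)
        for label in set(batches.values())
    }
    return {s: v - batch_means[batches[s]] + overall for s, v in values.items()}


# Hypothetical values for one gene
values = {"patient": 5.0, "ref1": 7.0, "tcga1": 10.0, "tcga2": 12.0}
batches = {"patient": "internal", "ref1": "internal", "tcga1": "TCGA", "tcga2": "TCGA"}

adjusted = center_batches(values, batches)
```

After the adjustment both batches share the same mean for this gene, while the differences between samples within a batch are unchanged.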

Data scaling

The processed count data are scaled to facilitate interpretation of expression values. The data are scaled either gene-wise (Z-score transformation, default) or group-wise (centering).

Gene-wise

Z-scores make expression values comparable by expressing each observation in multiples of the standard deviation of the given set of values. The gene-wise Z-score transformation procedure is illustrated in Figure 2 and described below.

Figure 2

Gene-wise Z-score transformation scheme.

Extract expression values across all samples for a given gene (Figure 2A)
Compute Z-scores for individual samples (see equation in Figure 2B)
Compute median Z-scores for (Figure 2C):
1. internal reference set*
2. external reference set
Present patient sample Z-score in the context of the reference cohorts' median Z-scores (Figure 2D)

* used only for pancreatic cancer patients
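
The gene-wise steps can be sketched for a single gene as below. The equation in Figure 2B is not reproduced in this text, so the standard Z-score, (value - mean) / standard deviation, is assumed, using the sample standard deviation. Function names and values are hypothetical.

```python
from statistics import mean, median, stdev


def gene_zscores(values):
    """Z-score one gene's values across all samples: (x - mean) / sd."""
    mu = mean(values.values())
    sd = stdev(values.values())  # sample standard deviation (assumption)
    return {s: (v - mu) / sd for s, v in values.items()}


def group_median(zscores, samples):
    """Median Z-score over the samples belonging to one reference set."""
    return median(zscores[s] for s in samples)


# Hypothetical expression values for one gene across all samples
values = {"patient": 8.0, "int1": 4.0, "int2": 6.0, "tcga1": 2.0, "tcga2": 4.0, "tcga3": 6.0}

z = gene_zscores(values)
internal_median = group_median(z, ["int1", "int2"])
external_median = group_median(z, ["tcga1", "tcga2", "tcga3"])
```

The patient value `z["patient"]` is then read against `internal_median` and `external_median`.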

Group-wise

The group-wise centering approach is presented in Figure 3 and described below.

Figure 3

Group-wise centering scheme.

Extract expression values for (Figure 3A):
1. patient sample
2. internal reference set*
3. external reference set
For each gene compute median expression value in (Figure 3B):
1. internal reference set*
2. external reference set
Center the median expression values for each gene in individual groups (Figure 3C)
Present patient sample centered expression values in the context of the reference cohorts' centered values (Figure 3D)

* used only for pancreatic cancer patients
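
One possible reading of the group-wise steps is sketched below: each group is reduced to a per-gene median profile, and each profile is then centered on its own mean across genes. This interpretation, the function name and the toy values are assumptions, and the patient sample is treated as a group of one.

```python
from statistics import mean, median


def groupwise_center(expr, groups):
    """Per-gene medians within each group, then center each group's profile.

    expr:   {gene: {sample: value}}
    groups: {group name: [sample names]}
    """
    centered = {}
    for name, samples in groups.items():
        # Per-gene median within the group (Figure 3B)
        profile = {gene: median(vals[s] for s in samples) for gene, vals in expr.items()}
        # Center the group's profile on its mean across genes (Figure 3C)
        mu = mean(profile.values())
        centered[name] = {gene: v - mu for gene, v in profile.items()}
    return centered


# Hypothetical expression values
expr = {
    "GENE_A": {"patient": 9.0, "tcga1": 4.0, "tcga2": 6.0},
    "GENE_B": {"patient": 3.0, "tcga1": 5.0, "tcga2": 7.0},
}
groups = {"patient": ["patient"], "external": ["tcga1", "tcga2"]}

centered = groupwise_center(expr, groups)
```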

3. Integration with WGS-based results

For patients with available WGS data processed using the umccrise pipeline (see --umccrise argument), the expression level information for mutated genes and for genes located within detected structural variants (SVs) or copy-number (CN) altered regions, as well as the genome-based findings, are incorporated and used as the primary source for prioritising expression profiles.

Somatic SNVs and small indels

Check if PCGR output file (see example) is available
Extract expression level information and genome-based findings for genes with detected genomic variants (use --pcgr_tier argument to define the [tier](https://pcgr.readthedocs.io/en/latest/tier_systems.html#tier-model-2-pcgr-acmg) threshold value)
Order genes by increasing variant [tier](https://pcgr.readthedocs.io/en/latest/tier_systems.html#tier-model-2-pcgr-acmg) and then by decreasing absolute difference between expression levels in the patient sample and the corresponding reference cohort
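
The ordering rule above amounts to a two-key sort. The sketch below assumes each gene is a record with a variant tier and an expression difference, and that the tier threshold keeps variants with a tier at or below the given value. The record fields, the function name and that threshold semantics are assumptions made for illustration.

```python
def prioritise_by_tier(records, max_tier):
    """Keep variants up to max_tier, then sort by tier and by decreasing |difference|."""
    kept = [r for r in records if r["tier"] <= max_tier]
    return sorted(kept, key=lambda r: (r["tier"], -abs(r["diff"])))


# Hypothetical genes with variant tier and patient-vs-cohort expression difference
records = [
    {"gene": "GENE_A", "tier": 2, "diff": 3.0},
    {"gene": "GENE_B", "tier": 1, "diff": -0.5},
    {"gene": "GENE_C", "tier": 1, "diff": 2.0},
    {"gene": "GENE_D", "tier": 4, "diff": 5.0},
]

ranked = [r["gene"] for r in prioritise_by_tier(records, max_tier=3)]
```

Taking the absolute value means that strongly down-regulated genes rank as high as strongly up-regulated ones within the same tier.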

Structural variants

Check if Manta output file (see example) is available
Extract expression level information and genome-based findings for genes located within detected SVs
Order genes by increasing SV score and then by decreasing absolute difference between expression levels in the patient sample and the corresponding reference cohort
Compare gene fusions detected in WTS data (arriba and pizzly) and WGS data (Manta)
Prioritise WGS-supported gene fusions
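
A minimal sketch of the fusion comparison, assuming that a WTS fusion counts as WGS-supported when the same pair of genes is reported for an SV, regardless of gene order. How RNAsum actually matches fusion calls to Manta SVs is not described here, so this matching rule, the function name and the gene pairs are hypothetical.

```python
def prioritise_fusions(wts_fusions, wgs_gene_pairs):
    """Flag WTS fusions whose gene pair also appears among WGS SVs, supported ones first."""
    wgs = {frozenset(pair) for pair in wgs_gene_pairs}
    annotated = [
        {"genes": fusion, "wgs_supported": frozenset(fusion) in wgs}
        for fusion in wts_fusions
    ]
    # sorted() is stable, so the original order is kept within each group
    return sorted(annotated, key=lambda f: not f["wgs_supported"])


# Hypothetical fusion calls (WTS) and SV gene pairs (WGS)
wts_fusions = [("GENE_A", "GENE_B"), ("GENE_C", "GENE_D"), ("GENE_E", "GENE_F")]
wgs_gene_pairs = [("GENE_D", "GENE_C")]

ranked_fusions = prioritise_fusions(wts_fusions, wgs_gene_pairs)
```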

Somatic CNVs

Check if PURPLE output file (see example) is available
Extract expression level information and genome-based findings for genes located within detected CNVs (use --cn_loss and --cn_gain arguments to define CN threshold values to classify genes within lost and gained regions)
Order genes by increasing (for genes within lost regions) or decreasing (for genes within gained regions) CN and then by decreasing absolute difference between expression levels in the patient sample and the corresponding reference cohort
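
The classification and ordering of CN-altered genes could look like the sketch below. The thresholds are treated as plain copy-number values with inclusive comparisons, which is an assumption about how --cn_loss and --cn_gain are interpreted. The threshold values, record fields and function name are hypothetical.

```python
def prioritise_cn_genes(records, cn_loss, cn_gain):
    """Split genes into lost and gained regions and order each group."""
    lost = [r for r in records if r["cn"] <= cn_loss]
    gained = [r for r in records if r["cn"] >= cn_gain]
    # Lost: increasing CN; gained: decreasing CN; ties by decreasing |difference|
    lost.sort(key=lambda r: (r["cn"], -abs(r["diff"])))
    gained.sort(key=lambda r: (-r["cn"], -abs(r["diff"])))
    return lost, gained


# Hypothetical genes with copy number and patient-vs-cohort expression difference
records = [
    {"gene": "G1", "cn": 0.5, "diff": 1.0},
    {"gene": "G2", "cn": 1.0, "diff": -3.0},
    {"gene": "G3", "cn": 0.5, "diff": -2.0},
    {"gene": "G4", "cn": 6.0, "diff": 1.0},
    {"gene": "G5", "cn": 8.0, "diff": 0.5},
    {"gene": "G6", "cn": 2.0, "diff": 4.0},  # neither lost nor gained
]

lost, gained = prioritise_cn_genes(records, cn_loss=1.0, cn_gain=5.0)
lost_order = [r["gene"] for r in lost]
gained_order = [r["gene"] for r in gained]
```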

4. Results annotation

WTS- and/or WGS-based results for the altered genes are collated with knowledge derived from in-house resources and public databases (listed below) to provide additional sources of evidence for their significance, e.g. to flag variants with clinical significance or potential druggable targets.

Key cancer genes

The UMCCR key cancer genes set is built from several sources:
- Cancermine with at least 2 publications with at least 3 citations
- NCG known cancer genes
- Tier 1 COSMIC Cancer Gene Census (CGC)
- CACAO hotspot genes (curated from ClinVar, CiViC, Cancer Hotspots)
- At least 2 matches in the following 5 sources and 8 clinical panels:
  - Cancer predisposition genes (CPSR list)
  - COSMIC Cancer Gene Census (tier 2)
  - AstraZeneca 300 (AZ300)
  - Familial Cancer
  - OncoKB annotated
  - MSKC-IMPACT
  - MSKC-Heme
  - PMCC-CCP
  - Illumina-TS500
  - TEMPUS
  - Foundation One
  - Foundation Heme
  - Vogelstein
Used for extracting expression levels of cancer genes (presented in the Cancer genes report section)
Used to prioritise candidate fusion genes
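
The selection rule in the list above can be read as a union of the first four sources plus any gene supported by at least 2 of the remaining lists. The sketch below follows that reading, which is an assumption. The source names, gene names and memberships are placeholders and do not reflect the real lists.

```python
from collections import Counter


def key_cancer_genes(direct_sources, supporting_sources, min_matches=2):
    """Union of directly included sources plus genes found in >= min_matches supporting lists."""
    selected = set().union(*direct_sources.values())
    # Count, per gene, how many supporting lists mention it
    hits = Counter(g for genes in supporting_sources.values() for g in set(genes))
    selected |= {g for g, n in hits.items() if n >= min_matches}
    return selected


# Placeholder memberships, for illustration only
direct_sources = {"source_1": {"GENE_A"}, "source_2": {"GENE_A", "GENE_B"}}
supporting_sources = {
    "panel_1": {"GENE_C", "GENE_D"},
    "panel_2": {"GENE_C"},
    "panel_3": {"GENE_E"},
}

key_genes = key_cancer_genes(direct_sources, supporting_sources)
```

Here GENE_C is included because two panels list it, while GENE_D and GENE_E appear in one panel each and are left out.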

OncoKB

The OncoKB gene list is used to annotate altered genes across various sections in the report (annotations and URL links in External resources column in report Summary tables)

VICC

The Variant Interpretation for Cancer Consortium (VICC) knowledgebase is used to annotate altered genes across various sections in the report (annotations and URL links in External resources column in report Summary tables)

CIViC

The Clinical Interpretation of Variants in Cancer (CIViC) database is used to annotate altered genes across various sections in the report (annotations and URL links in External resources column in report Summary tables)
Used to flag clinically actionable aberrations in the Drug matching report section

CGI

The Cancer Genome Interpreter (CGI) database is used to flag genes known to be involved in gene fusions and to prioritise candidate fusion genes

FusionGDB

The FusionGDB database is used to flag genes known to be involved in gene fusions and to prioritise candidate gene fusions

5. Report generation

The final HTML-based Patient Transcriptome Summary report contains searchable tables and interactive plots presenting expression levels of altered genes, as well as links to public resources providing additional sources of evidence for their significance. The individual report sections, results prioritisation and visualisation are described in more detail in report_structure.md.
