RNAsum data processing workflow

This document describes the main workflow components involved in (1) collection of read count and gene fusion data, (2) read count data processing, (3) integration with WGS-based data (processed using the umccrise pipeline), (4) results annotation and (5) presentation in the Patient Transcriptome Summary report.



1. Data collection

Read count data from the patient sample are collected from the bcbio-nextgen RNA-seq or DRAGEN RNA pipeline.

2. Data processing

Counts processing

The read count data (see Input data section in the main page), provided in abundance.tsv or quant.sf quantification files from kallisto or salmon, respectively, are processed following the steps illustrated in Figure 1 and described below.

Figure 1

Counts processing scheme.

Data collection

(Figure 1A)

  • Load read count files from the following three sets of data:

    1. patient sample (see Input data section in the main page)
    2. external reference cohort (TCGA, available cancer types are listed in TCGA projects summary table) corresponding to the patient cancer sample
    3. UMCCR internal reference set of in-house pancreatic cancer samples (regardless of the patient sample origin; see Input data section in the main page)

Transformation

(Figure 1B)

  • Subset datasets to include common genes
  • Combine patient sample and internal reference dataset
  • Convert counts to CPM (Counts Per Million; default) or TPM (Transcripts Per Million) values in:
    1. sample + internal reference set
    2. external reference set
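The conversion step can be illustrated with a minimal sketch, assuming the standard definitions of CPM and TPM. The function names, gene names and the `lengths_kb` mapping (gene length in kilobases) are hypothetical and are not taken from RNAsum itself.

```python
def counts_to_cpm(counts):
    """Convert a gene -> raw count mapping for one sample into CPM values."""
    library_size = sum(counts.values())
    if library_size == 0:
        raise ValueError("sample has no counts")
    return {gene: count * 1_000_000 / library_size for gene, count in counts.items()}


def counts_to_tpm(counts, lengths_kb):
    """Convert raw counts into TPM using gene lengths in kilobases (assumed input)."""
    # Length-normalise first, then rescale so that the values sum to one million.
    rates = {gene: count / lengths_kb[gene] for gene, count in counts.items()}
    total_rate = sum(rates.values())
    if total_rate == 0:
        raise ValueError("sample has no counts")
    return {gene: rate * 1_000_000 / total_rate for gene, rate in rates.items()}


# Hypothetical example data for a single sample.
sample_counts = {"GENE_A": 500, "GENE_B": 1500, "GENE_C": 8000}
gene_lengths_kb = {"GENE_A": 1.0, "GENE_B": 3.0, "GENE_C": 4.0}

cpm = counts_to_cpm(sample_counts)
tpm = counts_to_tpm(sample_counts, gene_lengths_kb)
```

In both cases the values of one sample sum to one million, which is what makes samples with different sequencing depths comparable.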

Filtering (optional)

(Figure 1C)

  • Filter out genes with low counts (CPM or TPM < 1 in more than 90% of samples) in:
    1. sample + internal reference set
    2. external reference set
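The filtering rule above can be sketched as follows. This is only an illustration of the stated threshold (value < 1 in more than 90% of samples); the function name, parameter names and example data are hypothetical.

```python
def filter_low_expressed(expr, min_value=1.0, max_low_fraction=0.9):
    """Drop genes whose CPM/TPM is below min_value in more than max_low_fraction of samples.

    expr maps each gene to a list with one CPM or TPM value per sample.
    """
    kept = {}
    for gene, values in expr.items():
        low_samples = sum(1 for value in values if value < min_value)
        if low_samples / len(values) > max_low_fraction:
            continue  # low in more than 90% of samples: filter out
        kept[gene] = values
    return kept


# Hypothetical data set with 10 samples.
expression = {
    "GENE_A": [0.0] * 10,              # low in all samples, removed
    "GENE_B": [0.2] * 5 + [12.0] * 5,  # low in half of the samples, kept
    "GENE_C": [35.0] * 10,             # expressed in all samples, kept
}
filtered = filter_low_expressed(expression)
```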

Normalisation (optional)

(Figure 1D)

  • Normalise data (see Arguments section in the main page for available options) for sample-specific effects in:
    1. sample + internal reference set
    2. external reference set

Combination

(Figure 1E)

  • Subset datasets to include common genes
  • Combine sample + internal reference set with external reference set
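A minimal sketch of the subsetting and combination steps, assuming each dataset is held as a mapping from gene to a list of per-sample values. The function name and the example data are hypothetical.

```python
def combine_on_common_genes(first, second):
    """Keep only genes present in both datasets and join their per-sample values."""
    common_genes = sorted(first.keys() & second.keys())
    return {gene: first[gene] + second[gene] for gene in common_genes}


# Hypothetical data: patient sample + internal reference (3 samples), external reference (2 samples).
sample_and_internal = {"GENE_A": [5.0, 7.0, 6.0], "GENE_B": [1.0, 0.5, 2.0], "GENE_X": [3.0, 3.5, 2.5]}
external = {"GENE_A": [4.0, 8.0], "GENE_B": [1.5, 0.0], "GENE_Y": [9.0, 9.5]}

combined = combine_on_common_genes(sample_and_internal, external)
```

Genes found in only one dataset (`GENE_X` and `GENE_Y` here) are dropped, so every remaining gene has a value for every sample.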

Batch-effects correction (optional)

(Figure 1F)

  • Treat the patient sample + internal reference set (regardless of the patient sample origin) as one batch, since both sets are processed with the same pipeline, and the corresponding TCGA dataset as another batch. The objective is to remove variation in the data that is due to technical factors.
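The two-batch idea can be illustrated with a deliberately simple stand-in technique: per-batch mean centring of a single gene. This is an assumption made for illustration only; the source does not state which batch-correction method RNAsum applies, and the function name, batch labels and values below are hypothetical.

```python
from statistics import mean


def centre_batches(values, batches):
    """Subtract each batch's mean from its members (one gene, one value per sample)."""
    batch_means = {
        batch: mean(v for v, b in zip(values, batches) if b == batch)
        for batch in set(batches)
    }
    return [v - batch_means[b] for v, b in zip(values, batches)]


# Hypothetical expression values of one gene across five samples.
values = [10.0, 12.0, 14.0, 4.0, 6.0]
batches = ["in_house", "in_house", "in_house", "tcga", "tcga"]

corrected = centre_batches(values, batches)
```

After centring, both batches have a mean of zero for this gene, so a systematic offset between the pipelines no longer dominates the comparison. Dedicated batch-correction methods model such effects more carefully than this sketch does.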

Data scaling

The processed count data are scaled to facilitate interpretation of expression values. The data are scaled either gene-wise (Z-score transformation, default) or group-wise (centering).

Gene-wise

Z-scores make observations comparable by expressing them in multiples of the standard deviation of the given sample of values. The gene-wise Z-score transformation procedure is illustrated in Figure 2 and described below.

Figure 2

Gene-wise Z-score transformation scheme.

  • Extract expression values across all samples for a given gene (Figure 2A)

  • Compute Z-scores for individual samples (see equation in Figure 2B)

  • Compute median Z-scores for (Figure 2C):

    1. internal reference set*
    2. external reference set
  • Present patient sample Z-score in the context of the reference cohorts' median Z-scores (Figure 2D)

* used only for pancreatic cancer patients
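The gene-wise procedure can be sketched as below. The equation in Figure 2B is not reproduced in this text, so the sketch assumes the usual definition z = (x - mean) / standard deviation, computed with the sample standard deviation. The group labels and expression values are hypothetical.

```python
from statistics import mean, median, stdev


def gene_z_scores(values):
    """Z-scores for one gene across all samples (assumes the sample standard deviation)."""
    mu = mean(values)
    sd = stdev(values)
    return [(value - mu) / sd for value in values]


def summarise_by_group(z_scores, groups):
    """Median Z-score per group; the patient 'group' holds a single sample."""
    summary = {}
    for group in set(groups):
        summary[group] = median(z for z, g in zip(z_scores, groups) if g == group)
    return summary


# Hypothetical expression values of one gene: patient first, then the two reference sets.
expression = [9.0, 4.0, 5.0, 6.0, 3.0, 4.0, 4.0]
groups = ["patient", "internal", "internal", "internal", "external", "external", "external"]

z_scores = gene_z_scores(expression)
summary = summarise_by_group(z_scores, groups)
```

The patient Z-score can then be read against the median Z-scores of the internal and external reference sets for the same gene.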

Group-wise

The group-wise centering approach is presented in Figure 3 and is described below.

Figure 3

Group-wise centering scheme.

  • Extract expression values for (Figure 3A):

    1. patient sample
    2. internal reference set*
    3. external reference set
  • For each gene compute median expression value in (Figure 3B):

    1. internal reference set*
    2. external reference set
  • Center the median expression values for each gene in individual groups (Figure 3C)

  • Present patient sample centered expression values in the context of the reference cohorts' centered values (Figure 3D)

* used only for pancreatic cancer patients
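A minimal sketch of the group-wise approach follows. It assumes that "centering" means subtracting, within each group, the mean of that group's per-gene values; the source does not spell out the exact operation, so treat this as one plausible reading. All names and values are hypothetical.

```python
from statistics import mean, median


def group_medians(expr):
    """Median expression per gene; expr maps gene -> values for the group's samples."""
    return {gene: median(values) for gene, values in expr.items()}


def centre_group(per_gene_values):
    """Subtract the group's mean across genes from each gene's value (assumed meaning of centering)."""
    group_mean = mean(per_gene_values.values())
    return {gene: value - group_mean for gene, value in per_gene_values.items()}


# Hypothetical data: one patient sample and an external reference set of three samples.
patient = {"GENE_A": 10.0, "GENE_B": 2.0, "GENE_C": 6.0}
external = {
    "GENE_A": [4.0, 5.0, 6.0],
    "GENE_B": [1.0, 2.0, 3.0],
    "GENE_C": [7.0, 8.0, 9.0],
}

centred_patient = centre_group(patient)
centred_external = centre_group(group_medians(external))
```

Each group ends up centred around zero, so the patient value for a gene can be compared with the reference value for the same gene on a common scale.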

3. Integration with WGS-based results

For patients with available WGS data processed using the umccrise pipeline (see --umccrise argument), the expression level information for mutated genes and for genes located within detected structural variants (SVs) or copy-number (CN) altered regions, as well as the genome-based findings, are incorporated and used as the primary source for prioritising expression profiles.

Somatic SNVs and small indels

Structural variants

  • Check if Manta output file (see example) is available
  • Extract expression level information and genome-based findings for genes located within detected SVs
  • Order genes by increasing SV score and then by decreasing absolute difference between expression levels in the patient sample and the corresponding reference cohort
  • Compare gene fusions detected in WTS data (arriba and pizzly) and WGS data (Manta)
  • Prioritise WGS-supported gene fusions
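The gene ordering rule for SV-affected genes can be sketched with a two-part sort key. The record fields (`sv_score`, `patient_expr`, `reference_expr`) and the values are hypothetical and do not reflect the actual Manta or RNAsum column names.

```python
# Hypothetical records for genes located within detected SVs.
sv_genes = [
    {"gene": "GENE_A", "sv_score": 2, "patient_expr": 1.5, "reference_expr": 0.5},
    {"gene": "GENE_B", "sv_score": 1, "patient_expr": 0.5, "reference_expr": 0.0},
    {"gene": "GENE_C", "sv_score": 1, "patient_expr": -2.5, "reference_expr": 0.5},
]


def sv_priority(record):
    """Sort key: increasing SV score, then decreasing absolute expression difference."""
    difference = record["patient_expr"] - record["reference_expr"]
    return (record["sv_score"], -abs(difference))


ordered = sorted(sv_genes, key=sv_priority)
ordered_names = [record["gene"] for record in ordered]
```

Negating the absolute difference lets a single ascending sort handle both criteria: genes with the same SV score are ranked by how strongly their expression departs from the reference cohort, in either direction.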

Somatic CNVs

  • Check if PURPLE output file (see example) is available
  • Extract expression level information and genome-based findings for genes located within detected CNVs (use --cn_loss and --cn_gain arguments to define CN threshold values to classify genes within lost and gained regions)
  • Order genes by increasing (for genes within lost regions) or decreasing (for genes within gained regions) CN and then by decreasing absolute difference between expression levels in the patient sample and the corresponding reference cohort
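The classification and ordering of CN-altered genes can be sketched as follows. The threshold values, the inclusive comparisons (`<=` and `>=`) and the record fields are assumptions made for illustration; the defaults and exact semantics of --cn_loss and --cn_gain are not stated here.

```python
def prioritise_cn_genes(genes, cn_loss, cn_gain):
    """Split genes into lost and gained sets and order each as described above."""
    lost = [g for g in genes if g["cn"] <= cn_loss]    # assumed inclusive threshold
    gained = [g for g in genes if g["cn"] >= cn_gain]  # assumed inclusive threshold
    # Lost regions: increasing CN, then decreasing absolute expression difference.
    lost.sort(key=lambda g: (g["cn"], -abs(g["expr_diff"])))
    # Gained regions: decreasing CN, then decreasing absolute expression difference.
    gained.sort(key=lambda g: (-g["cn"], -abs(g["expr_diff"])))
    return lost, gained


# Hypothetical genes with copy number and patient-vs-reference expression difference.
cn_genes = [
    {"gene": "GENE_A", "cn": 0, "expr_diff": -2.0},
    {"gene": "GENE_B", "cn": 1, "expr_diff": -3.0},
    {"gene": "GENE_C", "cn": 2, "expr_diff": 0.1},
    {"gene": "GENE_D", "cn": 6, "expr_diff": 1.0},
    {"gene": "GENE_E", "cn": 6, "expr_diff": 4.0},
    {"gene": "GENE_F", "cn": 8, "expr_diff": 0.5},
]

# Threshold values chosen for the example only.
lost, gained = prioritise_cn_genes(cn_genes, cn_loss=1, cn_gain=5)
lost_names = [g["gene"] for g in lost]
gained_names = [g["gene"] for g in gained]
```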

4. Results annotation

WTS- and/or WGS-based results for the altered genes are collated with knowledge derived from in-house resources and public databases (listed below) to provide additional sources of evidence for their significance, e.g. to flag variants with clinical significance or potentially druggable targets.

Key cancer genes

OncoKB

  • OncoKB gene list is used to annotate altered genes across various sections in the report (annotations and URL links in External resources column in report Summary tables)

VICC

  • Variant Interpretation for Cancer Consortium (VICC) knowledgebase is used to annotate altered genes across various sections in the report (annotations and URL links in External resources column in report Summary tables)

CIViC

  • The Clinical Interpretation of Variants in Cancer (CIViC) database is used to annotate altered genes across various sections in the report (annotations and URL links in External resources column in report Summary tables)
  • Used to flag clinically actionable aberrations in the Drug matching report section

CGI

FusionGDB

  • FusionGDB database is used to flag genes known to be involved in gene fusions and to prioritise candidate gene fusions

5. Report generation

The final HTML-based Patient Transcriptome Summary report contains searchable tables and interactive plots presenting expression levels of altered genes, as well as links to public resources providing additional sources of evidence for their significance. The individual report sections, results prioritisation and visualisation are described in more detail in report_structure.md.