File "salmon.merged.gene_counts.tsv" Contains Floating-Point Values Instead of Raw Integer Counts #1491

Aeget1000 · 2025-01-22T16:19:31Z

Description of the bug

Dear nf-core team,

First of all, I would like to express my gratitude for this incredible community and the nf-core initiative as a cornerstone for reproducibility in computational biology.

I am reaching out to highlight an issue I encountered in versions 3.12.0 and 3.18.0 of the nf-core/rnaseq pipeline.

Specifically, the file "salmon.merged.gene_counts.tsv" does not appear to contain raw integer counts. Instead, it contains what seem to be floating-point values, as illustrated in the screenshot below:

I have attached the relevant files below (v 3.18.0) and am working with the following sample:
SRX1874029 - NCBI SRA

Thank you for your attention to this matter. I look forward to your insights and any potential solutions.

Best regards,
Christian Andersen

Command used and terminal output

Input/output options
  input          : /work/samplesheet_1_sample_peters.csv
  outdir         : results_peters_1_sample_test

Reference genome options
  genome         : GRCh38
  fasta          : s3://ngi-igenomes/igenomes//Homo_sapiens/NCBI/GRCh38/Sequence/WholeGenomeFasta/genome.fa
  gtf            : s3://ngi-igenomes/igenomes//Homo_sapiens/NCBI/GRCh38/Annotation/Genes/genes.gtf
  gene_bed       : s3://ngi-igenomes/igenomes//Homo_sapiens/NCBI/GRCh38/Annotation/Genes/genes.bed
  star_index     : s3://ngi-igenomes/igenomes//Homo_sapiens/NCBI/GRCh38/Sequence/STARIndex/

Alignment options
  pseudo_aligner : salmon

Process skipping options
  skip_gtf_filter: true
  skip_alignment : true
  skip_deseq2_qc : true

Core Nextflow options
  revision       : 3.18.0
  runName        : clever_meninsky
  launchDir      : /work
  workDir        : /tmp/work
  projectDir     : /home/ucloud/.nextflow/assets/nf-core/rnaseq
  userName       : ucloud
  profile        : mamba
  configFiles    :

Relevant files

JobParameters.json
stdout.txt
samplesheet_1_sample_peters.csv

System information

Hardware: Ucloud
OS: Linux

davidecarlson · 2025-01-28T16:36:41Z

Hi @Aeget1000,

It is expected that the read counts in the salmon.merged.gene_counts.tsv output file are not integers. This is how salmon reports its results in the quant.sf and quant.genes.sf files for each sample.

See the Salmon docs:

NumReads — This is salmon’s estimate of the number of reads mapping to each transcript that was quantified. It is an “estimate” insofar as it is the expected number of reads that have originated from each transcript given the structure of the uniquely mapping and multi-mapping reads and the relative abundance estimates for each transcript.
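To see this in a per-sample file, a short check along the following lines may help. It is only a sketch: it assumes the usual quant.sf column layout (Name, Length, EffectiveLength, TPM, NumReads), the rows are made-up illustrative values, and `fractional_num_reads` is a hypothetical helper name, not part of salmon or the pipeline.

```python
import csv
import io

# Made-up rows in the quant.sf column layout; the values are illustrative only.
EXAMPLE_QUANT_SF = (
    "Name\tLength\tEffectiveLength\tTPM\tNumReads\n"
    "tx1\t1500\t1320.5\t10.0\t37.25\n"
    "tx2\t900\t720.5\t5.0\t12.0\n"
    "tx3\t2100\t1920.5\t2.5\t0.75\n"
)


def fractional_num_reads(handle):
    """Return {transcript name: NumReads} for rows whose NumReads is not a whole number."""
    reader = csv.DictReader(handle, delimiter="\t")
    result = {}
    for row in reader:
        value = float(row["NumReads"])
        if not value.is_integer():
            result[row["Name"]] = value
    return result


fractional = fractional_num_reads(io.StringIO(EXAMPLE_QUANT_SF))
print(fractional)  # {'tx1': 37.25, 'tx3': 0.75}
```

To run it on a real file, pass an open file handle instead of the `io.StringIO` wrapper.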

Also, see this comment from Rob Patro, the main Salmon developer, on the rationale behind this (emphasis added):

Regarding outputting "original read counts"; salmon does output the estimates for the number of reads deriving from each transcript. If the question is, why is this number not an integer, that's because the best estimate (the maximum likelihood estimate) is often not integral. Tools that simply count reads (e.g. HTSeq) produce integer counts, but these are in no way "original read counts" for the corresponding genes, and are usually less accurate (farther from the true number of fragments deriving from a transcript / gene) than the estimates produced by salmon. The fact that the best estimate is often not an integer is a direct result of the fact one is considering a statistical model and taking expectations.
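A toy calculation can make the point about expectations concrete. The sketch below is not salmon's actual algorithm: it assumes just two transcripts of equal effective length, ignores bias models, and uses a hypothetical function name. It only shows how splitting multi-mapping reads in proportion to the estimated abundances leads to non-integer expected counts.

```python
def expected_counts(unique_a, unique_b, ambiguous, iterations=100):
    """Toy EM for two transcripts A and B of equal effective length.

    unique_a / unique_b: reads that map only to A or only to B.
    ambiguous: reads that map equally well to both transcripts.
    Returns the expected number of reads assigned to A and to B.
    """
    share_a = 0.5  # initial guess: both transcripts equally abundant
    for _ in range(iterations):
        # E-step: split the ambiguous reads according to the current abundance share.
        count_a = unique_a + ambiguous * share_a
        count_b = unique_b + ambiguous * (1.0 - share_a)
        # M-step: re-estimate the abundance share from the expected counts.
        share_a = count_a / (count_a + count_b)
    return count_a, count_b


count_a, count_b = expected_counts(unique_a=3, unique_b=1, ambiguous=2)
print(round(count_a, 3), round(count_b, 3))  # 4.5 1.5
```

With 3 reads unique to A, 1 unique to B, and 2 ambiguous reads, the estimates settle at 4.5 and 1.5. Each total is a sum of whole reads and fractional shares of the ambiguous ones, so it is not an integer.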

Best,
Dave

Aeget1000 · 2025-01-29T14:26:10Z

Hi @davidecarlson,

Thank you for your response.

It now makes sense why the salmon.merged.gene_counts.tsv output contains fractional estimates rather than integers.

Would it be possible to add this information to the following documentation page? https://nf-co.re/rnaseq/3.18.0/docs/output/#pseudoalignment-and-quantification

Otherwise, others might be confused by the statement: "Matrix of gene-level raw counts across all samples."

Best,
Christian Andersen

Aeget1000 added the "bug" (Something isn't working) label on Jan 22, 2025
