File "salmon.merged.gene_counts.tsv" Contains Floating-Point Values Instead of Raw Integer Counts #1491

Open
Aeget1000 opened this issue Jan 22, 2025 · 2 comments
Labels
bug Something isn't working

Comments

@Aeget1000

Description of the bug

Dear nf-core team,

First of all, I would like to express my gratitude for this incredible community and the nf-core initiative as a cornerstone for reproducibility in computational biology.

I am reaching out to highlight an issue I encountered in versions 3.12.0 and 3.18.0.

Specifically, the file "salmon.merged.gene_counts.tsv" does not appear to contain raw integer counts. Instead, it contains what seem to be floating-point values, as illustrated in the screenshot below:

Image

I have attached the relevant files below (v 3.18.0) and am working with the following sample:
SRX1874029 - NCBI SRA

Thank you for your attention to this matter. I look forward to your insights and any potential solutions.

Best regards,
Christian Andersen

Command used and terminal output

Input/output options
  input          : /work/samplesheet_1_sample_peters.csv
  outdir         : results_peters_1_sample_test

Reference genome options
  genome         : GRCh38
  fasta          : s3://ngi-igenomes/igenomes//Homo_sapiens/NCBI/GRCh38/Sequence/WholeGenomeFasta/genome.fa
  gtf            : s3://ngi-igenomes/igenomes//Homo_sapiens/NCBI/GRCh38/Annotation/Genes/genes.gtf
  gene_bed       : s3://ngi-igenomes/igenomes//Homo_sapiens/NCBI/GRCh38/Annotation/Genes/genes.bed
  star_index     : s3://ngi-igenomes/igenomes//Homo_sapiens/NCBI/GRCh38/Sequence/STARIndex/

Alignment options
  pseudo_aligner : salmon

Process skipping options
  skip_gtf_filter: true
  skip_alignment : true
  skip_deseq2_qc : true

Core Nextflow options
  revision       : 3.18.0
  runName        : clever_meninsky
  launchDir      : /work
  workDir        : /tmp/work
  projectDir     : /home/ucloud/.nextflow/assets/nf-core/rnaseq
  userName       : ucloud
  profile        : mamba
  configFiles    :

Relevant files

JobParameters.json
stdout.txt
samplesheet_1_sample_peters.csv

System information

Hardware: Ucloud
OS: Linux

Aeget1000 added the bug (Something isn't working) label Jan 22, 2025
@davidecarlson

Hi @Aeget1000,

It is expected that the read counts in the salmon.merged.gene_counts.tsv output file are not integers. This is how Salmon reports its results in the per-sample quant.sf and quant.genes.sf files.

See the Salmon docs:

NumReads — This is salmon’s estimate of the number of reads mapping to each transcript that was quantified. It is an “estimate” insofar as it is the expected number of reads that have originated from each transcript given the structure of the uniquely mapping and multi-mapping reads and the relative abundance estimates for each transcript.

Also, see this comment from Rob Patro, the main Salmon developer, on the rationale behind this (emphasis added):

Regarding outputting "original read counts"; salmon does output the estimates for the number of reads deriving from each transcript. If the question is, why is this number not an integer, that's because the best estimate (the maximum likelihood estimate) is often not integral. Tools that simply count reads (e.g. HTSeq) produce integer counts, but these are in no way "original read counts" for the corresponding genes, and are usually less accurate (farther from the true number of fragments deriving from a transcript / gene) than the estimates produced by salmon. The fact that the best estimate is often not an integer is a direct result of the fact one is considering a statistical model and taking expectations.
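
To make the "taking expectations" point concrete, here is a toy sketch. It is not Salmon's actual algorithm, which also accounts for effective transcript lengths, fragment-level probabilities, and bias models. It only runs a bare-bones expectation-maximization loop over invented reads; the function name `em_expected_counts` and the read data are assumptions made for illustration.

```python
def em_expected_counts(reads, transcripts, n_iter=200):
    """Toy EM: `reads` is a list of tuples naming the transcripts each read is compatible with."""
    # Start from uniform relative abundances.
    theta = {t: 1.0 / len(transcripts) for t in transcripts}
    counts = {t: 0.0 for t in transcripts}
    for _ in range(n_iter):
        counts = {t: 0.0 for t in transcripts}
        # E-step: split each read across its compatible transcripts
        # in proportion to the current abundance estimates.
        for compatible in reads:
            total = sum(theta[t] for t in compatible)
            for t in compatible:
                counts[t] += theta[t] / total
        # M-step: re-estimate abundances from the expected counts.
        theta = {t: counts[t] / len(reads) for t in transcripts}
    return counts


# Invented example: 5 reads unique to A, 2 unique to B, 3 compatible with both.
reads = [("A",)] * 5 + [("B",)] * 2 + [("A", "B")] * 3
counts = em_expected_counts(reads, ["A", "B"])

for name, value in counts.items():
    print(f"{name}\t{value:.2f}")
```

The three ambiguous reads are shared between A and B in proportion to the estimated abundances, so the expected counts settle near 7.14 and 2.86. They still sum to the 10 input reads, but neither is an integer.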

Best,
Dave

@Aeget1000
Author

Hi @davidecarlson

Thank you for your response.

It now makes sense why the salmon.merged.gene_counts.tsv output contains fractional estimates rather than integers.

Would it be possible to add this information to the documentation page at https://nf-co.re/rnaseq/3.18.0/docs/output/#pseudoalignment-and-quantification?

Others might be confused by the statement: "Matrix of gene-level raw counts across all samples."
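
For anyone landing here who needs integer values for a downstream tool, a minimal sketch of rounding the merged matrix might look like the following. The column names `gene_id` and `gene_name`, the helper name `round_count_matrix`, and the toy values are assumptions for illustration, not output from this run. Whether rounding is appropriate depends on the downstream method; importing the Salmon output through tximport is often suggested instead.

```python
import csv
import io

# Toy stand-in for salmon.merged.gene_counts.tsv; all values are invented.
toy_tsv = (
    "gene_id\tgene_name\tSAMPLE_1\n"
    "G1\tgeneA\t7.143\n"
    "G2\tgeneB\t2.857\n"
    "G3\tgeneC\t0\n"
)


def round_count_matrix(handle, id_columns=("gene_id", "gene_name")):
    """Read a tab-separated count matrix and round every non-ID column to an integer."""
    reader = csv.DictReader(handle, delimiter="\t")
    rows = []
    for row in reader:
        rows.append({
            # Note: Python's round() rounds exact halves to the nearest even integer.
            key: value if key in id_columns else round(float(value))
            for key, value in row.items()
        })
    return rows


rounded = round_count_matrix(io.StringIO(toy_tsv))
for row in rounded:
    print(row)
```

To apply this to a real file, pass an open file handle in place of the `io.StringIO` wrapper.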

Best,
Christian Andersen
