Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Could dorado --estimate-poly-a provides the start and end of the estimated polyA(T) tail positions? #1235

Open
abcdtree opened this issue Jan 28, 2025 · 5 comments
Labels
enhancement New feature or request polyA Issue related to polyA tail estimation

Comments

@abcdtree
Copy link

Issue Report

Please describe the issue:

Just to confirm whether there is a way to get the start and end positions of the estimated polyA(T) tails from the basecalling output which corresponding to the pt:i tag result.

Thanks,

Josh

@malton-ont
Copy link
Collaborator

Hi @abcdtree,

This information is not output to the BAM file at present - we can discuss this internally. It is possible to extract this from the verbose logs (obtained by running with -vv 2> log.log). Be aware that this can generate a lot of output though!

@malton-ont malton-ont added enhancement New feature or request polyA Issue related to polyA tail estimation labels Jan 29, 2025
@kpors
Copy link

kpors commented Feb 4, 2025

Hi @malton-ont

Thanks for the tip with -vv 2> log.log!
I am also interested in locating the polyA tail.

Below is an example from my log from a read.
Am I understanding the log right, if I conclude the polyA signal starts at position 69 and ends at 1819, and the length estimation is 19 bases?
Additionally, does trim 2300 mean that the first 2300 signals are trimmed away? I would be surprise because I asked dorado not to trim my reads (--no-trim). I need the data from the adapter.

In the log.log I find this information:
[2025-02-04 09:46:15.628] [trace] found intervals 69-1819,
[2025-02-04 09:46:15.730] [trace] clustered intervals 69-1819,
[2025-02-04 09:46:15.730] [trace] filtered intervals 69-1819,
[2025-02-04 09:46:15.730] [trace] Anchor 0 Range 69 1819
[2025-02-04 09:46:15.730] [trace] 003dfa70-b90a-42d6-a9ac-7c327d21dfbc PolyA bases 19, signal anchor 0 Signal range is 69 1819 Signal length 1743, samples/base 93.78063 trim 2300 read len 16

Thanks,
Kristoffer

@malton-ont
Copy link
Collaborator

@kpors,

Yes, this is the relevant line:

[2025-02-04 09:46:15.730] [trace] 003dfa70-b90a-42d6-a9ac-7c327d21dfbc PolyA bases 19, signal anchor 0 Signal range is 69 1819 Signal length 1743, samples/base 93.78063 trim 2300 read len 16

If you're trying to map this back to your pod5 file, you need to add the trimmed samples value on to the anchor and range values. I assume you are running an RNA model - dorado always trims adapters for the RNA model (regardless of the trim settings), since the adapter is made of DNA and can't be sensibly basecalled by the RNA model. From the CLI help:

  --trim                      Specify what to trim. Options are 'none', 'all', 'adapters', and 'primers'.
                                  Default behaviour is to trim all detected adapters, primers, or barcodes.
                                  Choose 'adapters' to just trim adapters.
                                  The 'primers' choice will trim adapters and primers, but not barcodes.
                                  The 'none' choice is equivelent to using --no-trim.
                                  Note that this only applies to DNA. RNA adapters are always trimmed.  <-----
                                  [nargs=0..1] [default: ""]

@kpors
Copy link

kpors commented Feb 5, 2025

Thanks @malton-ont

Just to be sure, that I understand the line below correct:

PolyA bases 19: PolyA tail 19 bp
signal anchor 0: Anchor position is 0
Signal range is 69 1819: PolyA signal starts at position 69 and ends at position 1819
Signal length 1743: PolyA 1743 samples/data points long? But 1819-69 is 1750 - I do not understand this
samples/base 93.78063: Speed of the sequencing
trim 2300: 2300 samples/data points before anchor point trimmed away? The adapter sequence that is trimmed away?
read len 16: ?

[2025-02-04 09:46:15.730] [trace] 003dfa70-b90a-42d6-a9ac-7c327d21dfbc PolyA bases 19, signal anchor 0 Signal range is 69 1819 Signal length 1743, samples/base 93.78063 trim 2300 read len 16

Thanks,
Kristoffer

@malton-ont
Copy link
Collaborator

Hi @kpors,

[2025-02-04 09:46:15.730] [trace] 003dfa70-b90a-42d6-a9ac-7c327d21dfbc PolyA bases 19, signal anchor 0 Signal range is 69 1819 Signal length 1743, samples/base 93.78063 trim 2300 read len 16

trim 2300: From the signal in the original pod5 file, 2300 samples were removed
signal anchor 0: In the remaining signal, the anchor point around which we search for a polyA signal begins at sample 0 - this is typically 0 for RNA (where the anchor point is the same as the adapter trim point) but may differ for cDNA or plasmids
Signal range is 69 1819: The region in the remaining signal that we determined to be characteristic of polyA signal
Signal length 1743: We then take that region and apply some empirical calibration coefficients (particularly in RNA, short polyA tails tend to be over-called so we artificially reduce these - see the code here)
read len 16: The number of bases in the called sequence. This includes any bases called as part of the polyA tail, but note that the number of A-bases in the actual sequence may not match the value calculated by the polyA estimation. You can see this very well here, where dorado has determined that the polyA tail is 19 bases long, even though the actual basecalled sequence is only 16 bases in its entirety.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request polyA Issue related to polyA tail estimation
Projects
None yet
Development

No branches or pull requests

3 participants