Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

dorado for m6A analysis #1170

Closed
baibhav-bioinfo opened this issue Dec 10, 2024 · 15 comments
Closed

dorado for m6A analysis #1170

baibhav-bioinfo opened this issue Dec 10, 2024 · 15 comments

Comments

@baibhav-bioinfo
Copy link

baibhav-bioinfo commented Dec 10, 2024

Hello,
Appreciation for developing the valuable and diverse software.

I am trying to use Dorado for m6A analysis (using DRS data), from site identification to Diff Methylated Rate analysis between conditions with replicates (using modkit).

(1) i wanted to know if
dorado basecaller [email protected] /path/to/pod5
--modified-bases-models [email protected]_m6A_DRACH@v1/ --device cude:all --reference ref.fasta > calls.bam

this is correct approach for alignment along with basecalling.

also as other tools (eg. m6Anet) uses transcriptome rather than genome. which one would be more suitable to align with?.

(2) The resultant bed file (after modkit pileup) contains position with Nmod=0. Why are those in the output when my argument is to detect the m6A mod sites? do i need to filter out those, keeping only Nmod>=1?
also the file have Nother_mod, what are those and why are we getting other mods?
is there any option to get only relevant Nmod rows, in my case A?

@malton-ont
Copy link
Collaborator

Hi @baibhav-bioinfo,

  1. Yes. See the docs here. Regarding which reference to use, bioinformatics questions are better posted on the Nanopore community forums. I'm not familiar with m6Anet.
  2. This question would be better posted on the modkit github page, but at a guess I'd say you are using a model that only calls m6A in a DRACH motif, so perhaps modkit is showing Nmod=0 for A bases not in that context?

@baibhav-bioinfo
Copy link
Author

Okay @malton-ont thanks for the response.
I will see the docs for more information and post in nanopore community for more answers.
Thanks for your insight Regarding the Nmod=0 rows, it makes sense.

@baibhav-bioinfo
Copy link
Author

hello,
I was able to run the Dorado basecall + methylation call with my DRS reads. I ran the command both in m6A_DRACH context and m6A_all context, using the following commands

dorado basecaller [email protected] pod5/ --modified-bases-models [email protected]_m6A_DRACH@v1/ --device cuda:all > sample.bam

dorado basecaller [email protected] pod5/ --modified-bases-models [email protected]_m6A@v1/ --device cuda:all > sample.bam

for the DRACH motif context i got 60,000 sites per sample (10 million reads with ~1000nt per read)
but for all context run, i got >1.6 million m6A sites predicted.

Did something wrong happened in my ALL context run?
Is that number of sites possible?

Please suggest.

@malton-ont
Copy link
Collaborator

Hi @baibhav-bioinfo,

By default dorado outputs predictions in all-context for any site with a >5% chance of being a modification. You can adjust this using the --modified-bases-threshold parameter if you prefer to be more conservative, or I believe modkit has its own filtering functionality which you can ask about on their github page.

@baibhav-bioinfo
Copy link
Author

hi
i ran the dorado command with "--modified-bases-threshold" with different threshold values
with 0.90 i got no m6A sites predicted, then i lowered threshold to 0.45, but got no sites too

with 0.25 i got again >1 million sites.

is there anything wrong i am doing?
following is the command i am using

$dorado basecaller [email protected] pod5/ --modified-bases-models
[email protected]_m6A@v1/ --modified-bases-threshold 0.90 --device cuda:all >
c6_r1_sup_m6A_90.bam

@malton-ont
Copy link
Collaborator

There doesn't look to be anything wrong with your command as far as I can see.

Is 1.6 million sites not reasonable? That's ~30x the number you see for the DRACH context - could it just be that only 1/30 of the A-bases are in a DRACH context?

@baibhav-bioinfo
Copy link
Author

but all the literature and research articles mentions majority of the m6A sites are in DRACH motif, with very few exceptions.

maybe i will need to see other filters, using modkit. I will read more and let you know if find anything.

Thankyou so much for the prompt responses.

@malton-ont
Copy link
Collaborator

The all-context model reports modifications at all A-bases, regardless of the context, so I'd expect more sites to be recorded. From your reading it sounds like these should have a relatively low probability outside the DRACH context, which tracks with you seeing few sites when applying a high threshold value. Beyond this, I think you'll get better answers elsewhere, as these are more bioinformatics questions than issues with dorado itself.

@baibhav-bioinfo
Copy link
Author

You are right, i will write on nanopore community for better explanations.

@baibhav-bioinfo
Copy link
Author

hello, as you mentioned in the previous chats the default threshold for m6A site detection in "all context" is 5% or 0.05.
can we know what is the default threshold for m6A detection in DRACH context?

@malton-ont
Copy link
Collaborator

malton-ont commented Jan 22, 2025

@baibhav-bioinfo,

No thresholds are applied in the case of context-based models. This is because there is no way to distinguish via the MM/ML tags between bases that are skipped because they did not match the context and bases that would be skipped because they did not meet the threshold.

@baibhav-bioinfo
Copy link
Author

baibhav-bioinfo commented Jan 22, 2025

thanks @malton-ont, i will try to understand what you just said.

There is also one question, might be trivial. These are the commands i ran for both DRACH and all context
(1) First ran dorado for basecall the sequences and methylation (without reference)
input: pod5 files
output: bam files
dorado basecaller [email protected] pod5/ --modified-bases-models [email protected]_m6A_DRACH@v1/ --device cuda:all > c6_r1_sup_.bam
(2) then i ran the "dorado aligner" for aligning with reference genome with "-x splice" option for DRS reads
dorado aligner sb_genome.mmi c6_r1_sup_.bam --output-dir aligned_output_dorado_sup --mm2-opts "-x splice -k 14 --junc-bed SbicolorRTx430_552_v2.1.gene.bed"
input: bam files
output: aligned bam files
(3) Using modkit pileup to get bedmethyl files
input: aligned bam files
output: bed files
modkit pileup --threads 96 c6_r2_sup_.bam c6_r2_sup_DRACH_pileup.bed --log-filepath c6_r2_sup_DRACH_pileup.log

I inspected the bam files from dorado. the bam files just from step 1 have the MM/ML tags (which tells us about the m6A sites detected), but the aligned bam files from step 2 doesnt have it. That makes me little confused where did the modification information go.

@malton-ont
Copy link
Collaborator

malton-ont commented Jan 22, 2025

@baibhav-bioinfo,

dorado aligner strips modification information from non-primary alignments unless soft-clipping is used, as hard-clipping would make the tags and the sequence incompatible. You can turn on soft-clipping by adding -Y to your --mm2-opts.

@baibhav-bioinfo
Copy link
Author

Thanks @malton-ont , now I can see what is happening there during alignment.
I was searching for any research papers which have used Dorado for m6A analysis for using it as reference guide in my case. If there are any in your knowledge, please let me know.

@malton-ont
Copy link
Collaborator

@baibhav-bioinfo,

You'll likely get a better response to paper requests on the Nanopore community forums.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants