Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Replace BlastN by Minimap2 splice. #104

Closed
wants to merge 13 commits into from

Conversation

reichan1998
Copy link
Contributor

@reichan1998 reichan1998 commented Aug 6, 2024

Fixes #66

  • Replace BlastN step in PacBio read filtering by Minimap2 splice.
  • Change vector_db in nextflow.config from vectorDB.tar.gz to pacbio_vectorDB.fasta.

PR checklist

  • This comment contains a description of changes (with reason).
  • If you've fixed a bug or added code that should be tested, add tests!
  • If you've added a new tool - have you followed the pipeline conventions in the contribution docs
  • Make sure your code lints (nf-core lint).
  • Ensure the test suite passes (nextflow run . -profile test,docker --outdir <OUTDIR>).
  • Usage Documentation in docs/usage.md is updated.
  • Output Documentation in docs/output.md is updated.
  • CHANGELOG.md is updated.
  • README.md is updated (including new tool citations and authors/contributors).

@reichan1998 reichan1998 linked an issue Aug 6, 2024 that may be closed by this pull request
@tkchafin tkchafin self-requested a review August 7, 2024 12:18
Copy link
Contributor

@tkchafin tkchafin left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think you should be able to do this using the existing minimap2_align nf-core module, and setting splice using ext.args as you are currently doing? Should also check if we can store and provide a gzipped fasta, and if there is any benefit to indexing it before running.

@reichan1998
Copy link
Contributor Author

Thank you for your feedback! I will try and enhance the process.

@reichan1998
Copy link
Contributor Author

Hi @tkchafin,

I have set the splice option using ext.args in the nf-core module minimap2 align. Checking the splice command input options, I found that it can accept both gzipped files of vector_db and fasta files. I indexed vector_db before running the analysis, but it doesn't seem to have had a significant impact on running time or resource usage. I have attached the execution report for your review.

assets/pacbio_vectorDB.fasta Outdated Show resolved Hide resolved
conf/modules.config Show resolved Hide resolved
docs/output.md Show resolved Hide resolved
nextflow.config Show resolved Hide resolved
nextflow_schema.json Show resolved Hide resolved
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We should not be making changes to any nf-core modules, because these will be overwritten when the module updates. Any specific runtime parameters should only be passed using the ext in modules.config.

@tkchafin
Copy link
Contributor

I don't think you should need to modify the accepted arguments for nf-core/minimap2_align, because in the case we are using it in pacbio_filter.nf would the database not be passed using the already existing reference argument? You should just be able to format that using map and pass as it is. In practice we do not like to make any changes to the nf-core modules because these will be overwritten down the line when running updates

@reichan1998
Copy link
Contributor Author

Thank you very much for your feedback @tkchafin. I'm working on it now.

@tkchafin
Copy link
Contributor

I guess we should maybe think more about the presets, we are using splice because this was suggested in the issue, but it looks like align-hifi might actually be more appropriate. We need to look into what these presets actually do

# use presets (no test data)
./minimap2 -ax map-pb ref.fa pacbio.fq.gz > aln.sam       # PacBio CLR genomic reads
./minimap2 -ax map-ont ref.fa ont.fq.gz > aln.sam         # Oxford Nanopore genomic reads
./minimap2 -ax map-hifi ref.fa pacbio-ccs.fq.gz > aln.sam # PacBio HiFi/CCS genomic reads (v2.19 or later)
./minimap2 -ax lr:hq ref.fa ont-Q20.fq.gz > aln.sam       # Nanopore Q20 genomic reads (v2.27 or later)
./minimap2 -ax sr ref.fa read1.fa read2.fa > aln.sam      # short genomic paired-end reads
./minimap2 -ax splice ref.fa rna-reads.fa > aln.sam       # spliced long reads (strand unknown)
./minimap2 -ax splice -uf -k14 ref.fa reads.fa > aln.sam  # noisy Nanopore Direct RNA-seq
./minimap2 -ax splice:hq -uf ref.fa query.fa > aln.sam    # Final PacBio Iso-seq or traditional cDNA
./minimap2 -ax splice --junc-bed anno.bed12 ref.fa query.fa > aln.sam  # prioritize on annotated junctions
./minimap2 -cx asm5 asm1.fa asm2.fa > aln.paf             # intra-species asm-to-asm alignment
./minimap2 -x ava-pb reads.fa reads.fa > overlaps.paf     # PacBio read overlap
./minimap2 -x ava-ont reads.fa reads.fa > overlaps.paf    # Nanopore read overlap

@reichan1998
Copy link
Contributor Author

reichan1998 commented Aug 14, 2024

Thank you, Tyler, for pointing this out. I was also confused about the preset.
I changed task.ext.args to -x splice:hq Long-read splice alignment for PacBio CCS reads (-xsplice -C5 -O6,24 -B4) in the commit following the document here.
Here is -x splice:hq result for mMelMel3_T4:

m64094_200911_174739/112397133/ccs	16666	16608	16665	+	bcAd1022T	78	3	60	57	57	16	tp:A:P	cm:i:16	s1:i:57	s2:i:47	dv:f:0.0040	rl:i:0
m64094_200911_174739/112397133/ccs	16666	16618	16665	+	bcAd1020T	78	13	60	47	47	0	tp:A:S	cm:i:13	s1:i:47	dv:f:0.0049	rl:i:0
m64094_200911_174739/112397133/ccs	16666	16620	16665	+	bcAd1016T	78	15	60	45	45	0	tp:A:S	cm:i:12	s1:i:45	dv:f:0.0053	rl:i:0
m64094_200911_174739/112397133/ccs	16666	16620	16665	+	bcAd1015T	78	15	60	45	45	0	tp:A:S	cm:i:12	s1:i:45	dv:f:0.0053	rl:i:0
m64094_200911_174739/112397133/ccs	16666	16620	16665	+	bcAd1011T	78	15	60	45	45	0	tp:A:S	cm:i:12	s1:i:45	dv:f:0.0053	rl:i:0
m64094_200911_174739/112397133/ccs	16666	16620	16665	+	bcAd1001T	78	15	60	45	45	0	tp:A:S	cm:i:12	s1:i:45	dv:f:0.0053	rl:i:0
m64094_200911_174739/125897038/ccs	12653	81	152	-	bcAd1022T	78	3	74	71	71	15	tp:A:P	cm:i:20	s1:i:71	s2:i:62	dv:f:0	rl:i:0
m64094_200911_174739/125897038/ccs	12653	3	74	-	bcAd1022T	78	3	74	71	71	35	tp:A:P	cm:i:20	s1:i:71	s2:i:50	dv:f:0	rl:i:0
m64094_200911_174739/125897038/ccs	12653	17	125	-	MG551957.1	4034	2	2049	72	2047	0	tp:A:S	cm:i:18	s1:i:62	dv:f:0.0290	rl:i:0
m64094_200911_174739/125897038/ccs	12653	92	142	-	bcAd1020T	78	13	63	50	50	0	tp:A:S	cm:i:14	s1:i:50	dv:f:0	rl:i:0
m64094_200911_174739/125897038/ccs	12653	14	64	-	bcAd1020T	78	13	63	50	50	0	tp:A:S	cm:i:14	s1:i:50	dv:f:0	rl:i:0
m64094_200911_174739/125897038/ccs	12653	14	62	-	bcAd1016T	78	15	63	48	48	0	tp:A:S	cm:i:13	s1:i:48	dv:f:0	rl:i:0
m64094_200911_174739/125897038/ccs	12653	92	140	-	bcAd1017T	78	15	63	48	48	0	tp:A:S	cm:i:13	s1:i:48	dv:f:0	rl:i:0
 

I have tested with -x map-hifi and here is its result for mMelMel3_T4:

m64094_200911_174739/112397133/ccs	16666	16606	16649	+	bcAd1022T	78	1	44	43	43	3	tp:A:P	cm:i:3	s1:i:43	s2:i:0	dv:f:0.0150	rl:i:0
m64094_200911_174739/125897038/ccs	12653	93	154	-	bcAd1022T	78	1	62	61	61	22	tp:A:P	cm:i:4	s1:i:61	s2:i:0	dv:f:0	rl:i:0
m64094_200911_174739/125897038/ccs	12653	15	76	-	bcAd1022T	78	1	62	61	61	22	tp:A:P	cm:i:4	s1:i:61	s2:i:0	dv:f:0	rl:i:0

They're different but the sequences needs to be filtered might be the same in both case (m64094_200911_174739/112397133/ccs and m64094_200911_174739/125897038/ccs)

@reichan1998
Copy link
Contributor Author

Hi @tkchafin, here's the blocklist result by BLASTN after filtering.

m64094_200911_174739/112397133/ccs
m64094_200911_174739/118490605/ccs
m64094_200911_174739/122815136/ccs
m64094_200911_174739/125897038/ccs
m64094_200911_174739/1311016/ccs
m64094_200911_174739/147720881/ccs
m64094_200911_174739/155451773/ccs
m64094_200911_174739/158007639/ccs
m64094_200911_174739/22743255/ccs
m64094_200911_174739/30081129/ccs
m64094_200911_174739/77136980/ccs

@tkchafin
Copy link
Contributor

tkchafin commented Aug 23, 2024

@reichan1998 Thanks for providing those results, it seems like blastn is finding a lot more hits. So I guess the question is (1) if we can alter the settings for minimap2 to better fit the blastn parameters; (2) if replacing with minimap2 is worthwhile at all; and (3) if another blastn alternative might be worth considering (mega blast or some specific optimisation like this one https://www.ncbi.nlm.nih.gov/pmc/articles/PMC4652774/)

@reichan1998
Copy link
Contributor Author

Thanks, Tyler. I’ll start by looking into how we can adjust the settings for minimap2 to better align with the blastn parameters and then evaluate if it’s worth replacing blastn with minimap2 or if another alternative might be more effective. I’ll update you on what I find.

@tkchafin
Copy link
Contributor

Closing as we have determined minimap2 splice to not work as an alternative. Opening new ticket to explore alternatives #114

@tkchafin tkchafin closed this Sep 10, 2024
@reichan1998 reichan1998 deleted the try_minimap2_splice branch September 17, 2024 04:26
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Try minimap2 splice
2 participants