contig_id in any of the files cannot be mapped back to original FASTA #99
Comments
The reason for fixing the FASTA headers is that spaces and other characters can cause issues with tools like BLAST, yet a variety of tools put information in the FASTA headers delimited by spaces. All of the BLAST calls require mapping back to the contig identifiers, and spaces cause issues there. We originally planned to strip the headers entirely in favor of a sequential integer ID, but compromised on the current approach since MOB-suite does use the circularity information and it would be harder to map things back. I can put this in as a feature request to have the original headers maintained in the outputs.
Sorry for digging further, but how does it cause issues with BLAST? As far as I'm aware, BLAST outputs the contig ID as per the FASTA standard in most of its output formats. Thanks again, I would really appreciate this addition!
Issue #87 shows the problem with very long headers. I have changed the code so that all sequences are renamed in the format {int}_{md5}_circular={status} for all BLAST and Mash searches. The program then relabels all of the sequences with their original contig IDs. This should solve your issue, as I agree there is a problem with mapping contigs back when integrating results from other tools. Version 3.1.0 will implement this.
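For readers following along, the rename-then-restore idea described above could look roughly like the sketch below. This is not the MOB-suite implementation: the function names are hypothetical, and the thread does not say what the md5 is computed over, how the integer is numbered, or how the status is spelled. The sketch assumes the md5 is taken over the sequence, the integer is a zero-based position, and the status is "true" or "false".

```python
import hashlib


def make_temp_ids(records):
    """Assign temporary search-safe IDs.

    records: list of (original_id, sequence, is_circular) tuples.
    Returns (temp_to_original, renamed), where renamed is a list of
    (temp_id, sequence) tuples in the same order as the input.
    """
    temp_to_original = {}
    renamed = []
    for index, (original_id, sequence, is_circular) in enumerate(records):
        # Assumption: the md5 part is a digest of the sequence itself.
        digest = hashlib.md5(sequence.encode("ascii")).hexdigest()
        # Assumption: status is written as "true" / "false".
        status = "true" if is_circular else "false"
        temp_id = f"{index}_{digest}_circular={status}"
        temp_to_original[temp_id] = original_id
        renamed.append((temp_id, sequence))
    return temp_to_original, renamed


def restore_ids(rows, temp_to_original, column=0):
    """Swap temporary IDs in result rows back to the original contig IDs."""
    restored = []
    for row in rows:
        row = list(row)
        # Unknown IDs are left untouched rather than dropped.
        row[column] = temp_to_original.get(row[column], row[column])
        restored.append(row)
    return restored
```

Because the temporary IDs contain no whitespace, they are safe to pass through search tools, and the lookup table makes the renaming fully reversible before any report is written.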
Thanks @jrober84 for the effort! One suggestion: I'd suggest putting the sequence ID in the
Thanks for sharing reflections on
Thanks @kbessonov1984. While the output may be "correctly" written to TSV by Python, it won't be parseable, as it will have a variable number of columns depending on whether a FASTA sequence had tabs in its header. E.g. I don't think you'll be able to read the mob-recon output back into pandas or use Python's
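The column-count concern can be illustrated with a small standard-library sketch. The row values below are invented for illustration, and how MOB-suite actually writes its reports is not shown in this thread. The sketch only demonstrates that a tab inside a header adds a column when fields are joined naively, and that a quoted field still splits wrongly for any reader that is not quote-aware.

```python
import csv
import io

# Hypothetical report row whose first field is a header containing a tab.
row = ["contig_1\tlength=100", "plasmid", "AA001"]

# Naive join: the embedded tab is indistinguishable from a separator.
naive_line = "\t".join(row)
naive_fields = naive_line.split("\t")  # four fields instead of three

# csv.writer quotes the field, so csv.reader can recover three fields.
buffer = io.StringIO()
csv.writer(buffer, delimiter="\t").writerow(row)
quoted_line = buffer.getvalue()
parsed = next(csv.reader(io.StringIO(quoted_line), delimiter="\t"))

# A reader that ignores quoting still sees four fields on the quoted line.
unaware_fields = quoted_line.rstrip("\r\n").split("\t")
```

Either way, keeping tabs out of the identifier column avoids depending on how each downstream parser treats quoting.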
Thanks for the great tool and your help both on and offline.
I'm curious about the decision to run fix_fasta_header before processing the FASTA. This makes it basically impossible (as far as I can tell) to combine MOB-suite outputs with those of other tools downstream, as it irreversibly modifies the actual contig IDs and then writes those out as new contig IDs in all of its outputs. Was this a conscious decision? Is it possible to keep the contig IDs as the FASTA convention defines them, i.e. the first word before any whitespace in the FASTA header?
It seems like another approach would be to parse the input FASTA file with Biopython, maintaining separation between the contig ID and the description. As far as I can tell, this is the only place that actually needs the sequence description:
mob-suite/mob_suite/mob_recon.py
Lines 1138 to 1142 in 1d735b3
I'd be happy to open a PR if that makes sense to you? Thanks again!
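To make the suggestion concrete, here is a minimal standard-library sketch of keeping the contig ID and the description separate while reading a FASTA file. It is not taken from MOB-suite and the function names are hypothetical. Biopython's SeqIO.parse exposes similar information through record.id and record.description; plain Python is used here only so the sketch has no dependencies.

```python
import io


def split_fasta_header(header_line):
    """Split a '>' header line into (seq_id, description).

    The ID is the first whitespace-delimited word; the rest is the description.
    """
    text = header_line[1:] if header_line.startswith(">") else header_line
    parts = text.strip().split(None, 1)
    seq_id = parts[0] if parts else ""
    description = parts[1] if len(parts) > 1 else ""
    return seq_id, description


def read_fasta(handle):
    """Yield (seq_id, description, sequence) for each record in a FASTA handle."""
    seq_id, description, chunks = None, "", []
    for line in handle:
        line = line.rstrip("\r\n")
        if line.startswith(">"):
            if seq_id is not None:
                yield seq_id, description, "".join(chunks)
            seq_id, description = split_fasta_header(line)
            chunks = []
        elif line:
            chunks.append(line.strip())
    if seq_id is not None:
        yield seq_id, description, "".join(chunks)


# Usage with an in-memory example; the headers here are made up.
example = io.StringIO(">contig_1 length=100 circular=true\nACGT\nGG\n>contig_2\nTTAA\n")
example_records = list(read_fasta(example))
```

With this split, the description (for instance a circularity flag) stays available to the one place that needs it, while every output can report the unmodified ID.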