How does a small database affect the taxonomic assignment? #2023

Open
Fmicro23 opened this issue Sep 18, 2024 · 1 comment

Comments

@Fmicro23

Good morning!
I am working on marine aerobic methanotrophs, a group of microorganisms that oxidize methane in the presence of oxygen, and I used dada2 to analyse pmoA gene (encoding methane monooxygenase) sequences that I took from a study. This study used a specific pmoA gene database to assign the taxonomy, and I wondered IF and HOW a small database can affect the algorithm behind the "assignTaxonomy" function. If it does, can you suggest how to solve this problem?

Many many thanks in advance!

@benjjneb
Owner

You can only assign to what is in your database, so that is one way. The second important way is that without outgroup sequences, you can make spurious assignments to what is in the database even when the query sequence is not very similar to anything in the reference database. This is because of the way the naive Bayesian classifier method (implemented by assignTaxonomy) calculates the certainty of an assignment. It subsamples the query sequence and checks how often those subsamples get assigned the same taxonomy as the full query. In small databases with no outgroup sequences, the subsamples are much more likely to match the same reference sequence as the full query, simply because no other remotely similar sequences are in the database.
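
To make that second effect concrete, here is a toy Python sketch of an RDP-style bootstrap confidence. It is not dada2's code: the sequences, the names `ref_A`, `ref_B` and `outgroup`, the 4-mer word size, and the tie-breaking rule are all assumptions made up for illustration (the real classifier uses 8-mers and real reference sequences).

```python
import math
import random

K = 4  # toy word size for short demo sequences (assumption; RDP-style classifiers use 8)

def kmers(seq, k=K):
    """Set of overlapping k-mers (words) in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def classify(words, db):
    """Return the label with the highest naive Bayes log-score for the given words.

    db maps label -> set of words from a single reference sequence.
    Ties go to the first label in sorted order (a simplification).
    """
    n_refs = len(db)
    best_label, best_score = None, -math.inf
    for label in sorted(db):
        score = 0.0
        for w in words:
            n_w = sum(w in ref_words for ref_words in db.values())
            prior = (n_w + 0.5) / (n_refs + 1)   # smoothed word prior
            m = 1 if w in db[label] else 0       # word seen in this reference?
            score += math.log((m + prior) / 2)   # one training sequence per label
        if score > best_score:
            best_label, best_score = label, score
    return best_label

def bootstrap_confidence(query, db, n_boot=100, seed=0):
    """Assign the full query, then report how often 1/8-size word subsamples agree."""
    rng = random.Random(seed)
    words = sorted(kmers(query))
    assigned = classify(words, db)
    n_sub = max(1, len(words) // 8)
    agree = sum(
        classify(rng.choices(words, k=n_sub), db) == assigned
        for _ in range(n_boot)
    )
    return assigned, agree / n_boot

# Made-up sequences: the query shares only its first 20 bases with ref_A
# and no words at all with the A/T-only ref_B.
query = "GCGTACCGGTCAGGCTTCGACCGTGCAAGGCT"
small_db = {
    "ref_A": kmers("GCGTACCGGTCAGGCTTCGAGTGCCATGGACC"),
    "ref_B": kmers("ATTATAATTTAATATTAAATTATATTTAATAT"),
}
# Same database plus a hypothetical outgroup reference that contains the query.
with_outgroup = dict(small_db, outgroup=kmers("TTGA" + query + "CAGT"))

small_label, small_conf = bootstrap_confidence(query, small_db)
out_label, out_conf = bootstrap_confidence(query, with_outgroup)
print("small database:", small_label, small_conf)
print("with outgroup: ", out_label, out_conf)
```

In this toy setup the divergent query comes back as `ref_A` with full bootstrap support from the two-reference database, and moves to `outgroup` once a reference that actually contains it is added. The high support in the first case reflects the lack of competitors, not a close match.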

I don't know about pmoA references, but you could consider a method like IdTaxa in the DECIPHER package as an alternative approach. This method directly considers sequence similarity between query and best reference match when making assignments, and is more robust to the type of error I described above.
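
As a rough illustration of what "considering similarity to the best match" can look like, the Python sketch below withholds an assignment when the query shares too few k-mers with its best-matching reference. This is not IdTaxa's algorithm; the function names, the 4-mer word size, the `min_shared=0.8` cutoff, and the sequences are hypothetical choices for the example.

```python
def kmers(seq, k=4):
    """Set of overlapping k-mers in a sequence."""
    return {seq[i:i + k] for i in range(len(seq) - k + 1)}

def shared_fraction(query, reference, k=4):
    """Fraction of the query's distinct k-mers that also occur in the reference."""
    q = kmers(query, k)
    return len(q & kmers(reference, k)) / len(q) if q else 0.0

def checked_assignment(query, refs, min_shared=0.8, k=4):
    """Return (label, similarity); label is None when the best match is too dissimilar."""
    best = max(sorted(refs), key=lambda name: shared_fraction(query, refs[name], k))
    sim = shared_fraction(query, refs[best], k)
    return (best if sim >= min_shared else None), sim

# Made-up reference sequences, for illustration only.
refs = {
    "ref_A": "GCGTACCGGTCAGGCTTCGAGTGCCATGGACC",
    "ref_B": "ATTATAATTTAATATTAAATTATATTTAATAT",
}

close_label, close_sim = checked_assignment("GCGTACCGGTCAGGCTTCGAGTGC", refs)
far_label, far_sim = checked_assignment("GGGGCCCCGGGGCCCC", refs)
print(close_label, close_sim)  # a fragment of ref_A keeps its label
print(far_label, far_sim)      # a distant query is left unassigned
```

A check like this can also be run after assignTaxonomy as a sanity filter, with the cutoff tuned to how divergent your gene family is.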
