Bug: Update dropped strains file to list accession instead of strain #24

Closed
j23414 opened this issue Feb 7, 2024 · 5 comments
Labels
bug Something isn't working

Comments

@j23414
Contributor

j23414 commented Feb 7, 2024

Current Behavior

Since 8ab810f, strains listed in phylogenetic/config/dropped_strains.txt are no longer being dropped, because the file still lists strain names rather than accessions.

Expected behavior

Strains listed in dropped_strains.txt should not appear in the final phylogenetic tree.

How to reproduce

Possible solution

Perhaps cherry pick a commit like:


j23414 added the bug label Feb 7, 2024
@victorlin
Member

Good catch! If I understand correctly, phylogenetic/config/dropped_strains.txt is used for augur filter --exclude, so it should be updated alongside 8ab810f. Looking more carefully at #12, the --sequences input should also be updated to match the new ID values, but I don't see that it was changed. Does it still work?

@j23414
Contributor Author

j23414 commented Feb 7, 2024

Correct, and yes the --sequences input still works :D

git clone https://github.com/nextstrain/dengue.git
cd dengue/phylogenetic
nextstrain build . data/sequences_all.fasta
grep  ">" data/sequences_all.fasta | head -n5

Which shows the sequences are ID'd by accession, not strain name:

>NC_075403
>NC_075435
>OQ919688
>ON123563
>ON123564
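To make the mismatch concrete, here is a minimal Python sketch (not part of the dengue workflow) that reports which entries of an exclude list match a sequence ID in the FASTA and which do not. The function names, the example strain name `DENV1/EXAMPLE/2020`, and the treatment of `#` as a comment marker are assumptions for illustration only.

```python
def read_fasta_ids(fasta_lines):
    """Return the set of sequence IDs: the first token of each '>' header line."""
    ids = set()
    for line in fasta_lines:
        if line.startswith(">"):
            header = line[1:].strip()
            if header:
                ids.add(header.split()[0])
    return ids


def read_exclude_ids(exclude_lines):
    """Return exclude-file entries, skipping blank lines and '#' comments (assumed syntax)."""
    entries = []
    for line in exclude_lines:
        entry = line.split("#", 1)[0].strip()
        if entry:
            entries.append(entry)
    return entries


def split_exclusions(fasta_lines, exclude_lines):
    """Split exclude entries into those that match a FASTA ID and those that do not."""
    ids = read_fasta_ids(fasta_lines)
    matched, unmatched = [], []
    for entry in read_exclude_ids(exclude_lines):
        (matched if entry in ids else unmatched).append(entry)
    return matched, unmatched


if __name__ == "__main__":
    # Accession IDs taken from the grep output above; the strain name is hypothetical.
    fasta = [">NC_075403", "ACGT", ">OQ919688", "ACGT"]
    exclude = ["# dropped strains", "DENV1/EXAMPLE/2020", "OQ919688"]
    matched, unmatched = split_exclusions(fasta, exclude)
    print("would be dropped:", matched)
    print("no matching sequence ID:", unmatched)
```

Any entry that lands in the unmatched list is a strain-name entry that would need to be replaced by its accession.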

@victorlin
Member

victorlin commented Feb 7, 2024

Good to know! But how did it work before #12 if the sequences were ID'd by accession and augur filter was using strain as the ID column?

@j23414
Contributor Author

j23414 commented Feb 7, 2024

> augur filter was using strain as the ID column?

This worked when we were still pulling strains from fauna (which also let us rely on fauna's deduplication of strain names).

However, we have since shifted to using the ingest folder and pulling data with NCBI Datasets. After ingest was merged, the data.nextstrain.org/files were updated so we could have a smooth transition.
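Given the move from fauna strain names to NCBI accessions, one possible way to migrate an existing exclude list is sketched below in Python. This is only an illustration, not the fix that was merged: the metadata column names `strain` and `accession`, the function names, and the example rows are all assumptions. Because strain names are no longer deduplicated, a name that maps to zero or several accessions is set aside for manual review rather than guessed.

```python
import csv
import io


def strain_to_accession_map(metadata_tsv, strain_col="strain", accession_col="accession"):
    """Map each strain name to the list of accessions that carry it.

    The column names are assumptions; adjust them to the actual metadata file.
    """
    reader = csv.DictReader(io.StringIO(metadata_tsv), delimiter="\t")
    mapping = {}
    for row in reader:
        mapping.setdefault(row[strain_col], []).append(row[accession_col])
    return mapping


def convert_exclude_list(strains, mapping):
    """Replace strain names with accessions where the mapping is unambiguous."""
    converted, unresolved = [], []
    for strain in strains:
        accessions = mapping.get(strain, [])
        if len(accessions) == 1:
            converted.append(accessions[0])
        else:
            # Missing or duplicated strain name: needs a manual decision.
            unresolved.append(strain)
    return converted, unresolved


if __name__ == "__main__":
    # Hypothetical metadata; real rows would come from the ingest output.
    metadata = "accession\tstrain\nAB000001\tstrainA\nAB000002\tstrainB\nAB000003\tstrainB\n"
    mapping = strain_to_accession_map(metadata)
    converted, unresolved = convert_exclude_list(["strainA", "strainB", "strainC"], mapping)
    print("accessions to list:", converted)
    print("needs manual review:", unresolved)
```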

@victorlin
Member

Closed by #26
