Fix: update dropped strains file to list accession instead of strain names #26

Merged
merged 5 commits into from
Feb 13, 2024

j23414
Contributor

@j23414 j23414 commented Feb 10, 2024

Description of proposed changes

In the phylogenetic workflow, strains to be dropped from the build (because of excessive divergence or misclassification) were originally listed by strain name in the config/dropped_strains.txt file. After the ingest pipeline was merged (and we began using the ncbi-datasets API) in 8ab810f, records were identified by accession number instead. The dropped strains list was not updated to match, so these strains were no longer being dropped from the build.

This commit fixes the issue by adding the accession numbers to the list so that these records are dropped as intended.
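
A mismatch like this fails silently, because an entry that matches no record simply drops nothing. A minimal sketch of a check that would flag it is shown below. It is not part of this PR: the function names, the `#` comment convention, and the sample metadata IDs are assumptions made for illustration.

```python
def parse_drop_list(lines):
    """Return the identifiers in a drop list, ignoring blank lines and '#' comments."""
    ids = []
    for line in lines:
        # Assumption: anything after '#' on a line is a comment.
        entry = line.split("#", 1)[0].strip()
        if entry:
            ids.append(entry)
    return ids


def find_unmatched(drop_lines, metadata_ids):
    """Return drop-list entries that match no record ID in the metadata."""
    known = set(metadata_ids)
    return [entry for entry in parse_drop_list(drop_lines) if entry not in known]


# Hypothetical data: the metadata is keyed by accession, but one drop-list
# entry is still a strain name, so it matches nothing.
drop_lines = [
    "# too divergent",
    "JF260983",
    "DENV/SPAIN/EEB17/2009  # old-style strain name",
]
metadata_ids = ["JF260983", "HYPOTHETICAL_ACCESSION"]

unmatched = find_unmatched(drop_lines, metadata_ids)
print(unmatched)
```

Any entry reported by `find_unmatched` is a candidate for the kind of manual lookup described in the steps below.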

Steps to find accessions:

  • For strains whose names exactly match those in the fauna metadata.tsv file, the associated accession numbers are used.
  • For the remaining strains, a visual check identifies similarly named strains
    (e.g. DAK_Ar_510 is probably a shorter name for DENV2/COTE_D_IVOIRE/DAKAR510/1980).
  • Strains are cross-referenced against the ingest metadata.tsv file for records collected in the same year and country, allowing for differences in the underscores or hyphens within the strain name (e.g. DENV1/VIETNAM/BIDV992/2006 is equivalent to DENV-1/VN/BID-V992/2006).
  • Failing that, a rough search of dengue academic papers is used to find a similarly named strain with a reference to an accession number (e.g. a search for DENV/SPAIN/EEB17/2009 led to https://www.ncbi.nlm.nih.gov/pmc/articles/PMC3149010/, which led to accession number JF260983).
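
The name-matching steps above were done by hand. A rough sketch of how the first three could be partly automated follows. It is an illustration only, not the procedure used in this PR: the record layout (`strain` and `accession` keys), the placeholder accessions, and the assumption that the last two `/`-separated fields are isolate and year are all hypothetical.

```python
import re


def normalize(name):
    """Uppercase a strain name and remove underscores, hyphens, and spaces."""
    return re.sub(r"[_\-\s]", "", name.upper())


def candidate_accessions(strain, records):
    """Return accessions whose strain name plausibly matches `strain`.

    Full normalized names are compared first; if none match, only the last
    two '/'-separated fields (assumed to be isolate and year) are compared.
    """
    target = normalize(strain)
    full = [r["accession"] for r in records if normalize(r["strain"]) == target]
    if full:
        return full
    tail = target.split("/")[-2:]
    return [
        r["accession"]
        for r in records
        if normalize(r["strain"]).split("/")[-2:] == tail
    ]


# Hypothetical ingest records; the accessions are placeholders, not real IDs.
records = [
    {"strain": "DENV-1/VN/BID-V992/2006", "accession": "PLACEHOLDER_1"},
    {"strain": "DENV-2/XX/EXAMPLE-1/2006", "accession": "PLACEHOLDER_2"},
]

matches = candidate_accessions("DENV1/VIETNAM/BIDV992/2006", records)
print(matches)
```

A match found this way is only a candidate. Country abbreviations and shortened names (such as DAK_Ar_510) still need the visual check and literature search described above.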

Related issue(s)

Checklist

  • Checks pass

Post Merge rebase and update checklist

@j23414 j23414 requested a review from a team February 10, 2024 00:12
Member

@victorlin victorlin left a comment

Wow, this looks like some manual work! I haven't changed these files before. Added some non-blocking suggestions from my perspective.

phylogenetic/config/dropped_strains.txt (outdated, resolved)
Rename the file dropped_strains.txt to exclude.txt to better reflect its
purpose since it lists accession numbers instead of strain names.
This file is a list of sequences to exclude from analysis and gets
passed to `augur filter --exclude`.
joverlee521 added a commit to nextstrain/pathogen-repo-guide that referenced this pull request Feb 13, 2024
@j23414 j23414 merged commit 813fd31 into main Feb 13, 2024
8 checks passed
@j23414 j23414 deleted the fix-dropped-strains branch February 13, 2024 21:45
3 participants