Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add phylogenetic #8

Merged
merged 31 commits into from
Aug 2, 2024
Merged

Add phylogenetic #8

merged 31 commits into from
Aug 2, 2024

Conversation

j23414
Copy link
Contributor

@j23414 j23414 commented Jul 9, 2024

Description of proposed changes

Add phylogenetic workflow

  • Refactor workflow into a phylogenetic directory to match pathogen repo guide
  • Pull curated NCBI genbank data from s3 instead of Fauna
  • Since original trees were built from S and L segments, add rules and configs to pull out segments from NCBI GenBank sequences using Nextclade
  • Update phylogenetic automation rules

The resulting phylogenetic trees are staged at:

Related issue(s)

Checklist

  • Checks pass

Post-merge clean up

As mentioned in nextstrain/conda-base#85 (comment), once this is merged, various downstream CIs will need to be updated:

@j23414 j23414 changed the title [DO NOT MERGE] Add phylogenetic Add phylogenetic Jul 10, 2024
@j23414 j23414 marked this pull request as ready for review July 10, 2024 18:48
Base automatically changed from add-ingest to master July 15, 2024 13:45
Copy link
Contributor

@joverlee521 joverlee521 left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Reviewed workflow bits only, will leave scientific review to others.

phylogenetic/rules/prepare_sequences_segments.smk Outdated Show resolved Hide resolved
phylogenetic/defaults/auspice_config.json Outdated Show resolved Hide resolved
phylogenetic/build-configs/ci/copy_example_data.smk Outdated Show resolved Hide resolved
phylogenetic/rules/prepare_sequences_segments.smk Outdated Show resolved Hide resolved
colors = "results/colors_{segment}.tsv"
shell:
"""
python3 scripts/assign-colors.py \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

non-blocking
Another reminder for us that there's an open issue to add this functionality to augur: nextstrain/augur#1185

phylogenetic/config/auspice_config.json Outdated Show resolved Hide resolved
display_strain_field=config["display_strain_field"],
shell:
"""
python3 scripts/set_final_strain_name.py \
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

non-blocking
Another reminder for us to get back to nextstrain/auspice#1668.

Is setting the strain name helpful for lassa? It looks like a majority of the strain names are the GenBank accession anyways.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

In practice, strain names are commonly used on phylogenetic trees for two main reasons:

  • Communication: It's easier to discuss and refer to a named strain rather than a complex identifier.
  • Standardization: Using strain names encourages sample submitters to develop clear naming conventions.

For Lassa virus specifically, the strain name serves an additional important function. Since researchers can submit two separate GenBank samples per strain (one for the L segment and one for the S segment), the consistent use of strain names allows looking at tanglegrams of the segments.

Thanks for flagging! In my quick exploration, indeed there was only 49% of the samples getting strain names. After digging in, rescued several more so that percentage should be higher.

joverlee521 added a commit to nextstrain/conda-base that referenced this pull request Jul 19, 2024
Move lassa CI to the latest pathogen-repo-ci
Depends on nextstrain/lassa#8
joverlee521 added a commit to nextstrain/docker-base that referenced this pull request Jul 19, 2024
Move lassa CI to the latest pathogen-repo-ci
Depends on nextstrain/lassa#8
joverlee521 added a commit to nextstrain/augur that referenced this pull request Jul 19, 2024
Move lassa CI to the latest pathogen-repo-ci
Depends on nextstrain/lassa#8
@joverlee521 joverlee521 mentioned this pull request Jul 19, 2024
4 tasks
joverlee521 added a commit to nextstrain/conda-base that referenced this pull request Jul 22, 2024
Move lassa CI to the latest pathogen-repo-ci
Depends on nextstrain/lassa#8
joverlee521 added a commit to nextstrain/docker-base that referenced this pull request Jul 22, 2024
Move lassa CI to the latest pathogen-repo-ci
Depends on nextstrain/lassa#8
joverlee521 added a commit to nextstrain/augur that referenced this pull request Jul 22, 2024
Move lassa CI to the latest pathogen-repo-ci
Depends on nextstrain/lassa#8
@j23414 j23414 linked an issue Jul 24, 2024 that may be closed by this pull request
j23414 added 11 commits July 30, 2024 06:44
Augur align detects the reference strain in the reference file and the curated dataset, and throws a "duplicate strain error"

`Duplicate strains of "KM822127" detected`

Usually I bypass this using `augur align --remove-reference` but this error is still showing up. Ergo, adding a postfix to the reference IDs to bypass error.
To match the curated sequences, fixup example sequences to ID on accession.
Since there are more countries represented then the original lassa build, autogenerate colors for geolocations.

This was copied and modified from the "colors" rule in RSV's workflow

* https://github.com/nextstrain/rsv/blob/a1788ce2c9c4375fb5a06d1426c64c45cf90225f/workflow/snakemake_rules/export.smk#L13-L27
* Capitalize L and S to match ingest
* Refactor and place intermediate files in segment directories
* Match segment capitalization in reference files and example files
@@ -1,5 +1,9 @@
We gratefully acknowledge the authors, originating and submitting laboratories of the genetic sequences and metadata for sharing their work. Please note that although data generators have generously shared data in an open fashion, that does not mean there should be free license to publish on this data. Data generators should be cited where possible and collaborations should be sought in some circumstances. Please try to avoid scooping someone else's work. Reach out if uncertain.

This work is made possible by the open sharing of genetic data by research groups, including these groups currently collecting Lassa sequences: [Christian Happi](http://acegid.org/), [Pardis Sabeti](https://www.sabetilab.org/), [Katherine Siddle](https://www.sabetilab.org/katherine-siddle/) and colleagues, whose data was shared via [this virological.org post](http://virological.org/t/new-lassa-virus-genomes-from-nigeria-2015-2016/191). If you intend to use these sequences prior to publication, please contact them directly to coordinate.

The Irrua specialist Teaching Hospital (ISTH) and Institute for Lassa Fever Research and Control (ILFRC), Irrua, Edo State, Nigeria; The Bernhard-Nocht Institute for Tropical Medicine (BNITM), Hamburg, Germany; Public Health England (PHE); African Center of Excellence for Genomics of Infectious Disease (ACEGID ), Redeemer’s University, Ede, Nigeria; Broad Institute of MIT and Harvard University (Cambridge, MA, USA). For further details, including conditions of reuse, please contact [Ephraim Epogbaini](mailto:[email protected]), [Stephan Günther](http://www.who.int/blueprint/about/stephan-gunther/en/), and [Philippe Lemey](https://rega.kuleuven.be/cev/ecv/lab-members/PhilippeLemey.html). Their data was first shared via [this virological.org post](http://virological.org/t/2018-lasv-sequencing-continued/192/8), which is continually updated.
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

couple nits:

  • fix cap
  • remove whitespace
  • for the other institutions, parenthetical followups are used for institutional abbreviations, not location -- standardize entry for Harvard/The Broad to match

Many of the institutions have active websites (e.g., ISTH = https://www.isth.org.ng/); consider linking to them.

Suggested change
The Irrua specialist Teaching Hospital (ISTH) and Institute for Lassa Fever Research and Control (ILFRC), Irrua, Edo State, Nigeria; The Bernhard-Nocht Institute for Tropical Medicine (BNITM), Hamburg, Germany; Public Health England (PHE); African Center of Excellence for Genomics of Infectious Disease (ACEGID ), Redeemer’s University, Ede, Nigeria; Broad Institute of MIT and Harvard University (Cambridge, MA, USA). For further details, including conditions of reuse, please contact [Ephraim Epogbaini](mailto:[email protected]), [Stephan Günther](http://www.who.int/blueprint/about/stephan-gunther/en/), and [Philippe Lemey](https://rega.kuleuven.be/cev/ecv/lab-members/PhilippeLemey.html). Their data was first shared via [this virological.org post](http://virological.org/t/2018-lasv-sequencing-continued/192/8), which is continually updated.
The Irrua Specialist Teaching Hospital (ISTH) and Institute for Lassa Fever Research and Control (ILFRC), Irrua, Edo State, Nigeria; The Bernhard-Nocht Institute for Tropical Medicine (BNITM), Hamburg, Germany; Public Health England (PHE); African Center of Excellence for Genomics of Infectious Disease (ACEGID), Redeemer’s University, Ede, Nigeria; Broad Institute of MIT and Harvard University, Cambridge, MA, USA. For further details, including conditions of reuse, please contact [Ephraim Epogbaini](mailto:[email protected]), [Stephan Günther](http://www.who.int/blueprint/about/stephan-gunther/en/), and [Philippe Lemey](https://rega.kuleuven.be/cev/ecv/lab-members/PhilippeLemey.html). Their data was first shared via [this virological.org post](http://virological.org/t/2018-lasv-sequencing-continued/192/8), which is continually updated.

Comment on lines 68 to 69
"s3://nextstrain-data/files/workflows/lassa/metadata_all.tsv.zst"
"s3://nextstrain-data/files/workflows/lassa/sequences_all.fasta.zst"
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

These URLs need to be updated based on the current upload config

Suggested change
"s3://nextstrain-data/files/workflows/lassa/metadata_all.tsv.zst"
"s3://nextstrain-data/files/workflows/lassa/sequences_all.fasta.zst"
"s3://nextstrain-data/files/workflows/lassa/all/metadata.tsv.zst"
"s3://nextstrain-data/files/workflows/lassa/all/sequences.fasta.zst"

Side question, should these check the L/S files since they are the files used by the phylogenetic workflow?

Copy link
Contributor Author

@j23414 j23414 Jul 31, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good question!  Considering that this same workflow in dengue only checks for the 'all' serotype, I believe this approach should be sufficient? Since the 'all', 'l', and 's' files are updated concurrrently, they should equally trigger the phylogenetic workflow.

However, since there is no such thing as an 'all' tree for lassa (unless we concatenated segments) and if we later decide that the all dataset is not necessary for debugging, I could see using either 'l' or 's' instead, just in case.

@@ -8,11 +8,14 @@ workdir: workflow.current_basedir
# Use default configuration values. Override with Snakemake's --configfile/--config options.
configfile: "defaults/config.yaml"

SEGMENTS = ["l", "s"]
segments = ["L", "S"]
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Flagging that the l -> L and s -> S change will require updating the nextstrain.org manifest for lassa

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Geh, good to know. I can revert L -> l in phylogenetic and back up to ingest as that may be a more consistent-with-history solution.

j23414 and others added 5 commits July 31, 2024 16:06
…t live

Pushing the phylogenetic build to staging instead of production, to allow
for time for SME's to review the build before making it live.

Make sure to update this to the live url once the build is approved.
@j23414
Copy link
Contributor Author

j23414 commented Aug 2, 2024

This additional commit (7cde259) updates augur filter to allow visualization of all S (694) and L (1272) segments in Auspice, following @jameshadfield's suggestion during a call.

I removed some group-by filtering to include more data while adding a minimum length filter to ensure quality. This change enables a comprehensive view of the dataset.

@j23414 j23414 merged commit 86cf85b into master Aug 2, 2024
4 checks passed
@j23414 j23414 deleted the add-phylogenetic branch August 2, 2024 14:24
joverlee521 added a commit to nextstrain/conda-base that referenced this pull request Aug 2, 2024
Move lassa CI to the latest pathogen-repo-ci
Depends on nextstrain/lassa#8
joverlee521 added a commit to nextstrain/augur that referenced this pull request Aug 2, 2024
Move lassa CI to the latest pathogen-repo-ci
Depends on nextstrain/lassa#8
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

add phylogenetic workflow
3 participants