Add phylogenetic #8
Reviewed workflow bits only, will leave scientific review to others.
colors = "results/colors_{segment}.tsv"
shell:
"""
python3 scripts/assign-colors.py \
non-blocking
Another reminder for us that there's an open issue to add this functionality to augur: nextstrain/augur#1185
display_strain_field=config["display_strain_field"],
shell:
"""
python3 scripts/set_final_strain_name.py \
non-blocking
Another reminder for us to get back to nextstrain/auspice#1668.
Is setting the strain name helpful for lassa? It looks like a majority of the strain names are the GenBank accession anyways.
In practice, strain names are commonly used on phylogenetic trees for two main reasons:
- Communication: It's easier to discuss and refer to a named strain rather than a complex identifier.
- Standardization: Using strain names encourages sample submitters to develop clear naming conventions.
For Lassa virus specifically, the strain name serves an additional important function. Since researchers can submit two separate GenBank samples per strain (one for the L segment and one for the S segment), the consistent use of strain names allows looking at tanglegrams of the segments.
Thanks for flagging! In my quick exploration, indeed only 49% of the samples were getting strain names. After digging in, I rescued several more, so that percentage should now be higher.
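To illustrate the segment-pairing point above, here is a minimal sketch of how a shared strain name can link an L record and an S record. The function name, the record layout, and the accessions are hypothetical and not taken from this repo's scripts.

```python
from collections import defaultdict


def pair_segments_by_strain(records):
    """Group (accession, strain, segment) tuples by strain name.

    Returns only strains that have both an L and an S record, since
    those are the ones that can be connected in a tanglegram.
    """
    by_strain = defaultdict(dict)
    for accession, strain, segment in records:
        by_strain[strain][segment] = accession
    return {
        strain: segments
        for strain, segments in by_strain.items()
        if {"L", "S"} <= set(segments)
    }


# Hypothetical records: placeholder accessions, not real GenBank IDs.
example_records = [
    ("ACC_0001", "strainA", "L"),
    ("ACC_0002", "strainA", "S"),
    ("ACC_0003", "strainB", "L"),
]
paired = pair_segments_by_strain(example_records)
```

Records that keep only the accession as their name cannot be paired this way, which is why rescuing more strain names matters for the two-segment view.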
Move lassa CI to the latest pathogen-repo-ci Depends on nextstrain/lassa#8
Copy the "copy_example_data" custom rules from the pathogen-repo-guide * https://github.com/nextstrain/pathogen-repo-guide/tree/e3bfb52c8155058a3d48592f4268a7382bf3e12a/phylogenetic/build-configs/ci
Part of work to update this repo to match the pathogen-repo-guide.
Augur align detects the reference strain in both the reference file and the curated dataset, and throws a "duplicate strain" error: `Duplicate strains of "KM822127" detected`. Usually I bypass this using `augur align --remove-reference`, but this error still shows up. Ergo, adding a postfix to the reference IDs to bypass the error.
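A minimal sketch of the postfix workaround described above, assuming a plain FASTA reference. The `_ref` postfix and the function name are illustrative assumptions; the commit does not state which postfix was used.

```python
def add_id_postfix(fasta_text, postfix="_ref"):
    """Append a postfix to the ID (first token) of every FASTA header.

    The postfix value is a placeholder; any string that makes the
    reference ID differ from the curated sequence ID would do.
    """
    out = []
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            parts = line[1:].split(maxsplit=1)
            if not parts:
                # Empty header: leave it untouched.
                out.append(line)
                continue
            parts[0] = parts[0] + postfix
            out.append(">" + " ".join(parts))
        else:
            out.append(line)
    return "\n".join(out) + "\n"


renamed = add_id_postfix(">KM822127 reference\nACGT\n")
```

With the IDs disambiguated, the reference and the curated record with the same accession no longer collide by name.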
To match the curated sequences, fixup example sequences to ID on accession.
Since there are more countries represented than in the original lassa build, autogenerate colors for geolocations. This was copied and modified from the "colors" rule in RSV's workflow * https://github.com/nextstrain/rsv/blob/a1788ce2c9c4375fb5a06d1426c64c45cf90225f/workflow/snakemake_rules/export.smk#L13-L27
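For readers unfamiliar with the idea, a rough sketch of autogenerating a colors file follows. This is not the logic of `scripts/assign-colors.py`; the function name and the hue-spacing approach are assumptions for illustration. It only relies on the tab-delimited `field`, `value`, `color` layout that Augur colors files use.

```python
import colorsys


def assign_colors(field, values):
    """Return TSV rows (field, value, hex color) for each unique value.

    Hues are spread evenly from red to violet so that the number of
    colors adapts to however many values appear in the metadata.
    """
    ordered = sorted(set(values))
    n = len(ordered)
    rows = []
    for i, value in enumerate(ordered):
        hue = 0.75 * i / max(n - 1, 1)
        r, g, b = colorsys.hsv_to_rgb(hue, 0.7, 0.9)
        hex_color = "#{:02X}{:02X}{:02X}".format(
            round(r * 255), round(g * 255), round(b * 255)
        )
        rows.append(f"{field}\t{value}\t{hex_color}")
    return rows


rows = assign_colors("country", ["Nigeria", "Sierra Leone", "Guinea", "Nigeria"])
```

Sorting alphabetically is a simplification; a real workflow may prefer a geographic ordering so that neighbouring countries get similar colors.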
* Capitalize L and S to match ingest
* Refactor and place intermediate files in segment directories
* Match segment capitalization in reference files and example files
phylogenetic/defaults/description.md
@@ -1,5 +1,9 @@
We gratefully acknowledge the authors, originating and submitting laboratories of the genetic sequences and metadata for sharing their work. Please note that although data generators have generously shared data in an open fashion, that does not mean there should be free license to publish on this data. Data generators should be cited where possible and collaborations should be sought in some circumstances. Please try to avoid scooping someone else's work. Reach out if uncertain.

This work is made possible by the open sharing of genetic data by research groups, including these groups currently collecting Lassa sequences: [Christian Happi](http://acegid.org/), [Pardis Sabeti](https://www.sabetilab.org/), [Katherine Siddle](https://www.sabetilab.org/katherine-siddle/) and colleagues, whose data was shared via [this virological.org post](http://virological.org/t/new-lassa-virus-genomes-from-nigeria-2015-2016/191). If you intend to use these sequences prior to publication, please contact them directly to coordinate.

The Irrua specialist Teaching Hospital (ISTH) and Institute for Lassa Fever Research and Control (ILFRC), Irrua, Edo State, Nigeria; The Bernhard-Nocht Institute for Tropical Medicine (BNITM), Hamburg, Germany; Public Health England (PHE); African Center of Excellence for Genomics of Infectious Disease (ACEGID ), Redeemer’s University, Ede, Nigeria; Broad Institute of MIT and Harvard University (Cambridge, MA, USA). For further details, including conditions of reuse, please contact [Ephraim Epogbaini](mailto:[email protected]), [Stephan Günther](http://www.who.int/blueprint/about/stephan-gunther/en/), and [Philippe Lemey](https://rega.kuleuven.be/cev/ecv/lab-members/PhilippeLemey.html). Their data was first shared via [this virological.org post](http://virological.org/t/2018-lasv-sequencing-continued/192/8), which is continually updated.
couple nits:
- fix cap
- remove whitespace
- for the other institutions, parenthetical followups are used for institutional abbreviations, not location -- standardize entry for Harvard/The Broad to match
Many of the institutions have active websites (e.g., ISTH = https://www.isth.org.ng/); consider linking to them.
- The Irrua specialist Teaching Hospital (ISTH) and Institute for Lassa Fever Research and Control (ILFRC), Irrua, Edo State, Nigeria; The Bernhard-Nocht Institute for Tropical Medicine (BNITM), Hamburg, Germany; Public Health England (PHE); African Center of Excellence for Genomics of Infectious Disease (ACEGID ), Redeemer’s University, Ede, Nigeria; Broad Institute of MIT and Harvard University (Cambridge, MA, USA). For further details, including conditions of reuse, please contact [Ephraim Epogbaini](mailto:[email protected]), [Stephan Günther](http://www.who.int/blueprint/about/stephan-gunther/en/), and [Philippe Lemey](https://rega.kuleuven.be/cev/ecv/lab-members/PhilippeLemey.html). Their data was first shared via [this virological.org post](http://virological.org/t/2018-lasv-sequencing-continued/192/8), which is continually updated.
+ The Irrua Specialist Teaching Hospital (ISTH) and Institute for Lassa Fever Research and Control (ILFRC), Irrua, Edo State, Nigeria; The Bernhard-Nocht Institute for Tropical Medicine (BNITM), Hamburg, Germany; Public Health England (PHE); African Center of Excellence for Genomics of Infectious Disease (ACEGID), Redeemer’s University, Ede, Nigeria; Broad Institute of MIT and Harvard University, Cambridge, MA, USA. For further details, including conditions of reuse, please contact [Ephraim Epogbaini](mailto:[email protected]), [Stephan Günther](http://www.who.int/blueprint/about/stephan-gunther/en/), and [Philippe Lemey](https://rega.kuleuven.be/cev/ecv/lab-members/PhilippeLemey.html). Their data was first shared via [this virological.org post](http://virological.org/t/2018-lasv-sequencing-continued/192/8), which is continually updated.
Co-authored-by: John SJ Anderson <[email protected]>
"s3://nextstrain-data/files/workflows/lassa/metadata_all.tsv.zst"
"s3://nextstrain-data/files/workflows/lassa/sequences_all.fasta.zst"
These URLs need to be updated based on the current upload config:
- "s3://nextstrain-data/files/workflows/lassa/metadata_all.tsv.zst"
- "s3://nextstrain-data/files/workflows/lassa/sequences_all.fasta.zst"
+ "s3://nextstrain-data/files/workflows/lassa/all/metadata.tsv.zst"
+ "s3://nextstrain-data/files/workflows/lassa/all/sequences.fasta.zst"
Side question, should these check the L/S files since they are the files used by the phylogenetic workflow?
Good question! Considering that this same workflow in dengue only checks for the 'all' serotype, I believe this approach should be sufficient? Since the 'all', 'l', and 's' files are updated concurrently, they should equally trigger the phylogenetic workflow.
However, since there is no such thing as an 'all' tree for lassa (unless we concatenated segments) and if we later decide that the 'all' dataset is not necessary for debugging, I could see using either 'l' or 's' instead, just in case.
phylogenetic/Snakefile
@@ -8,11 +8,14 @@ workdir: workflow.current_basedir
# Use default configuration values. Override with Snakemake's --configfile/--config options.
configfile: "defaults/config.yaml"

SEGMENTS = ["l", "s"]
segments = ["L", "S"]
Flagging that the l -> L and s -> S change will require updating the nextstrain.org manifest for lassa
Geh, good to know. I can revert L -> l in phylogenetic and back up to ingest as that may be a more consistent-with-history solution.
Co-authored-by: Jover Lee <[email protected]>
…t live Pushing the phylogenetic build to staging instead of production, to allow time for SMEs to review the build before making it live. Make sure to update this to the live URL once the build is approved.
…all that meet a min length requirement
In this additional commit (7cde259), I removed some group-by filtering to include more data and added a minimum length filter to ensure quality. This change enables a comprehensive view of the dataset.
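As a rough illustration of a minimum length filter, the sketch below keeps sequences with enough unambiguous bases. The function name and the example threshold are assumptions; the workflow's actual rule and cutoff are not shown in this conversation. Counting only A, C, G, and T mirrors how `augur filter --min-length` measures length.

```python
def filter_by_min_length(sequences, min_length):
    """Keep sequences with at least `min_length` A/C/G/T characters.

    `sequences` maps a sequence name to its nucleotide string.
    Ambiguous characters such as N and gaps do not count toward length.
    """
    kept = {}
    for name, seq in sequences.items():
        upper = seq.upper()
        n_valid = sum(upper.count(base) for base in "ACGT")
        if n_valid >= min_length:
            kept[name] = seq
    return kept


# Toy sequences and a toy threshold, for illustration only.
toy = {"short": "ACGTNNNN", "long": "acgtACGT"}
kept = filter_by_min_length(toy, min_length=6)
```

Dropping the group-by subsampling while keeping a length floor trades an evenly sampled tree for a fuller one, which matches the stated goal of a comprehensive view.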
Description of proposed changes
Add phylogenetic workflow
phylogenetic directory to match pathogen repo guide
The resulting phylogenetic trees are staged at:
Related issue(s)
Checklist
Post-merge clean up
As mentioned in nextstrain/conda-base#85 (comment), once this is merged, various downstream CIs will need to be updated: