Modernize phylogenetic build [#4]

NOTE: As of this commit, the build _runs_, but only for the whole genome, not for the E1 gene-specific build. Additionally, many aspects of the build are incorrect and need to be tuned or revised.
nextstrain · Jan 30, 2025 · 6fdbc3d
1 parent 4725ec4
commit 6fdbc3d

Showing 22 changed files with 978 additions and 353 deletions.
diff --git a/phylogenetic/README.md b/phylogenetic/README.md
@@ -1,37 +1,36 @@
-# Phylogenetic
+# Phylogenetic workflow
 
-This workflow uses metadata and sequences to produce one or multiple [Nextstrain datasets][]
-that can be visualized in Auspice.
-
-Resulting tree is available here: https://nextstrain.org/groups/neherlab/staging/nipah
-
-## Background
-
-See e.g. [Whitmer et. al, 2020](https://academic.oup.com/ve/article/7/1/veaa062/5894561)
+This workflow uses metadata and sequences to produce one or multiple
+[Nextstrain datasets][] that can be visualized in Auspice.
 
 ## Data Requirements
 
-The core phylogenetic workflow will use metadata values as-is, so please do any
-desired data formatting and curations as part of the [ingest](../ingest/) workflow.
+The core phylogenetic workflow will use metadata values as-is, so
+please do any desired data formatting and curations as part of the
+[ingest][] workflow.
 
-1. The metadata must include an ID column that can be used as as exact match for
-   the sequence ID present in the FASTA headers.
-2. The `date` column in the metadata must be in ISO 8601 date format (i.e. YYYY-MM-DD).
+1. The metadata must include an ID column that can be used as an exact
+   match for the sequence ID present in the FASTA headers.
+2. The `date` column in the metadata must be in ISO 8601 date format
+   (i.e. YYYY-MM-DD).
 3. Ambiguous dates should be masked with `XX` (e.g. 2023-01-XX).
 
 ## Config
 
-The config directory contains all of the default configurations for the phylogenetic workflow.
-
-[config/defaults.yaml](config/defaults.yaml) contains all of the default configuration parameters
-used for the phylogenetic workflow. Use Snakemake's `--configfile`/`--config`
-options to override these default values.
+[defaults/config.yaml][] contains all of the default configuration
+parameters used for the phylogenetic workflow. Use Snakemake's
+`--configfile`/`--config` options to override these default values.
 
 ## Snakefile and rules
 
-The rules directory contains separate Snakefiles (`*.smk`) as modules of the core phylogenetic workflow.
-The modules of the workflow are in separate files to keep the main ingest [Snakefile](Snakefile) succinct and organized.
-Modules are all [included](https://snakemake.readthedocs.io/en/stable/snakefiles/modularization.html#includes)
-in the main Snakefile in the order that they are expected to run.
+The rules directory contains separate Snakefiles (`*.smk`) as modules
+of the core phylogenetic workflow. The modules of the workflow are in
+separate files to keep the main [Snakefile][] succinct and
+organized. Modules are all [included][] in the main Snakefile in the
+order that they are expected to run.
 
+[defaults/config.yaml]: ./config/defaults.yaml
+[included]: https://snakemake.readthedocs.io/en/stable/snakefiles/modularization.html#includes
+[ingest]: ../ingest/
 [Nextstrain datasets]: https://docs.nextstrain.org/en/latest/reference/glossary.html#term-dataset
+[Snakefile]: ./Snakefile
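
The three data requirements listed in the README (matching IDs, ISO 8601 dates, `XX` masking for ambiguous dates) could be checked before running the workflow with a small script. The following is a minimal sketch, not part of this commit: the function names are hypothetical, the sequence ID is assumed to be the first whitespace-delimited token of each FASTA header, and the year is assumed to always be known (only month and day may be masked).

```python
import re

# ISO 8601 calendar date (YYYY-MM-DD) where the day, or both month and day,
# may be masked with "XX" (e.g. 2023-01-XX, 2023-XX-XX).
# Assumptions: the year is always given, and a masked month implies a
# masked day. Day-of-month is not checked against the month length.
DATE_PATTERN = re.compile(
    r"\d{4}-(?:(?:0[1-9]|1[0-2])-(?:0[1-9]|[12]\d|3[01]|XX)|XX-XX)"
)


def is_valid_date(value):
    """Return True if value is a complete or XX-masked YYYY-MM-DD date."""
    return DATE_PATTERN.fullmatch(value) is not None


def fasta_ids(fasta_text):
    """Collect sequence IDs: the first token after '>' on each header line."""
    return {
        line[1:].split()[0]
        for line in fasta_text.splitlines()
        if line.startswith(">") and line[1:].strip()
    }


def ids_missing_sequences(metadata_ids, fasta_text):
    """Return metadata IDs that have no exactly matching FASTA header ID."""
    return sorted(set(metadata_ids) - fasta_ids(fasta_text))
```

For example, `is_valid_date("2023-01-XX")` accepts a date with an unknown day, while `is_valid_date("01/15/2023")` rejects a non-ISO format that should be corrected in the ingest workflow.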