Modernize phylogenetic build [#4]
NOTE: As of this commit, the build _runs_, but only for whole genome,
not for the E1 gene-specific build. Additionally, many aspects of the
build are incorrect, and need to be tuned or revised.
genehack committed Jan 30, 2025
1 parent 4725ec4 commit 6fdbc3d
Showing 22 changed files with 978 additions and 353 deletions.
45 changes: 22 additions & 23 deletions phylogenetic/README.md
@@ -1,37 +1,36 @@
# Phylogenetic
# Phylogenetic workflow

This workflow uses metadata and sequences to produce one or multiple [Nextstrain datasets][]
that can be visualized in Auspice.

Resulting tree is available here: https://nextstrain.org/groups/neherlab/staging/nipah

## Background

See e.g. [Whitmer et al., 2020](https://academic.oup.com/ve/article/7/1/veaa062/5894561)
This workflow uses metadata and sequences to produce one or multiple
[Nextstrain datasets][] that can be visualized in Auspice.

## Data Requirements

The core phylogenetic workflow will use metadata values as-is, so please do any
desired data formatting and curations as part of the [ingest](../ingest/) workflow.
The core phylogenetic workflow will use metadata values as-is, so
please do any desired data formatting and curations as part of the
[ingest][] workflow.

1. The metadata must include an ID column that can be used as an exact match for
the sequence ID present in the FASTA headers.
2. The `date` column in the metadata must be in ISO 8601 date format (i.e. YYYY-MM-DD).
1. The metadata must include an ID column that can be used as an exact
match for the sequence ID present in the FASTA headers.
2. The `date` column in the metadata must be in ISO 8601 date format
(i.e. YYYY-MM-DD).
3. Ambiguous dates should be masked with `XX` (e.g. 2023-01-XX).
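
The three requirements above can be checked before running the workflow. The following is a minimal sketch, not part of this repository: the `accession` column name, the helper names, and the rule that the sequence ID is the first whitespace-delimited token of a FASTA header are all assumptions made for illustration. The date check covers format only (month and day masked with `XX`), not calendar validity.

```python
import re

# Format check only: four-digit year, then month and day as two digits or "XX".
# Assumption: only month and day are ever masked; calendar validity is not checked.
DATE_PATTERN = re.compile(r"\d{4}-(?:\d{2}|XX)-(?:\d{2}|XX)")


def is_valid_date(value: str) -> bool:
    """Return True if value looks like YYYY-MM-DD with optional XX masking."""
    return DATE_PATTERN.fullmatch(value) is not None


def fasta_ids(fasta_text: str) -> set[str]:
    """Collect sequence IDs, assumed to be the first token of each header line."""
    ids = set()
    for line in fasta_text.splitlines():
        if line.startswith(">"):
            tokens = line[1:].split()
            if tokens:
                ids.add(tokens[0])
    return ids


def check_inputs(rows: list[dict], fasta_text: str, id_column: str = "accession") -> list[str]:
    """Return human-readable problems; an empty list means the inputs look consistent."""
    problems = []
    sequence_ids = fasta_ids(fasta_text)
    metadata_ids = {row[id_column] for row in rows}
    for row in rows:
        if not is_valid_date(row["date"]):
            problems.append(f"{row[id_column]}: bad date {row['date']!r}")
        if row[id_column] not in sequence_ids:
            problems.append(f"{row[id_column]}: no matching sequence")
    # Sequences without a metadata row are reported in sorted order for stable output.
    for missing in sorted(sequence_ids - metadata_ids):
        problems.append(f"{missing}: sequence has no metadata row")
    return problems
```

In practice the rows would come from the metadata TSV (for example via `csv.DictReader` with a tab delimiter), and any reported problem is best fixed in the ingest workflow rather than here.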

## Config

The config directory contains all of the default configurations for the phylogenetic workflow.

[config/defaults.yaml](config/defaults.yaml) contains all of the default configuration parameters
used for the phylogenetic workflow. Use Snakemake's `--configfile`/`--config`
options to override these default values.
[defaults/config.yaml][] contains all of the default configuration
parameters used for the phylogenetic workflow. Use Snakemake's
`--configfile`/`--config` options to override these default values.
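
To illustrate what overriding a default means, here is a small sketch of layering an override on top of defaults. It is not Snakemake's own code, and the keys (`filter`, `min_length`, `group_by`, `strain_id_field`) are hypothetical rather than taken from this repository's config. It assumes nested keys are merged key by key rather than replaced wholesale.

```python
from copy import deepcopy


def merge_config(defaults: dict, overrides: dict) -> dict:
    """Return a new dict with overrides layered on top of defaults, recursing into nested dicts."""
    merged = deepcopy(defaults)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            # Both sides are mappings: merge them instead of replacing the whole section.
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


# Hypothetical defaults and a one-key override, for illustration only.
defaults = {
    "strain_id_field": "accession",
    "filter": {"min_length": 100, "group_by": "country"},
}
overrides = {"filter": {"min_length": 5000}}

config = merge_config(defaults, overrides)
```

Under that assumption, an override file only needs to list the keys that differ from the defaults; everything else keeps its default value.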

## Snakefile and rules

The rules directory contains separate Snakefiles (`*.smk`) as modules of the core phylogenetic workflow.
The modules of the workflow are in separate files to keep the main phylogenetic [Snakefile](Snakefile) succinct and organized.
Modules are all [included](https://snakemake.readthedocs.io/en/stable/snakefiles/modularization.html#includes)
in the main Snakefile in the order that they are expected to run.
The rules directory contains separate Snakefiles (`*.smk`) as modules
of the core phylogenetic workflow. The modules of the workflow are in
separate files to keep the main phylogenetic [Snakefile][] succinct and
organized. Modules are all [included][] in the main Snakefile in the
order that they are expected to run.

[defaults/config.yaml]: ./config/defaults.yaml
[included]: https://snakemake.readthedocs.io/en/stable/snakefiles/modularization.html#includes
[ingest]: ../ingest/
[Nextstrain datasets]: https://docs.nextstrain.org/en/latest/reference/glossary.html#term-dataset
[Snakefile]: ./Snakefile