Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Snakemake improvements #70

Open
jameshadfield opened this issue Jul 9, 2024 · 2 comments
Open

Snakemake improvements #70

jameshadfield opened this issue Jul 9, 2024 · 2 comments

Comments

@jameshadfield
Copy link
Member

This is a series of observations about the current situation of the workflow and the complexities we've recently introduced to it as we've adapted it to handle the ongoing cattle-flu outbreak. This includes complexities added in (currently unmerged) Segment builds for H5N1 cattle flu outbreak. The suggestions here are roughly ordered as each builds on previous ones.

Config files

We've probably reached the point where config files would be clarifying rather than confusing. As @joverlee521 notes:

the main Snakefile would be simplified a lot if the param functions were pulled out into config files. If the params were defined in config files, we would be able to remove a lot of the if w.subtype=='h5n1-cattle-outbreak' checks.

I'm imagining here we would have config files for:

  • One each for H5N1, H5Nx, H7N9, H9N2
  • h5n1-cattle-outbreak genome build
  • h5n1-cattle-outbreak segments built (maybe this could be combined with the genome config?)

Optional time wildcard.

The per-segment cattle-flu builds don't have this wildcard, which is why the filename produced by the pipeline doesn't match the desired URL. We can maintain the current ruse of using time=all-time and add a final rule to rename the files or somesuch, but the cleaner way would be to make the wildcard optional. In practice (I think, haven't prototyped) this would mean that the three values of time would be "", "_all-time", "_2y".

Bring genome build into canonical snakemake pipeline

With the above changes, merging these pipelines would be a nice simplification. While they use different parameters, all the rules from tree onwards are the same between the pipelines. I think this could be nicely organised if the input to align becomes a function and when w.subtype=h5n1-cattle-outbreak that function refers to outputs of a rules/h5n1-cattle-outbreak-genome.smk ruleset.

Directory structure within results & config

This is rather minor, but the files counts are starting to become hard to manage! I'd think splitting config/ into one directory per config file (as described above) would work nicely. Similarly for results/.

Data files include source in filename

Currently no matter the upstream source (local ingest, GISIAD data, NCBI data) we get the following data files:

  1. data/metadata.tsv
  2. data/{segment}/sequences.fasta
  3. data/metadata_{subtype}.tsv
  4. data/sequences_{subtype}_{segment}fasta

This means when switching between a GISAID build (H5Nx, H5N1, H7N9, H9N2) to a NCBI build (h5n1-cattle-outbreak) these files will be overwritten¹, and then when you switch back to a GISAID build the entire pipeline needs to rerun.

I think it'd be clarifying to have the per-subtype files (3,4) in results/{subtype}/{metadata.tsv,sequences.fasta}. The rule that produces these would itself have inputs parameterised by the upstream source. Some pseudocode to explain what I'm thinking:

config['s3_upstream'] = {
    'name': 'ncbi-h5n1',
    'sequences': 's3://nextstrain-data/files/workflows/avian-flu/h5n1/sequences.fasta.zst',
    'metadata': 's3://nextstrain-data/files/workflows/avian-flu/h5n1/metadata.tsv.zst',
}

rule fetch_from_s3:
    output:
        f"data/{config['s3_upstream']['name']}/metadata.tsv",
        f"data/{config['s3_upstream']['name']}/sequences.fasta"
    shell:
        ...

def upstream_metadata():
    if LOCAL_INGEST:
        return f"ingest/{INGEST_SOURCE}/results/metadata.tsv",
    return f"data/{config['s3_upstream']['name']}/metadata.tsv",

rule filter_metadata_by_subtype:
    input:
        metadata = upstream_metadata,
    output:
        metadata = "results/{subtype}/metadata.tsv",
    ...

¹ It looks like Snakemake is smart enough to detect when a invocation uses different config params (s3_src etc) than those which produced the existing files in data and regenerates them accordingly. This was a nice surprise to me!

@joverlee521
Copy link
Contributor

One each for H5N1, H5Nx, H7N9, H9N2

In principle I agree with this, but want to be explicit that is will be a significant change in how builds are run. Instead of the current single invocation of nexstrain build, this will require 4 separate ones. This isn't an issue if the builds were automated, but seems like they are still being manually run/maintained so keeping track of 4 separate jobs might be too onerous.

As an alternative, we could potentially follow seasonal-flu's example of one config for all 4 subtypes, but that makes the config file more complex...

the cleaner way would be to make the wildcard optional.

Unfortunately, it's not possible to have an empty wildcard in Snakemake
Testing with minimal Snakefile:

rule all:
    input: expand("output{test_wildcard}.txt", test_wildcard=["_a", "_b", ""])

rule a:
    output: touch("output{test_wildcard}.txt")
$ nextstrain build snakemake/empty-wildcard/ --forceall
Building DAG of jobs...
MissingInputException in rule all in file /nextstrain/build/Snakefile, line 1:
Missing input files for rule all:
    affected files:
        output.txt

Since we are adding segment builds for the cattle-outbreak, I wonder if the cattle-outbreak builds should be nested under the canonical h5n1 builds. Then cattle-outbreak would be considered the "time".

@jameshadfield
Copy link
Member Author

Unfortunately, it's not possible to have an empty wildcard in Snakemake

We can do it, but I haven't thought through all the ramifications. Add the following to your minimal Snakefile:

wildcard_constraints:
    test_wildcard=".*"

Instead of the current single invocation of nexstrain build, this will require 4 separate ones.

Oh! Yeah, that may be a dealbreaker

jameshadfield added a commit that referenced this issue Jul 11, 2024
Part of simplifications described in <#70>

This approach avoids using an empty wildcard for time but introduces the need
to have a separate rule to map auspice datasets (which won't have 'time' for h5n1-cattle-flu)
to a filename which does include a time wildcard.
jameshadfield added a commit that referenced this issue Jul 15, 2024
Part of simplifications described in
<#70>

This approach avoids using an empty wildcard for time but introduces the
need to have a separate rule to map the target Auspice dataset JSON
filename (e.g. `auspice/avian-flu_h5n1-cattle-outbreak_pb2.json`) to a
results file with a time wildcard (e.g.
`results/avian-flu_h5n1-cattle-outbreak_pb2_default.json`). The rest of
the pipeline is unchanged.
jameshadfield added a commit that referenced this issue Jul 25, 2024
Part of simplifications described in
<#70>

This approach avoids using an empty wildcard for time but introduces the
need to have a separate rule to map the target Auspice dataset JSON
filename (e.g. `auspice/avian-flu_h5n1-cattle-outbreak_pb2.json`) to a
results file with a time wildcard (e.g.
`results/avian-flu_h5n1-cattle-outbreak_pb2_default.json`). The rest of
the pipeline is unchanged.
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants