Add Nextclade dataset version tracking file #390
base: master
Conversation
We currently don't track publicly which Nextclade dataset versions are used to produce the nextclade.tsv. This is a feature request for open covSpectrum; see GenSpectrum/cov-spectrum-website#670 (comment). It is also useful for ourselves to know when we last did a full run.

This PR adds a file that contains a line for each dataset download, with the timestamp of the download and the dataset version downloaded. Before a full run, this file needs to get wiped, so instructions are added to the README. The format of the produced file should be considered unstable for now; this is made explicit in the README.

The implementation isn't perfect. It could be improved by syncing better with the rerun touchfile, but it should be good enough given limited resources. This PR is unlikely to corrupt data. It could cause a crash, but that can be mitigated by a fix or revert, so no extensive testing should be necessary before merging. Local testing should be sufficient and was successful.
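For illustration, the version file might then look something like this after a full run followed by two incremental runs (the timestamps and dataset tags below are made up, following the `timestamp dataset_version` format documented in the README):

2023-02-06T14:40:23Z 2023-02-01T12:00:00Z
2023-02-07T03:12:45Z 2023-02-01T12:00:00Z
2023-02-08T03:10:02Z 2023-02-01T12:00:00Z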
This seems like a good idea and will be useful for others too. Might be worth asking Tom for ideas if the current proposed system seems unstable, he often has good insights into this kind of stuff :)
Thanks @emmahodcroft! I have already added @tsibley to the list of reviewers :) I don't think the current implementation is unstable. What I meant was that I don't want us to commit to this version format being considered "stable". We may experiment with it and change it, e.g. make it a tsv with headers, so people using it should just not expect it to never change. A better implementation is always welcome, but given this is not a super crucial feature, I'd be hesitant to sink too much time into it. The medium/long-term solution would be to make nextclade output the version into the nextclade.tsv, then we wouldn't have to hack it in. I remember @rneher seemed open to this, so this code hopefully won't have to stay in there too long. The advantage of being "unstable" is that we could simply retire it any time and say: now the version is tracked in nextclade.tsv and metadata.tsv, and not have to produce that version file anymore.
Ah ok, sorry, I misinterpreted - I thought you meant unstable in a more traditional sense. Seems fair to mark this as a little 'beta' and see how it works/get some ideas about other ways to do it and/or whether to put into the
My usage of
Thinking through the use of the new versions file, I had a couple of questions:
- Do we always do a full re-run of Nextclade when new datasets are released? If not, should we start doing that so that the metadata does not have a mixture of different dataset versions? Then, we only need to record when we do a full re-run in the versions file. With the versions file, we can automate the pipeline to do a full re-run if the dataset versions don't match (a rough sketch follows after this comment).
- If we don't ensure all metadata records are generated from the same dataset, then do we need to report dataset versions per row? (Just writing that out made it seem like overkill, we should probably just focus on option [1].)
- How is this file expected to be used? For cov-spectrum, would they just assume the last line of the file contains the version for the latest metadata? Or would we expect them to do some form of timestamp matching?
We don't need to tackle all of these in this PR, but just wanted to get the full picture.
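As a rough illustration of that automated check, one could compare the last recorded dataset tag against the latest available one before deciding whether to request a full re-run. This is only a sketch: the file name, the use of tail/cut over the space-separated format, and the trigger step are assumptions, not part of this PR; only the nextclade dataset list / jq invocation is taken from the PR's own code.

# Last dataset tag recorded in the version file (format: "timestamp dataset_version")
recorded_tag="$(tail -n1 version_sars-cov-2.txt | cut -d' ' -f2)"
# Latest tag advertised for the dataset
latest_tag="$(nextclade dataset list --name=sars-cov-2 --json | jq -r '.[0].attributes.tag.value')"
if [ "$recorded_tag" != "$latest_tag" ]; then
    echo "Dataset updated ($recorded_tag -> $latest_tag): trigger a full re-run here" >&2
    # e.g. touch the rerun touchfile or upload a .renew marker (illustrative)
fi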
params:
    dst_version_file=config["s3_dst"] + "/version_{dataset_name}.txt",
    src_version_file=config["s3_src"] + "/version_{dataset_name}.txt",
shell:
    """
    ./bin/download-from-s3 {params.dst_version_file} {output.version} 0 || \
    ./bin/download-from-s3 {params.src_version_file} {output.version} 0 || \
    touch {output.version}
It would be best to move this to a separate rule so that we can continue to keep the main pipeline untangled from the S3 interactions. I think this file can follow our current pattern of having a `create_empty_*` and a `download_*_from_s3` rule so that this pipeline can run without S3 interactions.
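A rough sketch of how that pattern might look for the version file (rule names, output paths, and how the pipeline chooses between the two rules are illustrative assumptions, not the actual ncov-ingest rules; the download command is taken from the diff above):

rule create_empty_version:
    # Fallback so the pipeline can run without S3 interactions
    output:
        version="data/version_{dataset_name}.txt",
    shell:
        """
        touch {output.version}
        """

rule download_version_from_s3:
    # Fetch the existing version file from S3, preferring the destination bucket
    params:
        dst_version_file=config["s3_dst"] + "/version_{dataset_name}.txt",
        src_version_file=config["s3_src"] + "/version_{dataset_name}.txt",
    output:
        version="data/version_{dataset_name}.txt",
    shell:
        """
        ./bin/download-from-s3 {params.dst_version_file} {output.version} 0 || \
        ./bin/download-from-s3 {params.src_version_file} {output.version} 0
        """

(How the pipeline selects between the two rules, e.g. via ruleorder or config, is omitted here.)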
Good point, not something I've ever done so wasn't quite aware this was a guarantee we make.
More of maintaining a separation of concerns than a guarantee, I think. (Although there are a few external users of ncov-ingest who might notice if that separation breaks down.)
## Unstable files produced by workflow

- `version_sars-cov-2.txt` and `version_sars-cov-2-21L.txt`: used to track the version of the nextclade dataset used to generate the `nextclade.tsv` and `nextclade_21L.tsv` files. Format: `timestamp dataset_version` (e.g. `2023-02-06T14:40:23Z 2023-02-01T12:00:00Z`) for each run since and including the last full run.
Am I understanding correctly that the first line of the versions file is the last full Nextclade run?
Yes, that's what it should be in practice!
Thanks for your comments and review @joverlee521!
For the 21L dataset we don't need to rerun, except if the calculation logic of the Bloom scores changes. We can rerun, but it's a bit of a waste. I first thought about just using the most recent version, then realized that it could be helpful to track all the versions that contribute to the current cache - this could help debug, especially for 21L, when something unexpected happens. I agree it would be nice to do an automatic full run if there's a new version. This would be a good further PR, but it isn't necessary for this use case (getting at least some readable version information out). The automatic rerun should only apply to the base dataset for now, not 21L. However, given the 21L dataset is of minor importance, we could chuck out the code related to its version to keep the code simple. I don't think it's that useful after all to have that version; it's just 2 not-super-important columns in the metadata.tsv.
I think we may enable such an option in the nextclade.tsv - we already have many columns, and adding a bit of redundant version info could be useful depending on the use case. We could add it; it doesn't do much harm and it compresses well. It would make the code easy/clean.
Good question: since all lines should be the same, they can pick any line. It may not be correct when there's a bug, but given that the version will only be used to display metadata on the website, it's not crucial that it is 100% correct. If something unexpected happens, we can investigate.
My overall comments are the same as @joverlee521's. :-) I concur that it'd be best to ensure that the Nextclade outputs are only ever from a single Nextclade dataset version and never a mixture. That doesn't have to be this PR (but if it was me, I'd probably just do it now rather than put it off).
A few more detailed comments below on some minor issues.
for file in [ "nextclade.tsv.zst.renew", "version_sars-cov-2.txt" ]; do | ||
aws s3 cp - s3://nextstrain-ncov-private/$file < /dev/null | ||
aws s3 cp - s3://nextstrain-data/files/ncov/open/$file < /dev/null | ||
done |
This is syntactically valid bash, but not what you intended. It'd produce four loop iterations where:
file="["
file="nextclade.tsv.zst.renew,"
file="version_sars-cov-2.txt"
file="]"
🙃
You want:

for file in nextclade.tsv.zst.renew version_sars-cov-2.txt; do
    aws s3 cp - s3://nextstrain-ncov-private/$file < /dev/null
    aws s3 cp - s3://nextstrain-data/files/ncov/open/$file < /dev/null
done
for file in [ "nextclade_21L.tsv.zst.renew", "version_sars-cov-2-21L.txt" ]; do | ||
aws s3 cp - s3://nextstrain-ncov-private/$file < /dev/null | ||
aws s3 cp - s3://nextstrain-data/files/ncov/open/$file < /dev/null | ||
done |
Ditto.
./nextclade dataset get --name="{wildcards.dataset_name}" --output-zip={output.dataset} --verbose
printf %s "$(date --utc +%FT%TZ) " >> {output.version}
nextclade dataset list --name="{wildcards.dataset_name}" --json | jq -r '.[0].attributes.tag.value' >>{output.version} |
Mixing of `./nextclade` vs. `nextclade`. Is that a mismatch?
printf %s "$(date --utc +%FT%TZ) " >> {output.version}
nextclade dataset list --name="{wildcards.dataset_name}" --json | jq -r '.[0].attributes.tag.value' >>{output.version}
There should be only one writer to this file, I think, but it'd still be more robust (and IMO clearer) to append the whole line at once rather than in two separate operations.
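For example, the two appends above could be combined into a single write, keeping the same timestamp and jq extraction (a minimal sketch, not necessarily the exact form the PR should use):

# Append the whole "timestamp dataset_version" line in one operation
printf '%s %s\n' "$(date --utc +%FT%TZ)" "$(nextclade dataset list --name="{wildcards.dataset_name}" --json | jq -r '.[0].attributes.tag.value')" >> {output.version}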