ingest: Add build-configs for CI #56

Open
joverlee521 opened this issue Jul 16, 2024 · 7 comments
Labels
enhancement New feature or request

Comments

@joverlee521 (Contributor)

We already have build-configs for CI in the phylogenetic workflow, so I think it's reasonable to add build-configs for CI in the ingest workflow as well. This will make it simpler for the internal team to set up the GH Action workflow using pathogen-repo-ci.

Things to consider

  1. Include a bogus CI config param (e.g. zika)
  2. Consider adding a standardized way to "subsample" the ingest data since some ingest workflows can run too long for a responsive CI job.
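For illustration only, a minimal ingest CI build-config might look like the sketch below. The file path mirrors the convention used by the phylogenetic workflow, but every key here is a hypothetical placeholder, not an established pathogen-repo-ci schema:

```yaml
# Hypothetical build-configs/ci/config.yaml for the ingest workflow.
# All keys below are illustrative placeholders, not an agreed-upon schema.
custom_rules:
  - build-configs/ci/use_example_data.smk   # e.g. swap the live NCBI fetch for cached/example data
ncbi_taxon_id: "64320"                      # bogus param mirroring the zika example in this issue
subsample:
  max_sequences: 100                        # cap records so the CI job stays responsive
```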
@joverlee521 (Contributor Author)

Thinking out loud on options to "subsample" ingest data.

  1. Instead of fetching from NCBI, start with an example NCBI Dataset ZIP that is a small subset of the data. This is likely not sustainable as we would have to update the example data every time NCBI Datasets updates their schema.
  2. Filter the NCBI Datasets data during fetch using available CLI options, e.g. --geo-location or --released-after. This doesn't guarantee that we will fetch a "small" subset of data, but at least it's not all data.
  3. Fetch the full NCBI Dataset then filter the outputs locally. It's not clear to me that this would reduce run time as the filtering step might take a while?
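To gauge the cost of option [3]: once the full dataset is on disk, random downsampling of FASTA records is cheap in plain Python. This is a hypothetical sketch (file names, sample size, and the helper names are all illustrative, not part of any existing ingest workflow):

```python
# Sketch of option [3]: fetch the full NCBI Dataset, then downsample the
# sequences locally before the rest of the ingest workflow runs.
# File paths and sample size are illustrative assumptions.
import random

def read_fasta(path):
    """Yield (header, sequence) tuples from a FASTA file."""
    header, seq = None, []
    with open(path) as fh:
        for line in fh:
            line = line.rstrip("\n")
            if line.startswith(">"):
                if header is not None:
                    yield header, "".join(seq)
                header, seq = line, []
            elif line:
                seq.append(line)
        if header is not None:
            yield header, "".join(seq)

def downsample_fasta(in_path, out_path, n, seed=0):
    """Write a reproducible random sample of at most n records to out_path."""
    records = list(read_fasta(in_path))
    random.seed(seed)
    sample = random.sample(records, min(n, len(records)))
    with open(out_path, "w") as out:
        for header, seq in sample:
            out.write(f"{header}\n{seq}\n")
    return len(sample)
```

The seed keeps CI runs reproducible; the actual bottleneck question (download time vs. filter time) still stands regardless.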

@victorlin (Member)

I'm assuming there are two goals of ingest CI:

a. Ensure an update to the ingest workflow works with existing NCBI data
b. Ensure new data from NCBI works with the existing ingest workflow

Re: the 3 options above

  1. I would consider this if (b) is not a high priority. Or, if NCBI Datasets only updates their schema occasionally, it might be fine as long as there are other scheduled runs on the full data that would clearly surface the need to update.
  2. The failure modes of (a) should be considered for any filtering. Would it be weird outliers from unknown/new locations? If yes, then --geo-location wouldn't be a good filter.
  3. Is the goal here to apply downsampling rules that aren't available via NCBI Datasets CLI (e.g. random sampling)? It would be good to see where the time bottleneck is. If data download is slow, then this option wouldn't make a difference whereas the CLI options should (?) filter things server-side which would be faster. Also, what tool would be used to do the filtering? I imagine that once all the data is on disk, downsampling should be fast with the right tool.

@joverlee521 (Contributor Author)

Thanks for the thoughts here @victorlin! I agree we care about (a) more to ensure the ingest workflow runs as expected.

@corneliusroemer has raised nextstrain/measles#46 for not relying on external services in CI, which aligns with option [1].

I'll probably port whatever we implement in measles into this repo.

@corneliusroemer (Member)

Good discussion! I hadn't seen this here before the measles PR.

I'm not sure how often datasets-cli changes its schema. I don't expect this to happen very often, but I might be wrong.

If the zip package/schema changes, we could just start from test files downstream of all ncbi datasets commands, so we don't rely on the stability of NCBI's somewhat-internal API (it's not clear how internal the downloaded package format is; we'll find out).

@joverlee521 (Contributor Author)

Copying @tsibley 's relevant comments from the measles issue:

> I think we should not use ncbi datasets in CI, and instead use an archived zip file that is identical to what datasets would produce.
>
> The tradeoff is that we'll have to keep the example NCBI dataset file up-to-date with upstream changes, else we won't be testing what NCBI's actually providing and will drift over time.

@joverlee521 (Contributor Author)

Stepping back to consider the goals of ingest CI that @victorlin astutely pointed out:

a. Ensure an update to the ingest workflow works with existing NCBI data
b. Ensure new data from NCBI works with the existing ingest workflow

I think (a) is the more frequent check within a pathogen repo, while (b) is important to test when we update the NCBI Datasets version in docker-base/conda-base. I don't think there's a simple way to separate those two concerns with the current pathogen-repo-ci workflow.
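One hedged way to approximate that separation outside of pathogen-repo-ci is with plain GitHub Actions triggers; the cron schedule and the split itself are purely illustrative, not an existing feature of the shared workflow:

```yaml
# Hypothetical trigger split: pushes exercise goal (a) against known data,
# while a scheduled run exercises goal (b) against fresh NCBI data.
on:
  push:                     # goal (a): workflow changes vs. known/cached data
  schedule:
    - cron: "0 8 * * *"     # goal (b): known workflow vs. live NCBI data
```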

@joverlee521 (Contributor Author)

joverlee521 commented Jul 19, 2024

From yesterday's dev chat:

Consensus: we are good with caching ingest downloads for CI purposes, particularly when a repo is also automatically trying to rebuild daily and running “live” ingest during that process
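That consensus could be sketched as a GitHub Actions cache step. `actions/cache` is a real action, but the path and key scheme below are assumptions about this repo's layout, not a worked-out design:

```yaml
# Hypothetical workflow step: cache the NCBI download between CI runs so the
# ingest job doesn't hit NCBI on every push. Path and key are illustrative.
- uses: actions/cache@v4
  with:
    path: ingest/data/ncbi_dataset.zip
    # Rotate the cache when the ingest config (and thus the fetch) changes;
    # the daily "live" ingest run still exercises the real download.
    key: ncbi-dataset-${{ hashFiles('ingest/defaults/config.yaml') }}
```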
