Improvements of the genetics etl for platform integration #57

Open
project-defiant opened this issue Oct 23, 2024 · 1 comment
Collaborator

project-defiant commented Oct 23, 2024

Context

The genetics_etl DAG, described by the image below,
Image

should be possible to execute in two modes:

  • run with the unified pipeline
  • run in standalone mode

Here are the possible improvements I can see that would adapt the genetics_etl DAG to meet the above conditions:

  1. Extract variant_to_vcf and list_nonannotated_variants as a single dataproc step

Currently the variant_to_vcf step uses sources from the ETL, namely:

  • gs://open-targets-pre-data-releases/24.09/input/evidence-files/uniprot.json.gz
  • gs://open-targets-pre-data-releases/24.09/input/evidence-files/eva.json.gz
  • gs://open-targets-pre-data-releases/24.09/input/pharmacogenomics-inputs/pharmacogenomics.json.gz
  • evidence-files come directly from the PIS (no other dependencies)
  • pharmacogenomics-inputs are generated by the platform ETL

Currently, the first step, variant_to_vcf, runs as a Google Batch job, and the second step, list_nonannotated_variants, runs as a standalone task (PythonOperator). The two could be merged and run as a single Dataproc step. This would decrease the complexity of the pipeline by removing the batch job and its configuration, making the pipeline more generic and easier to transfer to the unified pipeline.
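As a rough illustration of what merging the two tasks into one step could look like, here is a minimal plain-Python sketch. Everything in it is an assumption made for illustration: the `variantId` field and its `CHROM_POS_REF_ALT` layout, the function bodies, and the idea that list_nonannotated_variants filters out variants that already have an annotation. The real gentropy steps and evidence schemas may differ.

```python
from typing import Iterable


def variant_to_vcf(records: Iterable[dict]) -> list[str]:
    """Convert evidence records into sorted, de-duplicated VCF data lines.

    Assumption: each record has a "variantId" shaped like CHROM_POS_REF_ALT.
    """
    variants = set()
    for record in records:
        variant_id = record.get("variantId")
        if not variant_id:
            continue  # skip records that carry no variant
        chrom, pos, ref, alt = variant_id.split("_")
        variants.add((chrom, int(pos), ref, alt))
    return [
        f"{chrom}\t{pos}\t{chrom}_{pos}_{ref}_{alt}\t{ref}\t{alt}\t.\t.\t."
        for chrom, pos, ref, alt in sorted(variants)
    ]


def list_nonannotated_variants(vcf_lines: list[str], annotated_ids: set[str]) -> list[str]:
    """Keep only VCF lines whose ID column is not in the annotated set."""
    return [line for line in vcf_lines if line.split("\t")[2] not in annotated_ids]


def run_combined_step(records: Iterable[dict], annotated_ids: set[str]) -> list[str]:
    """Hypothetical single entry point that chains both tasks in one job."""
    return list_nonannotated_variants(variant_to_vcf(records), annotated_ids)
```

With a single entry point like `run_combined_step`, the orchestrator would only need to submit one job instead of wiring a batch job and a separate Python task together.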

  2. Extract the configuration of the steps and merge it into a unified pipeline config powered by Hydra. This would return the steps to the way they were previously handled in gentropy (see https://github.com/opentargets/gentropy/tree/v1.7.0/config), but simplified by hosting only the config for the genetics_etl steps.

This would mean that we could store one config for each way we run the pipeline:

  • default config for standalone execution mode
  • one config for unified pipeline that overrides the default config paths.
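The default-plus-override idea can be sketched in plain Python as below. This is not Hydra itself, only an illustration of the override semantics that Hydra would provide through its config composition. The config keys, the `gs://example-bucket/...` paths, and the `load_config` helper are all hypothetical placeholders.

```python
from copy import deepcopy

# Hypothetical default config used for standalone execution.
DEFAULT_CONFIG = {
    "variant_to_vcf": {
        "source_paths": ["gs://example-bucket/standalone/evidence/eva.json.gz"],
        "output_path": "gs://example-bucket/standalone/output/variants",
    },
}

# Hypothetical overrides for the unified pipeline: only paths that differ.
UNIFIED_PIPELINE_OVERRIDES = {
    "variant_to_vcf": {
        "output_path": "gs://example-bucket/unified/output/variants",
    },
}


def merge_config(base: dict, overrides: dict) -> dict:
    """Recursively overlay `overrides` on a copy of `base`."""
    merged = deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


def load_config(mode: str) -> dict:
    """Return the config for the given execution mode."""
    if mode == "standalone":
        return deepcopy(DEFAULT_CONFIG)
    if mode == "unified":
        return merge_config(DEFAULT_CONFIG, UNIFIED_PIPELINE_OVERRIDES)
    raise ValueError(f"Unknown execution mode: {mode}")
```

Keeping the unified pipeline config as a small set of overrides means any step parameter that is not path-related stays defined in one place.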
@project-defiant
Collaborator Author

@javfg This is a summary of the things we discussed.
