Improvements of the genetics etl for platform integration #57

project-defiant · 2024-10-23T07:05:59Z

Context

The genetics_etl DAG (shown as an image in the original issue) should be executable in two modes:

run with the unified pipeline
run in standalone mode

Below are the possible improvements I can see that would adapt the genetics_etl DAG to meet both conditions:

Extract variant_to_vcf and list_nonannotated_variants as a single dataproc step

Currently the variant_to_vcf step uses sources from the ETL, namely:

gs://open-targets-pre-data-releases/24.09/input/evidence-files/uniprot.json.gz
gs://open-targets-pre-data-releases/24.09/input/evidence-files/eva.json.gz
gs://open-targets-pre-data-releases/24.09/input/pharmacogenomics-inputs/pharmacogenomics.json.gz

evidence-files come directly from the PIS (no other dependencies)
pharmacogenomics-inputs are generated by the platform ETL

Currently the first step, variant_to_vcf, runs as a Google Batch job, and the second step, list_nonannotated_variants, runs as a standalone task (PythonOperator). The two can be merged and run as a single dataproc step. This would decrease the complexity of the pipeline by removing the batch job and its configuration, making the DAG more generic and easier to transfer to the unified pipeline.
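A rough sketch of what merging the two steps could look like, in plain Python rather than Spark. Everything here is an assumption made for illustration: the `CHROM_POS_REF_ALT` identifier format, the function signatures, the `annotated_ids` input, and the `run_single_step` entry point are hypothetical and are not taken from the actual gentropy or genetics_etl code.

```python
from typing import Iterable


def variant_to_vcf(variant_ids: Iterable[str]) -> dict[str, str]:
    """Map each variant id to a minimal 8-column VCF row.

    Assumes ids look like CHROM_POS_REF_ALT (hypothetical input format).
    Duplicate ids collapse into a single row.
    """
    rows: dict[str, str] = {}
    for variant_id in variant_ids:
        chrom, pos, ref, alt = variant_id.split("_")
        rows[variant_id] = "\t".join([chrom, pos, variant_id, ref, alt, ".", ".", "."])
    return rows


def list_nonannotated_variants(vcf_rows: dict[str, str], annotated_ids: set[str]) -> list[str]:
    """Keep only the rows whose variant id has no annotation yet."""
    return [row for vid, row in sorted(vcf_rows.items()) if vid not in annotated_ids]


def run_single_step(variant_ids: Iterable[str], annotated_ids: set[str]) -> list[str]:
    """Hypothetical merged entry point: both former tasks run in one call."""
    return list_nonannotated_variants(variant_to_vcf(variant_ids), annotated_ids)
```

With a single entry point like this, the DAG would only need to submit one job instead of wiring a batch job to a separate Python task.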

Extract the configuration of the steps and merge it into a unified pipeline config powered by Hydra. This would roll the steps back to the way they were handled before in gentropy (see https://github.com/opentargets/gentropy/tree/v1.7.0/config), but simplified by hosting only the config for the genetics_etl steps.

This would mean that we could store one config for each way we run the pipeline:

default config for standalone execution mode
one config for unified pipeline that overrides the default config paths.
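The override idea can be illustrated with a small stdlib-only sketch. Hydra would normally compose YAML config files itself, so this is not how Hydra is invoked; it only shows the intended behaviour of a default config whose paths are replaced by a unified-pipeline config. The config keys, the `merge_config` helper, and the `gs://example-standalone-bucket` paths are hypothetical. Only the `open-targets-pre-data-releases` path comes from the issue.

```python
from copy import deepcopy

# Hypothetical default config for standalone execution mode.
DEFAULT_CONFIG = {
    "mode": "standalone",
    "variant_to_vcf": {
        "source_paths": ["gs://example-standalone-bucket/input/uniprot.json.gz"],
        "output_path": "gs://example-standalone-bucket/output/vcf",
    },
}

# Hypothetical unified-pipeline config: lists only the values that differ.
UNIFIED_OVERRIDES = {
    "mode": "unified",
    "variant_to_vcf": {
        "source_paths": [
            "gs://open-targets-pre-data-releases/24.09/input/evidence-files/uniprot.json.gz",
        ],
    },
}


def merge_config(base: dict, overrides: dict) -> dict:
    """Return a copy of base with overrides applied recursively.

    Nested dicts are merged key by key; any other value (lists included)
    is replaced wholesale by the override.
    """
    merged = deepcopy(base)
    for key, value in overrides.items():
        if isinstance(value, dict) and isinstance(merged.get(key), dict):
            merged[key] = merge_config(merged[key], value)
        else:
            merged[key] = deepcopy(value)
    return merged


unified_config = merge_config(DEFAULT_CONFIG, UNIFIED_OVERRIDES)
```

Keys that the unified config does not mention, such as `output_path` here, keep their default values, so the override file stays limited to the paths that actually change.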


project-defiant · 2024-10-23T10:34:34Z

@javfg This is a summary of the things we discussed.

project-defiant mentioned this issue Oct 25, 2024

Pipeline unification opentargets/issues#3394

Open

27 tasks
