
feat(gwas_catalog) sumstat harmonisation dag #55

Draft · wants to merge 3 commits into dev

Conversation

@project-defiant (Collaborator) commented Oct 22, 2024

Context

The aim of this PR is to add a DAG that can harmonise the full set of ~70k GWAS Catalog summary statistics.

We want to be able to:

  • Harmonise summary statistics in large batches (~10k sumstats each)
  • Run harmonisation and summary statistics QC
  • Collect metrics and logs from each run

Implementation details

To achieve this we decided to set up a Google Batch job for each batch of ~10k sumstats.

Each batch task will be an instance of the script that performs a small pipeline as described below:

  • download the sumstat file
  • unzip the sumstat file
  • harmonise the sumstat file
  • dump the harmonisation result
  • save the outcome of the harmonisation
  • run sumstat QC
  • dump the QC result
  • save the outcome of the QC
  • dump the outcome as a metadata table
  • dump logs from the script execution
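
The steps above can be sketched as a small driver that records each step's exit code for the later metadata table; a minimal sketch with hypothetical step names (the real script wraps the gentropy harmonisation and QC steps, which are not shown here):

```python
import json
import logging
from pathlib import Path
from typing import Callable

def run_pipeline(
    steps: dict[str, Callable[[], int]],
    metrics_path: Path,
    log_path: Path,
) -> dict[str, int]:
    """Run each named step, recording its exit code; stop at the first failure."""
    logging.basicConfig(filename=log_path, level=logging.INFO)
    metrics: dict[str, int] = {}
    for name, step in steps.items():
        logging.info("starting step %s", name)
        code = step()
        metrics[name] = code
        if code != 0:
            logging.error("step %s failed with exit code %d", name, code)
            break
    # dump the outcome so the orchestrator can decide how to act
    metrics_path.write_text(json.dumps(metrics))
    return metrics
```

A failed harmonisation short-circuits the QC step, but the metrics file is still written, so every task leaves a trace regardless of outcome.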

This eventually requires:

  • configuration of the harmonisation step
    • raw sumstat input file
    • harmonisation output file
  • configuration of the QC step
    • harmonisation output as the input file (passed from the harmonisation step)
    • QC output file
    • QC metrics

and additionally:

  • path to store the logs
  • path to store the metrics from the run (exit codes from harmonisation and QC, size of the file after unpacking)
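
These parameters could be grouped into a single per-task config container; a sketch with hypothetical field names (not the actual gentropy step configs):

```python
from dataclasses import dataclass

@dataclass
class HarmonisationTaskConfig:
    """Per-task configuration; field names are illustrative only."""
    raw_sumstat_path: str    # raw sumstat input file
    harmonised_path: str     # harmonisation output file
    qc_output_path: str      # QC output file
    qc_metrics_path: str     # QC metrics
    log_path: str            # where the script execution logs go
    run_metrics_path: str    # exit codes, size of the file after unpacking, ...

    @property
    def qc_input_path(self) -> str:
        # the QC step consumes the harmonisation output
        return self.harmonised_path
```

Deriving the QC input from the harmonisation output (rather than configuring it twice) keeps the two steps consistent by construction.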

Note

The above requirements force us to create an additional layer (between the orchestration and gentropy) that invokes gentropy steps in a pipeline and collects the metrics and logs, so that the orchestrator can decide how to act.

To do that I have created an image based on the original gentropy image - europe-west1-docker.pkg.dev/open-targets-genetics-dev/gentropy-app/gentropy:dev.

The new image amends the script for harmonisation and is tagged as europe-west1-docker.pkg.dev/open-targets-genetics-dev/gentropy-app/ot_gentropy:${ref}.

Batch job

Despite its benefits, the Google Batch job is not ideal, for a number of reasons:

  1. The default operator for the Google Batch job requires the job definition to be specified before the operator is called. This means that all of the job components need to be derived in the DAG, rather than from the config.

The solution to this issue is to inherit from the default operator and add a method that parses the job definition from the configuration.
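
A sketch of that parsing logic, shown here as a standalone function that builds the job definition (in the dict form of a google.cloud.batch_v1 Job) from a plain config mapping; the config keys and default resource values are assumptions, not the operator's actual interface:

```python
def job_definition_from_config(config: dict) -> dict:
    """Translate a static config mapping into a Batch job definition dict."""
    return {
        "task_groups": [
            {
                "task_count": config["task_count"],
                "task_spec": {
                    "runnables": [
                        {
                            "container": {
                                "image_uri": config["image"],
                                "commands": config["commands"],
                            }
                        }
                    ],
                    "compute_resource": {
                        # illustrative defaults, overridable from config
                        "cpu_milli": config.get("cpu_milli", 2000),
                        "memory_mib": config.get("memory_mib", 4096),
                    },
                },
            }
        ],
        "allocation_policy": {
            "instances": [{"policy": {"machine_type": config["machine_type"]}}]
        },
    }
```

With a method like this on the subclass, the DAG only passes a config object through, and the job shape lives in configuration rather than in DAG code.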

  2. To control the number of tasks inside the Google Batch job definition we need to define the task environments, which hold the business components specific to each task (for harmonisation these are the paths to the input and output files). The size of the task environments list determines how many tasks there are.

As we do not control the names of the inputs and outputs from a static config, we need to define the environments dynamically by listing the contents of the input and output directories and generating a todo manifest that sets up the input and output paths. Each row of the manifest corresponds to a single environment.
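
The manifest generation could look roughly like this; the glob pattern and environment variable names are assumptions for illustration:

```python
from pathlib import Path

def build_manifest(input_dir: Path, output_dir: Path) -> list[dict[str, str]]:
    """List raw sumstat files and derive one input/output row per batch task."""
    rows = []
    for raw in sorted(input_dir.glob("*.tsv.gz")):  # hypothetical raw sumstat layout
        harmonised = output_dir / raw.name.replace(".tsv.gz", ".parquet")
        if harmonised.exists():
            continue  # already harmonised in a previous run, nothing to do
        rows.append(
            {"RAW_SUMSTAT_PATH": str(raw), "HARMONISED_PATH": str(harmonised)}
        )
    return rows
```

Skipping rows whose output already exists makes re-running the DAG after partial failures cheap, since only the remaining sumstats are scheduled.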

  3. The size of the TaskEnvironments payload sent to Batch cannot exceed the API limit, otherwise you see the error:
400 Request payload size exceeds the limit: 10485760 bytes

This means that we need to limit the number of tasks scheduled in a single batch job to ~10-40k.

To resolve this issue we need to partition the input dataset (no matter the source) into n batches, each corresponding to a single Google Batch job.
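
The partitioning itself can be a plain chunking of the manifest rows; a sketch, with the chunk size as an assumed parameter:

```python
def partition_manifest(rows: list, max_tasks_per_job: int = 10_000) -> list[list]:
    """Split manifest rows into chunks, one chunk per Google Batch job, so each
    job's TaskEnvironments payload stays under the 10485760-byte API limit."""
    return [
        rows[i : i + max_tasks_per_job]
        for i in range(0, len(rows), max_tasks_per_job)
    ]
```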

As a result, a single batch task that runs the harmonisation job can produce the logs and metrics of the job execution in a Google Batch specific environment without touching the business logic, so gentropy can stay platform agnostic.

BatchIndex

I have introduced the concepts of BatchIndexOperator and GeneticsBatchJobOperator. These two operators are required to run in sequence: BatchIndexOperator -> GeneticsBatchJobOperator. The first operator is responsible for defining the BatchIndexRow, a container object storing the cli commands and google.cloud.batch_v1 environment objects. By defining the environments, this operator determines the number of corresponding Google Batch tasks that will be executed within the batch runs.

The BatchIndexOperator specifies the interface that needs to be satisfied by the implementator function, which should consume input parameters coming from the task parameters and output a BatchIndex that implements the task environments list.

The output should be consumed by the GeneticsBatchJobOperator.
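
A minimal sketch of these containers; the real objects wrap google.cloud.batch_v1 Environment instances, but plain dicts are used here to keep the example self-contained:

```python
from dataclasses import dataclass, field

@dataclass
class BatchIndexRow:
    """Pairs the cli commands of one task with its environment variables."""
    commands: list[str]
    environment: dict[str, str]

@dataclass
class BatchIndex:
    """Output of the implementator function, consumed by the batch job operator."""
    rows: list[BatchIndexRow] = field(default_factory=list)

    @property
    def task_count(self) -> int:
        # one Google Batch task per manifest row
        return len(self.rows)

    def task_environments(self) -> list[dict[str, str]]:
        """The list that becomes the job's TaskEnvironments payload."""
        return [row.environment for row in self.rows]
```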
