
feat(gwas_catalog) sumstat harmonisation dag #55

Draft · wants to merge 3 commits into dev

Conversation

@project-defiant (Collaborator) commented Oct 22, 2024

Context

The aim of this PR is to add a DAG that can harmonise the full set of ~70k GWAS Catalog summary statistics.

We want to be able to:

  • Harmonise summary statistics in large batches (~10k sumstats each)
  • Run harmonisation and summary statistics QC
  • Collect metrics and logs from each run

Implementation details

To achieve this we decided to set up a Google Batch job for each batch of ~10k sumstats.

Each batch task will be an instance of the script that performs a small pipeline as described below:

  • download the sumstat file
  • unzip the sumstat file
  • harmonise the sumstat file
  • dump the harmonisation result
  • save the outcome of the harmonisation
  • run sumstat QC
  • dump the QC result
  • save the outcome of the QC
  • dump the outcome as a metadata table
  • dump logs from the script execution
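
The steps above can be sketched as a small driver that records each step's exit code for the later metadata table; a minimal sketch with hypothetical step names (the real script wraps the gentropy harmonisation and QC steps, which are not shown here):

```python
import json
import logging
from pathlib import Path
from typing import Callable

def run_pipeline(
    steps: dict[str, Callable[[], int]],
    metrics_path: Path,
    log_path: Path,
) -> dict[str, int]:
    """Run each named step, recording its exit code; stop at the first failure."""
    logging.basicConfig(filename=log_path, level=logging.INFO)
    metrics: dict[str, int] = {}
    for name, step in steps.items():
        logging.info("starting step %s", name)
        code = step()
        metrics[name] = code
        if code != 0:
            logging.error("step %s failed with exit code %d", name, code)
            break
    # dump the outcome so the orchestrator can decide how to act
    metrics_path.write_text(json.dumps(metrics))
    return metrics
```

A failed harmonisation short-circuits the QC step, but the metrics file is still written, so every task leaves a trace regardless of outcome.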

This eventually requires:

  • configuration of the harmonisation step
    • raw sumstat input file
    • harmonisation output file
  • configuration of the QC step
    • harmonisation output as the input file (passed from the harmonisation step)
    • QC output file
    • QC metrics

and additionally:

  • path to store the logs
  • path to store the metrics from the run (exit codes from harmonisation and QC, size of the file after unpacking)
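
These parameters could be grouped into a single per-task config container; a sketch with hypothetical field names (not the actual gentropy step configs):

```python
from dataclasses import dataclass

@dataclass
class HarmonisationTaskConfig:
    """Per-task configuration; field names are illustrative only."""
    raw_sumstat_path: str    # raw sumstat input file
    harmonised_path: str     # harmonisation output file
    qc_output_path: str      # QC output file
    qc_metrics_path: str     # QC metrics
    log_path: str            # where the script execution logs go
    run_metrics_path: str    # exit codes, size of the file after unpacking, ...

    @property
    def qc_input_path(self) -> str:
        # the QC step consumes the harmonisation output
        return self.harmonised_path
```

Deriving the QC input from the harmonisation output (rather than configuring it twice) keeps the two steps consistent by construction.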

Note

The above requirements force us to create an additional layer (between the orchestration and gentropy) that invokes gentropy steps in a pipeline and collects the metrics and logs, so that the orchestrator can decide how to act.

To do that I have created an image based on the original gentropy image - europe-west1-docker.pkg.dev/open-targets-genetics-dev/gentropy-app/gentropy:dev.

The new image amends the script for harmonisation and is tagged as europe-west1-docker.pkg.dev/open-targets-genetics-dev/gentropy-app/ot_gentropy:${ref}.

Batch job

Despite its benefits, the Google Batch job is not ideal, for a number of reasons:

  1. The default operator for the Google Batch job requires the job definition to be specified before the operator is called. This means that all of the job components need to be derived in the DAG, rather than from the config.

The solution to this issue is to inherit from the default operator and add a method that parses the job definition from the configuration.
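
A sketch of that parsing logic, shown here as a standalone function that builds the job definition (in the dict form of a google.cloud.batch_v1 Job) from a plain config mapping; the config keys and default resource values are assumptions, not the operator's actual interface:

```python
def job_definition_from_config(config: dict) -> dict:
    """Translate a static config mapping into a Batch job definition dict."""
    return {
        "task_groups": [
            {
                "task_count": config["task_count"],
                "task_spec": {
                    "runnables": [
                        {
                            "container": {
                                "image_uri": config["image"],
                                "commands": config["commands"],
                            }
                        }
                    ],
                    "compute_resource": {
                        # illustrative defaults, overridable from config
                        "cpu_milli": config.get("cpu_milli", 2000),
                        "memory_mib": config.get("memory_mib", 4096),
                    },
                },
            }
        ],
        "allocation_policy": {
            "instances": [{"policy": {"machine_type": config["machine_type"]}}]
        },
    }
```

With a method like this on the subclass, the DAG only passes a config object through, and the job shape lives in configuration rather than in DAG code.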

  2. To control the number of tasks inside the Google Batch job definition we need to define the task environments, which hold the business components specific to each task (for harmonisation these are the paths to the input and output files). The size of the task environments list determines how many tasks there are.

As we do not control the names of the inputs and outputs from a static config, we need to define the environments dynamically by listing the contents of the input and output directories and generating a todo manifest that sets up the input and output paths. Each row of the manifest corresponds to a single environment.
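
The manifest generation could look roughly like this; the glob pattern and environment variable names are assumptions for illustration:

```python
from pathlib import Path

def build_manifest(input_dir: Path, output_dir: Path) -> list[dict[str, str]]:
    """List raw sumstat files and derive one input/output row per batch task."""
    rows = []
    for raw in sorted(input_dir.glob("*.tsv.gz")):  # hypothetical raw sumstat layout
        harmonised = output_dir / raw.name.replace(".tsv.gz", ".parquet")
        if harmonised.exists():
            continue  # already harmonised in a previous run, nothing to do
        rows.append(
            {"RAW_SUMSTAT_PATH": str(raw), "HARMONISED_PATH": str(harmonised)}
        )
    return rows
```

Skipping rows whose output already exists makes re-running the DAG after partial failures cheap, since only the remaining sumstats are scheduled.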

  3. The size of the TaskEnvironments payload sent to Batch cannot exceed the API limit, otherwise you see the error:
400 Request payload size exceeds the limit: 10485760 bytes

This means that we need to limit the number of tasks scheduled in a single batch job to ~10-40k.

To resolve this issue we need to partition the input dataset (no matter the source) into n batches, each corresponding to a single Google Batch job.
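
The partitioning itself can be a plain chunking of the manifest rows; a sketch, with the chunk size as an assumed parameter:

```python
def partition_manifest(rows: list, max_tasks_per_job: int = 10_000) -> list[list]:
    """Split manifest rows into chunks, one chunk per Google Batch job, so each
    job's TaskEnvironments payload stays under the 10485760-byte API limit."""
    return [
        rows[i : i + max_tasks_per_job]
        for i in range(0, len(rows), max_tasks_per_job)
    ]
```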

As a result, a single batch task that runs the harmonisation job can produce the logs and metrics of the job execution in a Google Batch specific environment without touching the business logic, so gentropy can stay platform agnostic.

BatchIndex

I have introduced the concepts of BatchIndexOperator and GeneticsBatchJobOperator. These two operators are required to run in sequence: BatchIndexOperator -> GeneticsBatchJobOperator. The first operator is responsible for defining the BatchIndexRow, a container object storing the cli commands and google.cloud.batch_v1 environment objects. By defining the environments, this operator determines the number of corresponding Google Batch tasks that will be executed within the batch runs.

The BatchIndexOperator specifies the interface that needs to be satisfied by the implementator function, which should consume input parameters coming from the task parameters and output a BatchIndex that implements the task environments list.

The output should be consumed by the GeneticsBatchJobOperator.
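
A minimal sketch of these containers; the real objects wrap google.cloud.batch_v1 Environment instances, but plain dicts are used here to keep the example self-contained:

```python
from dataclasses import dataclass, field

@dataclass
class BatchIndexRow:
    """Pairs the cli commands of one task with its environment variables."""
    commands: list[str]
    environment: dict[str, str]

@dataclass
class BatchIndex:
    """Output of the implementator function, consumed by the batch job operator."""
    rows: list[BatchIndexRow] = field(default_factory=list)

    @property
    def task_count(self) -> int:
        # one Google Batch task per manifest row
        return len(self.rows)

    def task_environments(self) -> list[dict[str, str]]:
        """The list that becomes the job's TaskEnvironments payload."""
        return [row.environment for row in self.rows]
```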
