Spark on Dataproc works best when its input data is already in Parquet format, adequately partitioned, and available to read from a Google Storage bucket.
However, almost no external data sources meet those criteria. This leads to complicated and inefficient pipelines in which Spark struggles to read large, non-partitioned, gzip-compressed TSV or CSV files.
The modules in this repository solve this problem by implementing a separate, non-Spark preprocessing step that runs on Google Batch. The data itself is not modified, but it is repartitioned appropriately and stored in Parquet format in Google Storage.
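The gist of such a preprocessing step can be illustrated with a minimal, hypothetical sketch. It assumes pandas, pyarrow, and gcsfs are available, and uses made-up bucket paths and a made-up partition column; the repository's actual modules are more involved.

```python
import pandas as pd
import pyarrow as pa
import pyarrow.parquet as pq

# Hypothetical input and output locations, not the repository's real paths.
SOURCE = "gs://example-bucket/raw/data.tsv.gz"
DESTINATION = "gs://example-bucket/parquet/example_dataset"

# Stream the gzip-compressed TSV in chunks so a single large file does not
# need to fit in memory, then append each chunk to a partitioned Parquet
# dataset in Google Storage.
for chunk in pd.read_csv(SOURCE, sep="\t", compression="gzip", chunksize=1_000_000):
    table = pa.Table.from_pandas(chunk, preserve_index=False)
    # "chromosome" is an assumed partition column, used here for illustration only.
    pq.write_to_dataset(table, root_path=DESTINATION, partition_cols=["chromosome"])
```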
All commands must be run from the root of the repository.
```bash
./submit.py --help
./submit.py eqtl_catalogue
```
The second command submits a batch job to Google Cloud Batch for the eQTL Catalogue data source. You can monitor its progress in the web interface at https://console.cloud.google.com/batch/jobs.
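Job status can also be checked programmatically. A minimal sketch, assuming the google-cloud-batch client library is installed; the project and region below are placeholders.

```python
from google.cloud import batch_v1

client = batch_v1.BatchServiceClient()

# List the jobs in the region where the batch job was submitted and print
# each job's name and current state (e.g. QUEUED, RUNNING, SUCCEEDED, FAILED).
parent = "projects/my-project/locations/europe-west1"
for job in client.list_jobs(parent=parent):
    print(job.name, job.status.state.name)
```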
Code is deployed to gs://gentropy-tmp/batch/staging/code.

Logs for each run will be saved to gs://gentropy-tmp/batch/staging/logs.

The outputs will be saved to gs://gentropy-tmp/batch/output.
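Once a run finishes, Spark on Dataproc can read the preprocessed data directly. A minimal sketch, assuming the output for a given data source is written under a subdirectory of the output prefix (the exact layout may differ):

```python
from pyspark.sql import SparkSession

spark = SparkSession.builder.getOrCreate()

# Read the partitioned Parquet output produced by the preprocessing step.
df = spark.read.parquet("gs://gentropy-tmp/batch/output/eqtl_catalogue")
df.printSchema()
```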