# Executing our pipeline on Google Cloud

We offer preliminary cloud support for the WASP and counts pipelines. Follow these instructions to run those pipelines on GTEx data.

## Setup

This setup generally follows Snakemake's Google Life Sciences Tutorial.

1. Set up the config file according to the directions in the WASP README.
2. Add your Google Application Credentials and enable the requisite APIs.
3. Create a cloud storage bucket and upload the following files from your config file:
    - the chrom_info file
    - the gene_info file
    - the ref_genome file
4. Use your eRA Commons account to get access to the GTEx data through a Terra-based Google Cloud storage bucket.
5. Copy the following GTEx data from the Terra bucket to your own (steps 2-5 are sketched as shell commands after this list):
    - the VCF and its .tbi index (to the path in your config file)
    - the BAM samples and their .bai indexes (to the map1_sort folder within the config file's output_dir)
6. Run the pipeline!

   ```
   ./run-gcp &
   ```
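Below is a minimal sketch of steps 2-5 as shell commands. It assumes the Life Sciences API is the main API you still need to enable; the credentials path, bucket name, file names, and Terra bucket path are placeholders rather than values from this repository, so substitute your own.

```bash
# Step 2: point Google Application Credentials at your service-account key and
# enable the Life Sciences API (the key path below is a placeholder).
export GOOGLE_APPLICATION_CREDENTIALS=/path/to/service-account-key.json
gcloud services enable lifesciences.googleapis.com

# Step 3: create a storage bucket and upload the files referenced by your
# config (the file names stand in for your chrom_info, gene_info, and
# ref_genome entries).
gsutil mb gs://your-bucket
gsutil cp chromInfo.txt gene_info.txt ref_genome.fa gs://your-bucket/data/

# Step 5: copy the GTEx VCF (with its .tbi index) and the BAMs (with their
# .bai indexes) from the Terra bucket into your own; the destination paths
# should match your config file and its output_dir.
gsutil -m cp "gs://terra-bucket/path/to/gtex.vcf.gz*" gs://your-bucket/data/
gsutil -m cp "gs://terra-bucket/path/to/*.bam*" gs://your-bucket/out/map1_sort/
```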

## Caveats

Some of the steps in our pipeline will not run properly on GCP. We have refrained from changing the pipeline to accommodate these problems because they mostly stem from bugs within Snakemake (or other issues that we expect to improve with time).

| issue | affected rules | workaround |
|-------|----------------|------------|
| `directory()` output | `create_STAR_index` (and all downstream steps) | perform this step manually on your cluster, then reupload to the storage bucket (see the sketch below the table) |
| checkpoints | `vcf_chroms`, `split_vcf_by_chr`, `vcf2h5` | download the VCF to your cluster, perform the affected steps manually, then reupload to the storage bucket; place a `chroms.txt` file where it is missing within the storage bucket to temporarily satisfy Snakemake and silence MissingInputExceptions |
| WASP | `get_WASP`, `install_WASP` | add WASP to your working directory (perhaps under git version control, so that Snakemake knows to upload it to GCP) |
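The manual workarounds for the `directory()` and checkpoint issues might look roughly like this sketch, assuming your cluster has a local clone of the pipeline and a Snakemake version that supports `--until`; the bucket name, output directory, and the expected location of `chroms.txt` are placeholders, so use the values from your config.

```bash
# Run the affected rules locally on your cluster, then reupload their outputs.
BUCKET=gs://your-bucket   # placeholder: your storage bucket
OUT=out                   # assumption: your config's output_dir

# directory() output: build the STAR index locally.
snakemake --use-conda --until create_STAR_index

# checkpoints: run the VCF steps locally as well.
snakemake --use-conda --until vcf_chroms split_vcf_by_chr vcf2h5

# Reupload the locally produced outputs to the storage bucket, and place a
# chroms.txt where Snakemake expects it to silence MissingInputExceptions
# (the expected path is an assumption; adjust it to wherever the file is
# reported missing).
gsutil -m rsync -r "$OUT" "$BUCKET/$OUT"
gsutil cp chroms.txt "$BUCKET/$OUT/chroms.txt"
```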

## Other challenges

- Your rules might run out of disk space or memory. Try raising the default in `run-gcp`. In the future, the default might be smarter.
- The Snakemake documentation recommends using `gcloud beta lifesciences operations describe` to view the stderr of failed steps in the pipeline. Unfortunately, this gives you only the last 10 lines of the stderr. Snakemake also creates files containing the full stderr inside your storage bucket; the path to these files is printed in your local `log/log` file after "Logs will be saved to..." (see the sketch below).
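Both points can be handled roughly as sketched below, assuming `run-gcp` can be edited to forward extra flags to `snakemake`; the resource values, bucket name, and log path are placeholders.

```bash
# Raise the default per-job memory and disk (the values are placeholders; this
# assumes run-gcp passes these flags through to snakemake).
snakemake --google-lifesciences --default-remote-prefix your-bucket \
    --default-resources "mem_mb=16000" "disk_mb=102400" --jobs 10

# Recover the full stderr of a failed job: find the bucket path that Snakemake
# printed in the local log, then read the log file directly from the bucket
# (the exact path below is a placeholder).
grep "Logs will be saved to" log/log
gsutil cat gs://your-bucket/path/printed/above/failed_rule.log
```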

## Planned features

There are a number of cloud-related features for the pipeline that may be in the works. Check them out!

## Other resources