We offer preliminary cloud support for the WASP and counts pipelines. Follow these instructions to run those pipelines on GTEx data.
This setup generally follows Snakemake's Google Life Sciences Tutorial.
- Set up the config file according to the directions in the WASP README
- Add your Google Application Credentials and enable the requisite APIs
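Concretely, the credential and API setup might look like the sketch below. The key path is an assumption (yours will differ), and the `gcloud` command is prefixed with `echo` so the sketch is safe to paste; drop the `echo` to actually run it.

```shell
# Point Snakemake (and the Google client libraries) at your
# service-account key. The path is an assumption -- use your own.
export GOOGLE_APPLICATION_CREDENTIALS="$HOME/gcp-key.json"

# Snakemake's Google Life Sciences backend needs these APIs enabled
# on your project. Drop 'echo' to actually enable them.
echo gcloud services enable \
    lifesciences.googleapis.com \
    storage-api.googleapis.com \
    compute.googleapis.com
```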
- Create a cloud storage bucket and upload the following from your config file
  - the `chrom_info` file
  - the `gene_info` file
  - the `ref_genome` file
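A sketch of the bucket creation and upload, assuming a bucket named `gs://my-wasp-bucket` and illustrative filenames — substitute the actual `chrom_info`, `gene_info`, and `ref_genome` paths from your config file. The `gsutil` commands are echoed so the sketch is inert; drop the `echo` to actually copy.

```shell
# Placeholder bucket name -- substitute your own.
bucket="gs://my-wasp-bucket"

# Create the bucket once. Drop 'echo' to actually run gsutil.
echo gsutil mb "$bucket"

# The filenames below are illustrative; use the real paths named
# in your config file.
for f in chromInfo.txt gene_info.txt ref_genome.fa; do
    echo gsutil cp "$f" "$bucket/"
done
```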
- Use your eRA Commons account to get access to the GTEx data through a Terra-based Google Cloud storage bucket
- Copy the following GTEx data from the Terra bucket to your own:
  - the VCF and its `.tbi` index (to the path in your config file)
  - the BAM samples and their `.bai` indices (to the `map1_sort` folder within the config file's `output_dir`)
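A sketch of the copy step. Both bucket names and all paths below are made up — substitute your Terra workspace bucket, your own bucket, and the VCF path and `output_dir` from your config file. The commands are echoed so the sketch is inert.

```shell
terra="gs://fc-secure-example"   # hypothetical Terra workspace bucket
mine="gs://my-wasp-bucket"       # your bucket from the previous step

# The VCF and its index go to the path named in your config file;
# the BAMs and their indices go to map1_sort/ under output_dir.
# Drop 'echo' to actually copy.
echo gsutil cp "$terra/GTEx.vcf.gz" "$terra/GTEx.vcf.gz.tbi" "$mine/data/"
echo gsutil cp "$terra/bams/*.bam" "$terra/bams/*.bai" "$mine/output_dir/map1_sort/"
```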
- Run the pipeline: `./run-gcp &`
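We haven't reproduced `run-gcp` here; under the hood, a Snakemake Google Life Sciences launch typically looks something like the sketch below. The bucket name and region are placeholders, and the exact flags inside `run-gcp` may differ.

```shell
bucket="my-wasp-bucket"   # placeholder: bucket name without the gs:// prefix
region="us-central1"      # placeholder: pick a region close to your data

# Drop 'echo' to actually launch; -j bounds the number of concurrent jobs.
echo snakemake \
    --google-lifesciences \
    --google-lifesciences-region "$region" \
    --default-remote-prefix "$bucket" \
    --use-conda -j 32
```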
Some of the steps within our pipeline will not run properly on GCP. We have refrained from changing our pipeline to accommodate these problems because they are mostly related to bugs within Snakemake (or other things that we expect to improve with time).
| issue | affected rules | workaround |
|---|---|---|
| `directory()` output | `create_STAR_index` (and all downstream steps) | perform this step manually on your cluster, then reupload to the storage bucket |
| checkpoints | `vcf_chroms`, `split_vcf_by_chr`, `vcf2h5` | download the VCF to your cluster, perform the affected steps manually, then reupload to the storage bucket; place a `chroms.txt` file where it is missing within the storage bucket to temporarily satisfy Snakemake and silence `MissingInputException`s |
| WASP | `get_WASP`, `install_WASP` | add WASP to your working directory (perhaps under git version control, so that Snakemake knows to upload it to GCP) |
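The `chroms.txt` placeholder for the checkpoints workaround can be generated in one line. The GRCh38-style chromosome names below are an assumption — list whatever chromosomes your VCF actually uses — and the destination path is a placeholder for wherever Snakemake reports the file missing.

```shell
# Write one chromosome name per line (assumed naming; match your VCF).
printf 'chr%s\n' $(seq 1 22) X > chroms.txt

# Placeholder destination -- use the path from the MissingInputException.
# Drop 'echo' to actually upload.
echo gsutil cp chroms.txt gs://my-wasp-bucket/path/from/the/exception/chroms.txt
```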
- Your rules might run out of disk space or memory. Try raising the default in `run-gcp`. In the future, the default might be smarter.
- The Snakemake documentation recommends using `gcloud beta lifesciences operations describe` to view the stderr of failed steps in the pipeline. Unfortunately, this will only give you the last 10 lines of the stderr. Snakemake will create files containing the full stderr inside your storage bucket. The path to these files will be specified in your local `log/log` file after "Logs will be saved to..."
There are a number of cloud-related features for the pipeline that might be in the works. Check them out!
- There is not a lot of documentation available to explain how Snakemake interacts with the Life Sciences API and Compute Engine. These slides might help.
- Documentation for the Cloud Life Sciences API
- A script that can help you debug your jobs in real time