CGHub download #38

Open
wants to merge 9 commits into
base: master
jfeala
Contributor

@jfeala jfeala commented Apr 20, 2015

Hi Uri

Here is a bit of code to follow up on our emails. This is untested and not ready for merge, but I wanted to get your feedback before continuing.

CGHub is hosting > 1200 public BAM files from the Cancer Cell Line Encyclopedia that are available only through their GeneTorrent download client. TCGA BAM files, while not available publicly, can be downloaded using the same framework for anyone with an authorized key file.

The dependencies for installing GeneTorrent are tricky on CentOS so that part would require the most attention. The other changes are straightforward.

I invented a URL of the form cghub://<analysis_id>/<filename> to distinguish these downloads, but we should think about whether there is a better way.

Jake

@jfeala
Contributor Author

jfeala commented Apr 20, 2015

... and a little background: https://cghub.ucsc.edu/datasets/ccle.html

@fnothaft
Member

+1

{
"name": "ccle-wgs",
"description": "Cancer Cell Line Encyclopedia whole genome sequencing",
"target": "ccle/wgs",
Member

Nit: We've updated the format recently and target is no longer used (see https://github.com/bigdatagenomics/eggo/blob/master/docs/spec.md), so this line can be omitted. (I still need to update the other files in the registry.)

@tomwhite
Member

This looks good to me. It might be helpful to add a file in test/registry to make it easy to try out on a single file from CGHub.

@jfeala
Contributor Author

jfeala commented Apr 23, 2015

I updated the registries according to your suggestions, and added the other public dataset available on CGHub, a benchmarking dataset published for the purpose of testing mutation callers. Unfortunately the smallest file from both studies is a ~5Gb RNA-seq BAM, so I used that one as the test registry.

Still testing the ETL code, so this is not ready for merge yet.

@jfeala
Contributor Author

jfeala commented Apr 24, 2015

Ready for merge. GeneTorrent installation and CGHub download functions are tested. I couldn't get the full luigi DAG working, but it seems like that's a work in progress.

cghub_key = os.environ.get('CGHUB_KEY') or CGHUB_PUBLIC_KEY

# 2. Parse url for analysis ID and filename
analysis_id, filename = url.lstrip('cghub://').split('/')
Contributor

Does the split ever produce more than two objects?

Contributor

analysis_id is a CGHub concept?

Contributor Author

Yes, the CGHub metadata store centers around the analysis_id, which refers to a single downloadable object (which may contain multiple files). I made up the cghub URL to fit the existing registry structure, so it is easy to change. Right now I create each URL with only the analysis ID and a single filename of interest (generally a BAM file), so the split always produces exactly 2 objects.

However, the CGHub REST API returns JSON with lots of metadata for a given analysis_id. One option would be to store this full JSON in the registry, though it would be long, cluttered, and less human-readable. Alternatively, we could store just the analysis ID and have the code call the API to get the filename, file size, and other metadata if necessary.
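
A note on the parsing line under review: `str.lstrip` removes any leading characters that appear in its argument, not a literal prefix, so an analysis ID that begins with one of the characters in `cghub://` (for example `b` or `c`) would lose them. The following is a minimal sketch of a stricter parse, assuming the cghub://<analysis_id>/<filename> form described above; the function name `parse_cghub_url` and the example ID are hypothetical and not part of the PR.

```python
CGHUB_PREFIX = "cghub://"  # the made-up scheme discussed in this thread


def parse_cghub_url(url):
    """Return (analysis_id, filename) from cghub://<analysis_id>/<filename>.

    Hypothetical helper: raises ValueError instead of silently
    mis-parsing when the URL does not have exactly two parts.
    """
    if not url.startswith(CGHUB_PREFIX):
        raise ValueError("not a cghub URL: %r" % url)
    # Slice off the literal prefix; lstrip() would strip a character set.
    parts = url[len(CGHUB_PREFIX):].split("/")
    if len(parts) != 2 or not all(parts):
        raise ValueError("expected cghub://<analysis_id>/<filename>: %r" % url)
    return parts[0], parts[1]


# Illustration of the pitfall with a made-up ID that starts with 'b':
# "cghub://bb12/x.bam".lstrip("cghub://") gives "12/x.bam", not "bb12/x.bam".
analysis_id, filename = parse_cghub_url("cghub://bb12/x.bam")
```

With this shape, a URL that ever grows a third path component fails loudly rather than raising an unpacking error deep in the download code.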

Jake Feala added 3 commits May 1, 2015 22:57
Standardize behavior of cghub, ftp, and http download functions
- flatten the cghub download directory and omit return value
- rename http_download to curl download to capture ftp use case
Also add “editions” field to test-cghub.json
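
For readers following the commit message above, here is a rough sketch of how a registry URL might be routed once the cghub, ftp, and http downloaders share one calling convention. The function `pick_downloader` and the labels it returns are assumptions made for illustration; they are not taken from the eggo code.

```python
from urllib.parse import urlparse


def pick_downloader(url):
    """Return a label naming which downloader should handle `url`.

    Hypothetical routing: cghub:// URLs go to the GeneTorrent-based
    download, while http, https, and ftp URLs share the curl-based one.
    """
    scheme = urlparse(url).scheme
    if scheme == "cghub":
        return "cghub"
    if scheme in ("http", "https", "ftp"):
        return "curl"
    raise ValueError("unsupported URL scheme: %r" % url)
```

Keeping the routing in one place means each downloader only needs to agree on its arguments, which fits the commit's choice to omit the return value.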
@jfeala
Contributor Author

jfeala commented May 2, 2015

OK, fixed according to your suggestions (although the awscli globbing issue is now moot after recent updates to master). Let me know if you prefer a rebase.

4 participants