Enable pyfaidx to accept gcp paths for compressed fasta files #161
Hey @archanaraja, thanks for raising this issue and apologies for the late response. I'm not familiar with the Google Cloud Storage APIs and was not planning to implement this. If you can describe your use case in a bit more detail I may be able to help. It looks like Google's Python package implements byte ranges, so assuming the FASTA index file is present, I can imagine pyfaidx reading the FAI into memory and then making ranged calls to GCP for only the specific sequences we need. If the FASTA is not indexed, then pyfaidx would need to stream the entire FASTA and produce an index. That's not very efficient, and it also raises the question of what to do with the newly created FASTA index (do we store it somewhere for re-use, or do we rebuild the index from scratch the next time we initialize?).
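To make the idea above concrete, here is a minimal sketch of how the byte-range approach could work. The `.fai` arithmetic follows the standard FASTA index format (offset, bases per line, bytes per line); the GCS call at the end is an assumption based on the google-cloud-storage `Blob.download_as_bytes(start=, end=)` interface and is not tested here, so treat bucket/blob names as placeholders:

```python
def fai_byte_range(offset, linebases, linewidth, start, end):
    """Map 0-based, half-open sequence coordinates [start, end) to an
    inclusive byte range in the FASTA file, using one .fai entry:
    offset    - byte offset of the first base of the sequence
    linebases - bases per line
    linewidth - bytes per line (bases plus newline characters)
    """
    first = offset + (start // linebases) * linewidth + start % linebases
    last = offset + ((end - 1) // linebases) * linewidth + (end - 1) % linebases
    return first, last

# Hypothetical GCS usage (requires google-cloud-storage and credentials);
# "my-bucket" and "genome.fa" are placeholders, not real resources:
#
# from google.cloud import storage
# blob = storage.Client().bucket("my-bucket").blob("genome.fa")
# first, last = fai_byte_range(offset, linebases, linewidth, 100, 200)
# raw = blob.download_as_bytes(start=first, end=last)
# seq = raw.replace(b"\n", b"").decode()  # strip embedded newlines
```

With this, pyfaidx would only ever transfer the bytes covering the requested region instead of the whole file, which is the same trick samtools uses for remote files.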
Thank you for the prompt response. I was trying to use the toolkit to compute sequence lengths in a file as a QC step. Since that requires indexing and the indexes are not present, it looks like there is no easy fix. All my data is on GCP; the pandas package can read files directly from GCP, so it looks like I should use byte ranges as you suggested to make faidx work.
Thanks for the detailed explanation.
Archana
Hi,
Currently I'm getting an error loading compressed FASTA files from GCP paths. It would be great if this feature were enabled, as it is in pandas. Are there any plans to support it in the near future? Thanks.
Archana