Enable pyfaidx to accept gcp paths for compressed fasta files #161
Hey @archanaraja, thanks for raising this issue and apologies for the late response. I'm not familiar with the Google Cloud Storage APIs and was not planning to implement this. If you can describe your use case in a bit more detail I may be able to help. It looks like Google's Python package implements byte ranges, so assuming the FASTA index file is present, I can imagine pyfaidx reading the FAI into memory and then making ranged calls to GCP for only the specific sequences we need. If the FASTA is not indexed, then pyfaidx would need to stream the entire FASTA and produce an index. That's not very efficient, and it also raises the question of what to do with the newly created FASTA index (do we store it somewhere for re-use, or do we rebuild the index from scratch the next time we initialize?).
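To make the idea above concrete, here is a minimal sketch of how the byte-range approach could work. The `.fai` arithmetic follows the standard FASTA index format (offset, bases per line, bytes per line); the GCS call at the end is an assumption based on the google-cloud-storage `Blob.download_as_bytes(start=, end=)` interface and is not tested here, so treat bucket/blob names as placeholders:

```python
def fai_byte_range(offset, linebases, linewidth, start, end):
    """Map 0-based, half-open sequence coordinates [start, end) to an
    inclusive byte range in the FASTA file, using one .fai entry:
    offset    - byte offset of the first base of the sequence
    linebases - bases per line
    linewidth - bytes per line (bases plus newline characters)
    """
    first = offset + (start // linebases) * linewidth + start % linebases
    last = offset + ((end - 1) // linebases) * linewidth + (end - 1) % linebases
    return first, last

# Hypothetical GCS usage (requires google-cloud-storage and credentials);
# "my-bucket" and "genome.fa" are placeholders, not real resources:
#
# from google.cloud import storage
# blob = storage.Client().bucket("my-bucket").blob("genome.fa")
# first, last = fai_byte_range(offset, linebases, linewidth, 100, 200)
# raw = blob.download_as_bytes(start=first, end=last)
# seq = raw.replace(b"\n", b"").decode()  # strip embedded newlines
```

With this, pyfaidx would only ever transfer the bytes covering the requested region instead of the whole file, which is the same trick samtools uses for remote files.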
Thank you for the prompt response. I was trying to use the toolkit to compute sequence lengths in a file as a QC step. Since that requires indexing and the indexes are not present, it looks like there is no easy fix. All my data is on GCP; the pandas package can read files directly from GCP, so it looks like I should use byte ranges as you suggested to make faidx work.
Thanks for the detailed explanation.
Archana
Hi,
Currently I'm getting an error loading compressed FASTA files from GCP paths. It would be great if this feature were enabled, as it is in pandas. Are there any plans to support it in the near future? Thanks.
Archana