Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

BGZip slow performance near end of chromosomes #153

Open
davmlaw opened this issue Jul 26, 2019 · 3 comments
Open

BGZip slow performance near end of chromosomes #153

davmlaw opened this issue Jul 26, 2019 · 3 comments

Comments

@davmlaw
Copy link

davmlaw commented Jul 26, 2019

It can take over a minute to retrieve a few bases:

$ time faidx --no-rebuild Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz 7:117199563-117199564
>7:117199563-117199564
GG

real	1m8.888s
user	0m31.042s
sys	0m37.785s

Low coordinates are fine:

$time faidx --no-rebuild Homo_sapiens.GRCh37.75.dna.primary_assembly.fa.gz 7:10000-11000 > /dev/null

real	0m0.295s
user	0m0.241s
sys	0m0.056s

You said in a previous issue:

there is still a large performance penalty for fetching small substrings near the end of a record, and I'll open an issue to remind me to explore a solution.

I can't find that issue, so am raising this one. Good luck!

@Maarten-vd-Sande
Copy link
Contributor

Seems that the lookup time just scales with the distance from the "start" of a contig. I just quickly scanned the internals, can't say I fully understand, but it seems that this is due to the way bgzip is implemented in biopython:

https://github.com/biopython/biopython/blob/master/Bio/bgzf.py#L699

It seems to read the whole part before the contig you need...?

@mdshw5
Copy link
Owner

mdshw5 commented Feb 11, 2022

@Maarten-vd-Sande this is definitely not due to the Bio.bgzf implementation and is definitely due to my incomplete implementation of virtual offset calculations from the start of each contig. I started work to fully support using the .gzi sidecar files in #164, but have not taken the time to complete the work. From the .fai index we can know how many bases (characters) to skip, and can seek directly to the requested region in an uncompressed FASTA. For BGZF compressed FASTA in current pyfaidx implementation can know which BGZF block to seek to the beginning of a contig, but without implementing logic to incorporate the .gzi (which tells us the internal BGZF block structure of a contig) the safest thing we can do is start reading from the beginning of the sequence. This is not ideal. The alternative is to seek to the BGZF block nearest to the coordinate of interest, and then start reading from the beginning of that block. This is the relevant code that I wrote 1.5 years ago:

pyfaidx/pyfaidx/__init__.py

Lines 766 to 776 in f878775

if self._bgzf:
from Bio import bgzf
insertion_i = bisect.bisect_left(self.gzi_index, bstart)
if insertion_i == 0: # bisect_left already returns index to the left if values are the same
start_block = self.gzi_index[insertion_i]
else:
start_block = self.gzi_index[insertion_i - 1]
within_block_offset = start_block.ustart + bstart
print(self.gzi_index)
print(locals())
self.file.seek(bgzf.make_virtual_offset(start_block.cstart, within_block_offset))

You can see that I was still trying to figure out how this works, and never was able to make an entire round-trip (read a .gzi file and construct and write an identical .gzi file) without some slight errors.

@Maarten-vd-Sande
Copy link
Contributor

@mdshw5 thanks for the reply, that makes sense! I guess I'll just load the whole fasta in memory for now 😄

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

3 participants