Performance on large files - avoid spilling to disk #4
Looking through the source code and the specification document, I've noticed that both compression and decompression spill to disk for large files. This is particularly problematic in the decompression scenario because of the high temporary disk usage.

Have you considered extending the file format to support multiple blocks? For example:

Header = format descriptor, format version, sequence type, flags, name separator, line length
DataBlock = Number of sequences, IDs, Comments, Lengths, Mask, Sequence, Quality

And the overall structure:

Header, Title, [DataBlock]+

Then you could stream NAF files with no disk usage and a fixed memory overhead. There is a slight compression penalty to having multiple data blocks, but it will be negligibly small for large blocks. Both BAM and CRAM use variants of this blocked compression approach.
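To make the streaming claim concrete, here is a minimal sketch of a blocked writer and reader with a fixed memory footprint. Everything in it is invented for illustration: the magic value, the one-byte version and flags fields, the 4-byte little-endian size prefixes, and the use of zlib are placeholders, not the NAF encoding (NAF itself is built on zstd).

```python
import io
import struct
import zlib

MAGIC = b"NAFX"  # placeholder magic, not the real NAF format descriptor

def write_blocks(out, records, block_size=1 << 20):
    """Write records as a small header followed by independently compressed blocks."""
    out.write(MAGIC + struct.pack("<BB", 2, 0))  # version, flags (made-up fields)
    buf = io.BytesIO()

    def flush():
        payload = zlib.compress(buf.getvalue())
        out.write(struct.pack("<I", len(payload)))  # compressed-size prefix
        out.write(payload)
        buf.seek(0)
        buf.truncate()

    for rec in records:  # rec: bytes of one serialized sequence record
        buf.write(struct.pack("<I", len(rec)))
        buf.write(rec)
        if buf.tell() >= block_size:
            flush()
    if buf.tell():
        flush()

def read_blocks(inp):
    """Stream records back, holding at most one decompressed block in memory."""
    header = inp.read(6)
    assert header[:4] == MAGIC
    while True:
        prefix = inp.read(4)
        if not prefix:
            return  # clean end of file
        (size,) = struct.unpack("<I", prefix)
        payload = zlib.decompress(inp.read(size))
        pos = 0
        while pos < len(payload):
            (n,) = struct.unpack_from("<I", payload, pos)
            yield payload[pos + 4 : pos + 4 + n]
            pos += 4 + n
```

Because each block carries its own compressed-size prefix, a reader can also skip whole blocks without decompressing them, which is the seek primitive that the random-access idea in the comments below would need; an index mapping sequences or coordinate ranges to block offsets would complete it.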
Comments

Decompression does not spill to disk. If it did, it would be problematic, as you say. Compression does use temporary disk storage, though, which may not be ideal. Indeed, I am considering extending the format to support multiple blocks. My main reason for this is not to avoid disk usage in compression, which I see as acceptable. More importantly, a multi-block format will enable faster random access, i.e., partial decompression of just the specified sequences or coordinate range. This has to be carefully designed and harmonized with other planned features, so it may take me a while. But I'm very much in favor of extending the format in that direction.
You could set --temp-dir to /dev/shm/ to effectively write temporary files to RAM and avoid additional IO. Interesting discussion on how to proceed with random access. I may need this for something else :)
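For concreteness, here is a small, hypothetical sketch of applying this tip from a pipeline script. Only the --temp-dir option comes from this thread; the ennaf input and output arguments are assumptions, so check `ennaf --help` for the real syntax.

```python
import os
import subprocess

# Prefer RAM-backed /dev/shm for temporary files when it exists;
# otherwise let ennaf use its default temporary directory.
cmd = ["ennaf", "input.fasta", "-o", "input.naf"]  # hypothetical arguments
if os.path.isdir("/dev/shm"):
    cmd += ["--temp-dir", "/dev/shm"]  # the option discussed in this thread
subprocess.run(cmd, check=True)
```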
@yhoogstrate I somehow missed this comment. It's a good idea to use /dev/shm/ when possible. I added it to the manual (https://github.com/KirillKryukov/naf/blob/develop/Compress.md#temporary-storage). Thanks!
I will try to make a PR resolving this soon. |