Performance on large files - avoid spilling to disk #4
Looking through the source code and the specification document, I've noticed that both compression and decompression spill to disk for large files. This is particularly problematic in the decompression scenario because of the high temporary disk usage.

Have you considered extending the file format to support multiple blocks? For example:

Header = format descriptor, format version, sequence type, flags, name separator, line length
DataBlock = Number of sequences, IDs, Comments, Lengths, Mask, Sequence, Quality

And the overall structure:

Header, Title, [DataBlock]+

Then you could stream NAF files with no disk usage and a fixed memory overhead. There is a slight compression penalty to having multiple data blocks, but it will be negligibly small for large blocks. Both BAM and CRAM use variants of this blocked compression approach.
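To make the streaming claim concrete, here is a minimal sketch of a blocked writer and reader with a fixed memory footprint. Everything in it is invented for illustration: the magic value, the one-byte version and flags fields, the 4-byte little-endian size prefixes, and the use of zlib are placeholders, not the NAF encoding (NAF itself is built on zstd).

```python
import io
import struct
import zlib

MAGIC = b"NAFX"  # placeholder magic, not the real NAF format descriptor

def write_blocks(out, records, block_size=1 << 20):
    """Write records as a small header followed by independently compressed blocks."""
    out.write(MAGIC + struct.pack("<BB", 2, 0))  # version, flags (made-up fields)
    buf = io.BytesIO()

    def flush():
        payload = zlib.compress(buf.getvalue())
        out.write(struct.pack("<I", len(payload)))  # compressed-size prefix
        out.write(payload)
        buf.seek(0)
        buf.truncate()

    for rec in records:  # rec: bytes of one serialized sequence record
        buf.write(struct.pack("<I", len(rec)))
        buf.write(rec)
        if buf.tell() >= block_size:
            flush()
    if buf.tell():
        flush()

def read_blocks(inp):
    """Stream records back, holding at most one decompressed block in memory."""
    header = inp.read(6)
    assert header[:4] == MAGIC
    while True:
        prefix = inp.read(4)
        if not prefix:
            return  # clean end of file
        (size,) = struct.unpack("<I", prefix)
        payload = zlib.decompress(inp.read(size))
        pos = 0
        while pos < len(payload):
            (n,) = struct.unpack_from("<I", payload, pos)
            yield payload[pos + 4 : pos + 4 + n]
            pos += 4 + n
```

Because each block carries its own compressed-size prefix, a reader can also skip whole blocks without decompressing them, which is the seek primitive that the random-access idea in the comments below would need; an index mapping sequences or coordinate ranges to block offsets would complete it.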
Comments

Decompression does not spill to disk. If it did, it would be problematic, as you say. Compression does use temporary disk storage, though, which may not be ideal. Indeed, I am considering extending the format to support multiple blocks. My main reason for this is not to avoid disk usage in compression, which I see as acceptable. More importantly, a multi-block format will enable faster random access, i.e., partial decompression of just the specified sequences or coordinate range. This has to be carefully designed and harmonized with other planned features, so it may take me a while. But I'm very much in favor of extending the format in that direction.
You could set --temp-dir to /dev/shm/ to effectively write temporary files to RAM and avoid additional IO. Interesting discussion on how to proceed with random access. I may need this for something else :)
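For concreteness, here is a small, hypothetical sketch of applying this tip from a pipeline script. Only the --temp-dir option comes from this thread; the ennaf input and output arguments are assumptions, so check `ennaf --help` for the real syntax.

```python
import os
import subprocess

# Prefer RAM-backed /dev/shm for temporary files when it exists;
# otherwise let ennaf use its default temporary directory.
cmd = ["ennaf", "input.fasta", "-o", "input.naf"]  # hypothetical arguments
if os.path.isdir("/dev/shm"):
    cmd += ["--temp-dir", "/dev/shm"]  # the option discussed in this thread
subprocess.run(cmd, check=True)
```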
@yhoogstrate I somehow missed this comment. It's a good idea to use /dev/shm/ when possible. I added it to the manual (https://github.com/KirillKryukov/naf/blob/develop/Compress.md#temporary-storage). Thanks!
I will try to make a PR resolving this soon. |