Commit
This commit adds bitmap encoding, the third and final type of encoding needed for the data object prototype.

Bitmap encoding efficiently stores sequences of uint64 values as a combination of RLE runs and bitpacked runs. RLE runs are long sequences of the same value, while bitpacked runs are sets of 8 values packed together into the smallest possible bit width.

Bitmap encoding is based on the RLE encoding format used by Parquet, with some notable changes to facilitate streaming writes:

- Our bitmap encoding doesn't use a fixed width for values. Instead, the width is determined upon flushing a bitpacked set. Bitpacked sets of the same width are then combined into a single run. This comes at the cost of an extra byte per bitpacked run.

- As values are streamed, the final length of the bitmap isn't included in the bitmap header. Callers can choose to prepend the length by writing the bitmap into a separate buffer and then writing a custom header. Without this, readers must know the exact number of encoded values to avoid reading past the end of the RLE sequence.

This code is unfortunately quite complex. I've tried to add comments for as much as I could, but if there's an easier way to do the bitpacking, I would love to move over to that.
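To illustrate the run structure described above, here is a minimal Python sketch. It is not the code from this commit and does not reproduce its byte layout: runs are kept as plain tuples instead of being written behind a header, and every name (`min_bit_width`, `pack_set`, `encode_runs`, `decode_runs`) is hypothetical. The RLE threshold of 8 repeats and the zero padding of a final partial set are assumptions made for the example.

```python
def min_bit_width(values):
    """Smallest bit width that can hold every value in the set (at least 1)."""
    return max(1, max(values).bit_length())


def pack_set(values, width):
    """Pack exactly 8 values into `width` bytes (8 values * width bits)."""
    assert len(values) == 8
    acc = 0
    for i, v in enumerate(values):
        acc |= v << (i * width)
    return acc.to_bytes(width, "little")


def unpack_set(data, width):
    """Inverse of pack_set: recover the 8 values from `width` bytes."""
    acc = int.from_bytes(data, "little")
    mask = (1 << width) - 1
    return [(acc >> (i * width)) & mask for i in range(8)]


def _flush_set(runs, values):
    """Pick the width at flush time; merge into the previous run if widths match."""
    width = min_bit_width(values)
    packed = pack_set(values, width)
    if runs and runs[-1][0] == "bitpacked" and runs[-1][1] == width:
        runs[-1] = ("bitpacked", width, runs[-1][2] + packed)
    else:
        runs.append(("bitpacked", width, packed))


def encode_runs(values, min_rle=8):
    """Split values into ("rle", value, count) and ("bitpacked", width, bytes) runs."""
    runs, pending = [], []
    i, n = 0, len(values)
    while i < n:
        if not pending:
            # Only start an RLE run on a set boundary (assumption of this sketch).
            j = i
            while j < n and values[j] == values[i]:
                j += 1
            if j - i >= min_rle:
                runs.append(("rle", values[i], j - i))
                i = j
                continue
        pending.append(values[i])
        i += 1
        if len(pending) == 8:
            _flush_set(runs, pending)
            pending = []
    if pending:
        # Pad the final partial set with zeros (assumption of this sketch).
        _flush_set(runs, pending + [0] * (8 - len(pending)))
    return runs


def decode_runs(runs, count):
    """Decode runs; `count` is required because padding is indistinguishable from data."""
    out = []
    for run in runs:
        if run[0] == "rle":
            out.extend([run[1]] * run[2])
        else:
            _, width, data = run
            for off in range(0, len(data), width):
                out.extend(unpack_set(data[off:off + width], width))
    return out[:count]
```

In this sketch the decoder takes an explicit `count`, which mirrors the caveat above: with no length in the header, the reader has to know how many values were encoded so that it can discard any padding in the last bitpacked set.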