chore(dataobj): add bitmap encoding
This commit adds bitmap encoding, the third and final type of encoding
needed for the data object prototype.

Bitmap encoding efficiently stores sequences of uint64 values as a
combination of RLE runs and bitpacked runs. RLE runs are long sequences
of the same value, while bitpacked runs are made up of sets of 8 values,
each set packed together into the smallest possible bit width.
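
To illustrate the bitpacking idea, here is a minimal sketch of how one
set of 8 values could be packed. It is not the code from this commit:
the names setWidth and packSet are hypothetical, and the LSB-first bit
order and the minimum width of 1 are assumptions borrowed from Parquet's
convention rather than confirmed details of this format.

```go
package main

import (
	"fmt"
	"math/bits"
)

// setWidth returns the smallest bit width that can hold every value in a
// set of 8 values. Hypothetical helper; the minimum width of 1 is an
// assumption of this sketch.
func setWidth(set [8]uint64) int {
	var all uint64
	for _, v := range set {
		all |= v // the highest set bit across all values decides the width
	}
	w := bits.Len64(all)
	if w == 0 {
		w = 1
	}
	return w
}

// packSet packs 8 values at the given width into exactly width bytes,
// least significant bit first (assumed bit order).
func packSet(set [8]uint64, width int) []byte {
	out := make([]byte, width) // 8 values * width bits = width bytes
	bit := 0
	for _, v := range set {
		for i := 0; i < width; i++ {
			if v&(1<<uint(i)) != 0 {
				out[bit/8] |= 1 << uint(bit%8)
			}
			bit++
		}
	}
	return out
}

func main() {
	set := [8]uint64{0, 1, 2, 3, 4, 5, 6, 7}
	w := setWidth(set)
	fmt.Printf("width=%d packed=% x\n", w, packSet(set, w))
	// prints "width=3 packed=88 c6 fa"
}
```

Because a set always holds 8 values, a set of width w occupies exactly w
bytes, which keeps bitpacked sets byte-aligned.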

Bitmap encoding is based on the RLE encoding format used by Parquet,
with some notable changes to facilitate streaming writes:

- Our bitmap encoding doesn't use a fixed width for values. Instead, the
  width is determined upon flushing a bitpacked set. Bitpacked sets of
  the same width are then combined into a single run.

  This comes at the cost of an extra byte per bitpacked run.

- As values are streamed, the final length of the bitmap isn't included
  in the bitmap header. Callers can choose to prepend the length by
  writing the bitmap into a separate buffer and then writing a custom
  header. Without this, readers must know the exact number of encoded
  values so that they do not read past the end of the RLE sequence.

This code is unfortunately quite complex. I've tried to add comments for
as much as I could, but if there's an easier way to do the bitpacking, I
would love to move over to that.
rfratto committed Jan 7, 2025
1 parent 9a21590 commit d83c47f
Showing 4 changed files with 886 additions and 10 deletions.