[question] any suggestion about compressing serialized data? #10

Open
toien opened this issue Mar 18, 2022 · 5 comments

Comments


toien commented Mar 18, 2022

Suppose I have multiple clients implemented in different languages (Go, Java, ...), and all of them need to download a RoaringBitmap from a server over HTTP.

When the bitmap gets large (about 10 million values), the serialized binary data comes out at about 20 MB, so I think compressing it before sending could save a lot of transmission time.

I am trying gzip. Any suggestions about compression?

Thanks in advance.

@toien toien changed the title any suggestion about compressing serialized data? [question] any suggestion about compressing serialized data? Mar 18, 2022
Member

lemire commented Mar 18, 2022

Do you get good results with gzip?

I have no experience compressing roaring bitmaps with generic codecs... assuredly, the impact will be data specific...

Related:

Compressing JSON: gzip vs zstd
https://lemire.me/blog/2021/06/30/compressing-json-gzip-vs-zstd/

Author

toien commented Mar 21, 2022

Thanks for the fast reply!

I wrote a test (Java/Go) that populates a RoaringBitmap with random uint32 values, serializes it, and also compresses it with gzip, but it turns out the data is barely compressed at all.

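import java.io.FileOutputStream;
import java.nio.ByteBuffer;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.util.Random;
import java.util.zip.GZIPOutputStream;
import org.roaringbitmap.RoaringBitmap;

// populate with random data
// (size, filepath and compressedFilepath are defined elsewhere in the test)
Random r = new Random();
RoaringBitmap rbm = new RoaringBitmap();

for (int i = 0; i < size; i++) {
  // keep the low 32 bits of a random long, i.e. a uniformly random uint32
  long rValue = r.nextLong() & 0xffffffffL;
  int casted = (int) rValue;
  rbm.add(casted);
}

// dump the serialized bitmap to disk
ByteBuffer buffer = ByteBuffer.allocate(rbm.serializedSizeInBytes());
rbm.serialize(buffer);

Path path = Paths.get(filepath);

Files.write(path, buffer.array());

// compress and dump; FileOutputStream creates or truncates the target file itself,
// so no separate step is needed to empty it first
Path cpath = Paths.get(compressedFilepath);

try (GZIPOutputStream gos = new GZIPOutputStream(new FileOutputStream(cpath.toFile()))) {
  gos.write(buffer.array());
}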

Here is the result:

> ls -alht
-rw-r--r--   1 worker  staff    19M Mar 21 15:59 random-1000w.bin.gz
-rw-r--r--   1 worker  staff    20M Mar 21 15:59 random-1000w.bin

I am trying zstd next.

Author

toien commented Mar 21, 2022

It seems that generic compression is not a good fit for RoaringBitmap:

INFO: lz4 decompressed len: 20501162, compressed len:20574769
INFO: zstd decompressed len: 20501162, compressed len:20415315
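
A back-of-the-envelope check of the 20501162-byte figure, as a minimal sketch. It assumes the 10 million values are uniform over the full uint32 range, that all 65536 chunks end up as array containers (about 150 values each), and that the layout follows the RoaringFormatSpec header without run containers. The class and method names are made up for illustration.

```java
public class RoaringSizeEstimate {
    // Expected number of distinct values after n uniform draws from a universe of size u.
    static double expectedDistinct(double n, double u) {
        return u * (1.0 - Math.exp(-n / u));
    }

    // Rough serialized size, ASSUMING every one of the 65536 chunks is an array container.
    static long estimateBytes(long draws) {
        double universe = 4294967296.0;   // 2^32 possible uint32 values
        long containers = 65536;          // one container per distinct high 16 bits
        long header = 4 + 4               // cookie + container count
                + containers * 4          // 16-bit key + 16-bit (cardinality - 1) per container
                + containers * 4;         // 32-bit offset per container
        // Each stored value costs 2 bytes: its low 16 bits.
        long payload = Math.round(expectedDistinct(draws, universe)) * 2;
        return header + payload;
    }

    public static void main(String[] args) {
        System.out.println("estimated bytes: " + estimateBytes(10_000_000L));
    }
}
```

Under these assumptions the estimate should land within about a kilobyte of the reported size. Nearly all of it is the low 16 bits of random values, which leaves a generic codec very little redundancy to find.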

Member

lemire commented Mar 21, 2022

Interestingly, it looks like lz4 makes things worse in your test!
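
The same effect can be reproduced without any bitmap library. The sketch below uses gzip from the Java standard library rather than lz4 (a substitution, since lz4 needs a third-party dependency) and compresses 1 MiB of random bytes and 1 MiB of zeros. The class name, sizes, and seed are arbitrary choices for illustration.

```java
import java.io.ByteArrayOutputStream;
import java.io.IOException;
import java.io.UncheckedIOException;
import java.util.Random;
import java.util.zip.GZIPOutputStream;

public class IncompressibleDemo {
    // Gzip a byte array in memory and return the compressed bytes.
    static byte[] gzip(byte[] input) {
        ByteArrayOutputStream out = new ByteArrayOutputStream();
        try (GZIPOutputStream gos = new GZIPOutputStream(out)) {
            gos.write(input);
        } catch (IOException e) {
            throw new UncheckedIOException(e);
        }
        return out.toByteArray();
    }

    // Uniformly random bytes; the seed only keeps runs repeatable.
    static byte[] randomBytes(int n, long seed) {
        byte[] b = new byte[n];
        new Random(seed).nextBytes(b);
        return b;
    }

    public static void main(String[] args) {
        byte[] random = randomBytes(1 << 20, 42L);
        byte[] zeros = new byte[1 << 20];
        System.out.println("random: " + random.length + " -> " + gzip(random).length);
        System.out.println("zeros:  " + zeros.length + " -> " + gzip(zeros).length);
    }
}
```

When the input has no redundancy, the codec can only store it more or less verbatim and still has to add its own headers and block framing, so the output comes out slightly larger than the input. The zero-filled buffer shrinks to a tiny fraction of its size.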


derlaft commented Jul 25, 2023

RoaringFormatSpec : specification of the compressed-bitmap Roaring formats

Roaring bitmaps are already a form of compression. The entropy of the serialized data should therefore already be fairly close to the maximum (you can check this with an entropy graph, for example using binwalk), so compressing it again won't yield a significant gain.
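
If binwalk is not at hand, a rough version of the same check is to compute the byte-level Shannon entropy of the serialized file. This is a minimal sketch with made-up class and method names. It only looks at single-byte frequencies, so a value near 8 bits per byte is a hint that a generic codec will not help, not a proof.

```java
import java.util.Random;

public class ByteEntropy {
    // Shannon entropy of the byte histogram, in bits per byte (0.0 to 8.0).
    static double bitsPerByte(byte[] data) {
        if (data.length == 0) {
            return 0.0;
        }
        long[] counts = new long[256];
        for (byte b : data) {
            counts[b & 0xff]++;   // map the signed byte to 0..255
        }
        double entropy = 0.0;
        for (long c : counts) {
            if (c == 0) {
                continue;         // unused byte values contribute nothing
            }
            double p = (double) c / data.length;
            entropy -= p * (Math.log(p) / Math.log(2));
        }
        return entropy;
    }

    public static void main(String[] args) {
        byte[] random = new byte[1 << 20];
        new Random(7L).nextBytes(random);   // stand-in for a serialized bitmap of random values
        System.out.println("random bytes: " + bitsPerByte(random) + " bits/byte");
        System.out.println("all zeros:    " + bitsPerByte(new byte[1 << 20]) + " bits/byte");
    }
}
```

To apply it to a real bitmap, pass in the array the test above writes to disk (buffer.array()). The test in this thread used random values, so the result should be close to 8. Clustered or sequential data may behave differently.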
