Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add roaring_bitmap_memory_size_in_bytes(), with C++ interfaces #409

Open
wants to merge 1 commit into
base: master
Choose a base branch
from

Conversation

madscientist
Copy link
Contributor

No one commented on issue #406 so I just created this. We definitely need this, but not sure if it's too specific a need for others.

The new roaring_bitmap_memory_size_in_bytes() function returns the number of in-memory bytes currently used by this Roaring bitmap.

Add getMemorySizeInBytes() methods to the C++ Roaring and Roaring64Map classes. Note that the Roaring64Map result is somewhat guesswork since we can't accurately compute the memory used by the STL std::map implementation.

The new roaring_bitmap_memory_size_in_bytes() function returns the
number of in-memory bytes currently used by this Roaring bitmap.

Add getMemorySizeInBytes() methods to the C++ Roaring and Roaring64Map
classes.  Note that the Roaring64Map result is somewhat guesswork since
we can't accurately compute the memory used by the STL std::map
implementation.
@kosak
Copy link
Contributor

kosak commented Nov 15, 2022

Just adding my 2 cents here: f it starts to feel like the API is getting too unwieldy (hard to discover, too many entry points, etc), there are a couple of different approaches to use.

One approach is to move things like this into free functions with a separate header. So (for C++) you'd do something like

roaring64_diagnostics.h:

class Roaring64Diagnostics {
public:
  static size_t getMemorySizeInBytes(const Roaring64Map &map) const { ... }
};

and then this can be a friend of Roaring64Map but regular users don't need to see it.

And for C you would use related techniques to move the implementation into their own diagnostics.h and diagnostics.c files.

The other approach is to write one kind of "visitor" function that iterates over the internals of both the C++ stuff and C stuff and calls back to a user callback that can do whatever it wants (For your case, that would be counting bytes).

Just running some ideas up the flagpole here in case people are concerned about growth of the API surface.

// A common red/black tree implementation has 3 pointers plus 2 ints
// per element, plus the size of the pair. The size of the Roaring
// struct is included in roarings.getMemorySizeInBytes() so remove it.
constexpr size_t perEntry = 3 * sizeof(void*) + 2 * sizeof(int) + sizeof(std::pair<uint32_t, Roaring>) - sizeof(Roaring);
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree that counting bytes is hard. I wonder if this calculation has to have an accommodation for the "new" header... since map nodes are relatively small, the amount of header added by the storage allocator could be significant. I would imagine it's the size of a couple of pointers at least but I don't actually know.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, deciding exactly what to count is not always simple. Do you count the header? Do you count all the allocated bytes not just the "usable" bytes (often allocators will give a chunk of memory of a set size rather than the exact number of bytes)? These things are essentially not countable, at least not in a portable way (some systems have malloc_usable_size() which returns the latter value). I don't know that the first value is of much interest in general although of course if you are allocating tons of small things, it could be non-trivial. In general in that situation you'd probably move to something like jemalloc which uses a very different allocation model than the "traditional" one of allocating n+X bytes and returning a pointer into the buffer.

@madscientist
Copy link
Contributor Author

Re a visitor function: I'm not really sure how that could work. How can I "count bytes" from an external visitor function, without knowing all the internals: I need to know what type of container it is, and be able to extract the capacity, in order to figure out how many bytes it uses. It seems to me like if you already have all that information you're there's not much point in trying to move this out into a separate facility and incurring the overhead of a visitor function call per container. Maybe you didn't mean that the roaring user would be able to do this, but rather an implementation detail?

The implementation is one extra method so it's not much bloat. If there are a bunch of methods already existing which could be broken out into a separate interface and this was the straw that caused that to happen then that would be good. Else, I don't think inventing this infrastructure for this one function, before we know if there will be a lot more of this type of thing, would be worthwhile.

YMMV though!

@kosak
Copy link
Contributor

kosak commented Nov 18, 2022

To respond to your points: Yes, it would indeed be a visitor that iterates over the internals (as I said). Yes, I was assuming it would be available to roaring users, but they would need to be "power users" who are familiar with the internals and who can tolerate being exposed to the internals changing over versions. No, I don't think the overhead of per-container callbacks would be material. Yes, it's a bit of a weird idea and I can see that it would be a lot more palatable if there were two obvious use cases that would benefit from it rather than just one. Regarding motivation I think one should be cautious to put in functionality that is basically a "guesstimate", because it becomes part of the API and needs to be supported. For those reasons I was trying to brainstorm ways for you to get the hooks you need to calculate your guesstimate without having the project having to own that calculation forever.

@kosak
Copy link
Contributor

kosak commented Nov 18, 2022

Coming at it from a different perspective, maybe there are users who want a pluggable memory allocator, and the functionality you want could easily ride on top of that.

@madscientist
Copy link
Contributor Author

Regarding "guesstimates", the 32bit count is accurate. Unfortunately it's simply not possible to know the value exactly in Roaring64Map (maybe if/when we ever get C implementations of 64bit Roaring bitmaps and Roaring64 is just a wrapper on that, we can compute this accurately) but it's in the ballpark. If you mean it doesn't count the allocator overhead, I don't really want to count that.

I'm pretty sure it's not feasible for a pluggable memory allocator to compute this value, unless you were somehow going to say that every bitmap would live in its own tagged memory silo or something like that. A memory allocator has no idea that this instance of malloc() is obtaining memory for that instance of roaring.

Anyway, we are maintaining this patch for our needs. They may be too esoteric for general use, which is fine.

@kosak
Copy link
Contributor

kosak commented Nov 19, 2022

No, I was thinking that every bitmap would have (in C++) an optional pointer to a stateful allocator, and in C they would have the equivalent thing (function pointer plus void* for state). Then (in one extreme) each bitmap could have its own allocator, or (in the other extreme) all bitmaps could point to the same allocator, or anything in between.

By the way, I'm just trying to post some ideas and alternatives and constructive skepticism here. If the community or owners want this change, it's truly not my intention to discourage it.

@madscientist
Copy link
Contributor Author

It could be done but I'm pretty sure there was a long discussion a while back about custom allocators for Roaring and it was decided that something like this (adding two function pointers to every bitmap) was too much overhead. I could be misremembering though.

@lemire
Copy link
Member

lemire commented Nov 21, 2022

Bitmaps are made of containers, these individual containers can be shared between bitmaps, or moved from one bitmap to another. And some users use memory-file mapping. So we would need to carefully track and follow almost every allocated memory region. We also do not want users who do not need this feature to pay for it with overhead. Finally, it should not affect the maintainability of our code base too much, especially because it is a feature that few people need.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants