Optimizing the node location store #151
For the record, this is the arraylayout post part 2 and its paper: In case you see value in going with a different array layout solution for libosmium, it would be great to have something like that as a free-standing library / header (similar to protozero, for example). Prior art:
We certainly have the use case for these layouts coming up every now and then, too. /cc @TheMarex
@daniel-j-h The existing "index" code in libosmium is templated, so it can be used not only for ID->Location lookups, but also for ID->ID lookups or similar things. So this should already be generic enough for many use cases that come up when working with OSM data and might be usable in OSRM, too.
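For readers less familiar with that index code, here is a minimal sketch of how the templated index is typically wired up for ID->Location lookups, following the pattern of the libosmium examples (the input file name comes from the command line; any of the other index classes could be substituted for SparseMemArray):

```cpp
// Minimal sketch: fill an ID -> Location index from the nodes of an OSM file
// and let the location handler attach locations to way node references.
#include <osmium/handler/node_locations_for_ways.hpp>
#include <osmium/index/map/sparse_mem_array.hpp>
#include <osmium/io/any_input.hpp>
#include <osmium/osm/location.hpp>
#include <osmium/osm/types.hpp>
#include <osmium/visitor.hpp>

using index_type =
    osmium::index::map::SparseMemArray<osmium::unsigned_object_id_type, osmium::Location>;
using location_handler_type = osmium::handler::NodeLocationsForWays<index_type>;

int main(int argc, char* argv[]) {
    if (argc != 2) {
        return 1;
    }

    // Read nodes and ways; the handler stores each node's location in the
    // index and adds locations to the node references of every way.
    osmium::io::Reader reader{argv[1],
                              osmium::osm_entity_bits::node | osmium::osm_entity_bits::way};

    index_type index;
    location_handler_type location_handler{index};

    osmium::apply(reader, location_handler);
    reader.close();
}
```

Because the value type is a template parameter, the same index classes work for ID->ID mappings as mentioned above.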
After coding all of #343 I realized I had implemented some of the ideas discussed in this issue.
I see this is an older ticket, but I assume it's still an area of interest, since it's open. In tilemaker, I've implemented a node store for PBFs along the lines sketched below.
The sketch of the approach is that nodes are divided into a three-level hierarchy:

- Level 1 is groups: there are 256K groups.
- Level 2 is chunks: each group has up to 256 chunks.
- Level 3 is nodes: each chunk has up to 256 nodes.

This allows storing up to 2^34 nodes with a fixed overhead of only 2M -- the space required for the level 1 pointers. (And 2^35 nodes with a fixed overhead of 4M, and so on.)

Groups and chunks store their data sparsely, e.g. if a group has 7 chunks, it only uses storage for 7 chunks. In order to know which chunks/nodes are present in the group/chunk, each group/chunk has a 32-byte bitmask. We use popcnt to determine the physical offset at which a given chunk/node is stored. We optionally support using streamvbyte to compress the nodes. It's close to (but not quite) free in terms of CPU usage.

When constructing the store, each worker operates on a run of sequential PrimitiveBlocks. For example, in the Great Britain case, there are ~23,000 blocks of nodes. If we have 16 threads, we might assign blocks 1..1437 to thread 1, blocks 1438..2875 to thread 2, and so on. In this way, each thread can generally be confident that it "owns" a given group, and it only needs working memory to track the 65,536 nodes contained in that group before publishing them into the store.

There are exceptions: the groups seen in the first and last PrimitiveBlocks of each thread. We can't be confident that the thread is the only thread seeing nodes from such groups. Those nodes are set aside and published into the store at the end of creation.

We've found this to be a good general-purpose solution that scales up and down on different sizes of PBFs. Here are some example sizes, with compression and then without compression:
In the Monaco case, the overhead of the 32-byte bitmask in the group and chunk structures dominates, which is why we have 13.5 bytes/node. But that's fine: as the total number of nodes is small, the total memory used is small.

Would you be interested in a PR adding this to libosmium? It is more complex than the existing stores, so it has a higher burden to maintain. It also requires a property that I think libosmium does not currently have: the ability for a handler to operate on a sequential run of PrimitiveBlocks. Thus, adding this would require first adopting some change to make that possible. I'm not very familiar with libosmium, so I can't propose anything too good there.

In tilemaker, this is addressed by doing a first pass of the PBF that reads the block headers to find where the PrimitiveBlocks start. libosmium is a bit different: the coordinator reads the blocks sequentially and hands them off to worker threads. There are pros and cons to each approach. Tilemaker's approach adds some I/O during the initial pass to find PrimitiveBlocks. In the case of an SSD, this is manageable - we're issuing a lot of IOPS as we jump around, but we're only reading a small number of bytes, not the whole PBF.

Tilemaker also only concerns itself with PBFs. This idea could also be extended to XML extracts in the same spirit--divide the file into N parts, seek backwards until you find a top-level tag, start parsing from there. Unless libosmium already has some facility for that, I don't propose to add it/do anything here.

And, now that I've reached the end of this comment, I think I realize that perhaps this is a poor fit for libosmium for a different reason. In tilemaker, each thread does both the reading of the PBF and the inserting into the store. In libosmium, I think (? confirmation appreciated, I'm not very familiar with the code) it's more like:

- 1 thread reads the blocks from the PBF
- N threads decode and parse the blocks
- 1 thread runs the handler (and thus the store construction) over the parsed data
But ideally, we'd like the store construction code to be happening on N threads, not 1 thread. If I'm understanding things correctly, this would be a much bigger architectural change--I'm not sure if you'd be interested in that. Anyway, I've written it all out, so I may as well post it and see your thoughts. :)
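As an aside on the mechanics described a few comments up, here is a minimal, hypothetical sketch (not tilemaker's actual code; the struct name and sizes are illustrative) of how a 32-byte bitmask plus popcount maps a sparse chunk or node ID to the physical slot it is stored in:

```cpp
#include <array>
#include <bit>       // std::popcount (C++20)
#include <cstdint>
#include <optional>

// 256 possible entries tracked by a 32-byte bitmask: entry i is present
// iff bit i is set. Present entries are stored back-to-back, so a group
// with 7 chunks only spends storage on 7 chunks.
struct SparseDirectory256 {
    std::array<std::uint64_t, 4> bitmask{};  // 4 x 8 bytes = 32 bytes

    void mark_present(unsigned i) {
        bitmask[i / 64] |= std::uint64_t{1} << (i % 64);
    }

    // Physical slot of entry i = number of present entries before i,
    // or nothing if entry i is absent.
    std::optional<unsigned> physical_offset(unsigned i) const {
        const std::uint64_t bit = std::uint64_t{1} << (i % 64);
        if ((bitmask[i / 64] & bit) == 0) {
            return std::nullopt;
        }
        unsigned rank = 0;
        for (unsigned w = 0; w < i / 64; ++w) {
            rank += static_cast<unsigned>(std::popcount(bitmask[w]));
        }
        rank += static_cast<unsigned>(std::popcount(bitmask[i / 64] & (bit - 1)));
        return rank;
    }
};
```

A group would hold one such directory for its chunks and each chunk one for its nodes; two popcount-based lookups then take an ID straight to the stored location.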
Reading the code a bit more closely, it looks like libosmium maintains the order of blocks when calling the handler. Since the handler only runs on a single thread, that means I was wrong here:
That's great, I suspect that means this store could be implemented in libosmium as it stands today. This raises confusion in my mind about libosmium's multithreading story, but I'll open a separate issue to inquire about that.
I have implemented a node location store that's "better" than those available in libosmium for osm2pgsql (code); it is somewhat similar to yours. "Better" of course depends on the use case. I'd love to have more implementations of node location stores in libosmium, either optimized for specific use cases or, ideally, a better generic one that works with input files of all sizes.

There are several dimensions to keep in mind here. Some inputs are sorted, some are not. Sometimes you only need to build the index once and then query it afterwards. Sometimes you need to allow updates, possibly interleaving queries and updates. And sometimes you need this in memory only, sometimes on disk for a more permanent store. A solution doesn't have to cover all of these cases, but it has to be really clear about which ones it works for.

And I have been reluctant to add more external dependencies. External libraries come and go, some don't work with C++11, some are available only on some systems. Of course the situation has changed a lot since 2016 when I created this issue, so maybe some library I didn't want to use back then could be used now. (And we don't have to be stuck at C++11 forever either.) streamvbyte seems to be available in recent Debian/Ubuntu and Homebrew, but not for most other Linux distros and not for vcpkg, so it's kind of a mixed bag regarding availability. We'd need to have a closer look at what it brings to the table compared to the varint code from protozero (which we have anyway for the PBF parsing).
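For reference, a self-contained sketch of the kind of delta + zigzag + varint coding that protozero-style code provides (this is deliberately not protozero's API, just an illustration of the baseline streamvbyte would be compared against):

```cpp
#include <cstdint>
#include <vector>

// Zigzag maps signed deltas to small unsigned values so the varint stays short.
inline std::uint64_t zigzag(std::int64_t v) {
    return (static_cast<std::uint64_t>(v) << 1) ^ static_cast<std::uint64_t>(v >> 63);
}

// Classic LEB128-style varint: 7 bits per byte, high bit marks continuation.
inline void write_varint(std::vector<std::uint8_t>& out, std::uint64_t v) {
    while (v >= 0x80) {
        out.push_back(static_cast<std::uint8_t>(v) | 0x80);
        v >>= 7;
    }
    out.push_back(static_cast<std::uint8_t>(v));
}

// Encode a block of int32 coordinates as zigzag-varint deltas.
std::vector<std::uint8_t> encode_deltas(const std::vector<std::int32_t>& values) {
    std::vector<std::uint8_t> out;
    std::int64_t prev = 0;
    for (const std::int32_t v : values) {
        write_varint(out, zigzag(static_cast<std::int64_t>(v) - prev));
        prev = v;
    }
    return out;
}
```

Decoding such a block means walking byte by byte through every value in front of the one you want, which is exactly the lookup cost discussed below; streamvbyte keeps the length information in a separate control stream (2 bits per value) so the data bytes can be decoded with SIMD shuffles.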
Not sure if it helps, but I am happy using this. I didn't write it, but it never disappointed me when I used it (large and small datasets).
@felixguendling Interesting approach. Can you say something about the size of the resulting data/index?
I'm currently experimenting with a planet file from a few months ago and this is what it produced:
Okay, so about half the size of a flat array-type storage file. That's not bad. And it will be smaller for smaller datasets, which is also good of course. A problem could be the lookup time: if I saw this correctly in the code, it stores at most 1024 node locations in one "block", which means you have to decode, on average, about 1000 varints before you get to the one location you are interested in. And decoding varints is quite expensive.
That's true. It helps that often you're not interested in only one location but in several, and nodes referenced in a way tend to be stored in the same block (not always, but often). So in practice you usually don't pay the decoding of a block for only one location. Still, it makes sense to parallelize the decoding like it's done here. In the end, it's always a trade-off. An example how
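A hedged sketch of why that amortizes well in practice: if lookups cache the most recently decoded block, a way whose node references live in the same block pays the decoding cost only once. The block size, the ID-to-block mapping, and the dummy decode_block function here are made up for illustration; a real store maps IDs to blocks differently.

```cpp
#include <cstddef>
#include <cstdint>
#include <vector>

struct Location { std::int32_t x; std::int32_t y; };

// Stand-in for the real delta/varint block decoder: it just synthesizes
// 1024 locations so the sketch is self-contained and runnable.
std::vector<Location> decode_block(std::uint64_t block_id) {
    std::vector<Location> block(1024);
    for (std::size_t i = 0; i < block.size(); ++i) {
        block[i] = Location{static_cast<std::int32_t>(block_id),
                            static_cast<std::int32_t>(i)};
    }
    return block;
}

// Keeps the most recently decoded block around, so looking up the node
// references of a way usually decodes each block only once.
class CachedLookup {
public:
    Location get(std::uint64_t node_id) {
        const std::uint64_t block_id = node_id / 1024;  // simplistic ID -> block mapping
        if (block_id != cached_block_id_) {
            cached_block_ = decode_block(block_id);     // pay the decoding once ...
            cached_block_id_ = block_id;
        }
        return cached_block_[node_id % 1024];           // ... then index directly
    }

private:
    std::uint64_t cached_block_id_ = UINT64_MAX;
    std::vector<Location> cached_block_;
};
```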
Profiling shows that the sparse_mmap_array node location store is faster than others for small and medium datasets, but it still takes a huge amount of processing time. This is probably due to the binary search over a large array of data that totally defeats the CPU cache; basically every memory access is a cache miss. This could probably be optimized. Here are some ideas:
And here is some background reading:
We have to try different ideas and benchmark them.
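To make one of those ideas concrete: a hedged sketch (illustrative only, not libosmium code) of the Eytzinger layout from the arraylayout material referenced above, where the sorted ID array is rearranged into BFS order of an implicit binary tree so that the hot top levels of the search sit packed together at the front of the array.

```cpp
#include <bit>      // std::countr_one (C++20)
#include <cstddef>
#include <cstdint>
#include <vector>

// Rearrange a sorted array into Eytzinger (BFS) order; slot 0 stays unused,
// so 'tree' must have size sorted.size() + 1.
std::size_t build_eytzinger(const std::vector<std::uint64_t>& sorted,
                            std::vector<std::uint64_t>& tree,
                            std::size_t i = 0, std::size_t k = 1) {
    if (k <= sorted.size()) {
        i = build_eytzinger(sorted, tree, i, 2 * k);      // left subtree
        tree[k] = sorted[i++];                            // this node
        i = build_eytzinger(sorted, tree, i, 2 * k + 1);  // right subtree
    }
    return i;
}

// Returns the Eytzinger index of the first element >= key (a lower_bound),
// or 0 if all elements are smaller. The frequently visited top levels of the
// implicit tree share a few cache lines, unlike the first probes of a classic
// binary search over a huge array.
std::size_t eytzinger_lower_bound(const std::vector<std::uint64_t>& tree,
                                  std::uint64_t key) {
    std::size_t k = 1;
    while (k < tree.size()) {
        k = 2 * k + (tree[k] < key);
    }
    // Undo the trailing "go right" steps to recover the last node where we
    // went left, which is the answer.
    k >>= std::countr_one(k) + 1;
    return k;
}
```

Looking up a node ID would then be: find k with eytzinger_lower_bound, check tree[k] == id, and read the location from a second array laid out in the same order.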