Optimizing the node location store #151

Open
joto opened this issue Apr 28, 2016 · 11 comments

@joto (Member) commented Apr 28, 2016

Profiling shows that the sparse_mmap_array node location store is faster than the others for small and medium datasets, but it still takes a huge amount of processing time. This is probably due to the binary search over a large array, which defeats the CPU cache: essentially every memory access is a cache miss.

This could probably be optimized. Here are some ideas:

  • Store node IDs and locations not as pairs but in separate memory areas: find the ID first, which gives us the offset into the array, then get the location with a single lookup.
  • Store node IDs not as full 64-bit values but use several arrays, each holding a subset.
  • Have some kind of compact lookup table with pointers into the main table to narrow down the search space.
  • Do a linear search instead of a binary search when we are "near" the ID we are looking for, for better cache performance.
  • Remember the position of the last ID(s) we looked up in a mini-cache and use it as the starting point for the next search. Because node IDs in ways are often close together, this could be a huge speedup (see the sketch at the end of this comment).

And here is some background reading:

We have to try different ideas and benchmark them.
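
To make the last idea a bit more concrete, here is a rough sketch of a mini-cache combined with a local linear scan (hypothetical names, not existing libosmium code):

```cpp
// Rough sketch only, not existing libosmium code: remember where the last
// lookup landed and, because node IDs in a way are usually close together,
// try a short linear scan around that position before falling back to a
// full binary search over the sorted (ID, location) array.
#include <algorithm>
#include <cstdint>
#include <utility>
#include <vector>

struct sparse_index_sketch {
    std::vector<std::pair<uint64_t, uint64_t>> entries; // sorted by node ID
    std::size_t last_pos = 0; // position of the previous successful lookup

    // Returns the location for the given ID, or ~0ull if not found.
    uint64_t get(uint64_t id) {
        constexpr std::size_t window = 64; // linear-scan window around the last hit
        const std::size_t begin = last_pos > window ? last_pos - window : 0;
        const std::size_t end = std::min(last_pos + window, entries.size());

        // Cheap, cache-friendly linear scan near the last position first.
        for (std::size_t i = begin; i < end; ++i) {
            if (entries[i].first == id) {
                last_pos = i;
                return entries[i].second;
            }
        }

        // Fall back to a binary search over the whole array.
        const auto it = std::lower_bound(entries.begin(), entries.end(), id,
            [](const auto& e, uint64_t value) { return e.first < value; });
        if (it == entries.end() || it->first != id) {
            return ~0ull;
        }
        last_pos = static_cast<std::size_t>(it - entries.begin());
        return it->second;
    }
};
```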

@daniel-j-h (Contributor)

For the record, this is the arraylayout post part 2 and its paper:

In case you see value in going with a different array layout solution for libosmium, it would be great to have something like that as a free-standing library / header (similar to protozero, for example).

Prior art:

We certainly have the use case for these layouts coming up every now and then, too. /cc @TheMarex

@joto (Member, Author) commented Apr 28, 2016

@daniel-j-h The existing "index" code in libosmium is templated, so it can be used not only for ID->Location lookups, but also for ID->ID lookups or similar things. So it should already be generic enough for many of the use cases that come up when working with OSM data and might be usable in OSRM, too.

@vlopezferrando

After coding all of #343 I realized I had implemented some of the ideas discussed in this issue.

@cldellow

I see this is an older ticket, but I assume it's still an area of interest, since it's open. In tilemaker, I've implemented a node store for PBFs with Sort.Type_then_ID that has these general characteristics:

  • fixed overhead of a few MB
  • variable overhead ranging from ~5.5 bytes/node to ~8.5 bytes/node
  • creation does not require sorting, and uses a fixed amount of working memory per thread
  • it is B-tree-esque, so lookups require a fixed number of indirections, not O(log n) as with a binary search

The sketch of the approach is that nodes are divided into a three-level hierarchy:

  • Level 1 is groups: there are 256K groups.
  • Level 2 is chunks: each group has up to 256 chunks.
  • Level 3 is nodes: each chunk has up to 256 nodes.

This allows storing up to 2^34 nodes with a fixed overhead of only 2 MB -- the space required for the level 1 pointers. (And 2^35 nodes with a fixed overhead of 4 MB, and so on.)

Groups and chunks store their data sparsely, e.g. if a group has 7 chunks, it only uses storage for 7 chunks. In order to know which chunks/nodes are present in the group/chunk, each group/chunk has a 32-byte bitmask. We use popcnt to determine the physical offset in which a given chunk/node is stored.
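
To make the bitmask/popcnt part concrete, here is a minimal sketch of the index arithmetic (hypothetical names and layout, not the actual tilemaker code; __builtin_popcountll assumes GCC/Clang):

```cpp
// Sketch only, not the actual tilemaker code: a 256-bit presence bitmask
// says which of the up-to-256 children exist, and a popcount over the bits
// below a child's index gives its slot in the sparsely stored array.
#include <cstdint>

struct bitmask256 {
    uint64_t words[4] = {0, 0, 0, 0}; // the 32-byte bitmask

    bool test(uint8_t i) const {
        return (words[i >> 6] >> (i & 63)) & 1u;
    }

    // Number of set bits strictly below position i == physical offset of child i.
    uint32_t rank(uint8_t i) const {
        uint32_t r = 0;
        for (unsigned w = 0; w < static_cast<unsigned>(i >> 6); ++w) {
            r += static_cast<uint32_t>(__builtin_popcountll(words[w]));
        }
        const uint64_t below = (uint64_t{1} << (i & 63)) - 1;
        r += static_cast<uint32_t>(__builtin_popcountll(words[i >> 6] & below));
        return r;
    }
};

// Decompose a node ID into group / chunk / node indices for the
// 256K x 256 x 256 hierarchy described above (IDs up to 2^34).
inline void split_id(uint64_t id, uint32_t& group, uint8_t& chunk, uint8_t& node) {
    node  = static_cast<uint8_t>(id & 0xff);
    chunk = static_cast<uint8_t>((id >> 8) & 0xff);
    group = static_cast<uint32_t>(id >> 16);
}
```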

We optionally support using streamvbyte to compress the nodes. It's close to (but not quite) free in terms of CPU usage.
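
For reference, the compression step could look roughly like this, assuming the C API from lemire/streamvbyte (streamvbyte_encode / streamvbyte_decode); a sketch, not the actual tilemaker code:

```cpp
// Sketch only: compress/decompress a chunk's 32-bit values with streamvbyte.
// Assumes the C API from https://github.com/lemire/streamvbyte.
#include <cstdint>
#include <vector>
#include <streamvbyte.h>

std::vector<uint8_t> compress_chunk(const std::vector<uint32_t>& values) {
    std::vector<uint8_t> out(streamvbyte_max_compressedbytes(
        static_cast<uint32_t>(values.size())));
    const size_t used = streamvbyte_encode(
        values.data(), static_cast<uint32_t>(values.size()), out.data());
    out.resize(used);
    return out;
}

std::vector<uint32_t> decompress_chunk(const std::vector<uint8_t>& bytes,
                                       uint32_t count) {
    std::vector<uint32_t> values(count);
    streamvbyte_decode(bytes.data(), values.data(), count);
    return values;
}
```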

When constructing the store, each worker operates on a run of sequential PrimitiveBlocks. For example, in the Great Britain case, there are ~23,000 blocks of nodes. If we have 16 threads, we might assign blocks 1..1437 to thread 1, blocks 1438..2875 to thread 2, and so on.

In this way, each thread can generally be confident that it "owns" a given group, and it only needs working memory to track the 65,536 nodes contained in that group before publishing them into the store. There are exceptions: the groups seen in the first and last PrimitiveBlocks of each thread. We can't be confident that the thread is the only thread seeing nodes from such groups. Those nodes are set aside and published into the store at the end of creation.
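
In pseudo-C++, the partitioning is roughly this (hypothetical names, not the actual tilemaker code); the groups touched by the first and last block of each run are the ones that get set aside:

```cpp
// Sketch only: split the node-carrying PrimitiveBlocks into one contiguous
// run per thread. Groups seen in the first/last block of a run may be shared
// with a neighbouring run and are handled separately at the end of creation.
#include <algorithm>
#include <cstddef>
#include <utility>
#include <vector>

std::vector<std::pair<std::size_t, std::size_t>> // [first_block, last_block)
split_blocks(std::size_t num_blocks, std::size_t num_threads) {
    std::vector<std::pair<std::size_t, std::size_t>> runs;
    const std::size_t per_thread = (num_blocks + num_threads - 1) / num_threads;
    for (std::size_t first = 0; first < num_blocks; first += per_thread) {
        runs.emplace_back(first, std::min(first + per_thread, num_blocks));
    }
    return runs;
}
```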

We've found this to be a good general-purpose solution that scales up and down on different sizes of PBFs. Here are some example sizes, with and without compression (total bytes, with bytes/node in parentheses):

Extract          Groups    Chunks      Nodes         Compressed            Uncompressed
North America    169482    18364343    1757589784    9706167278 (5.52)     14916095182 (8.48)
Great Britain    163074    4871807     184655287     1104024510 (5.97)     1708093150 (9.25)
Nova Scotia      26777     157927      12104733      70337950 (5.81)       105367598 (8.7)
Monaco           1196      2449        30477         318114 (10.43)        412258 (13.52)

In the Monaco case, the overhead of the 32-byte bitmask in the group and chunk structures dominates, which is why we have 13.5 bytes/node. But that's fine: as the total number of nodes is small, the total memory used is small.

Would you be interested in a PR adding this to libosmium? It is more complex than the existing stores, so it has a higher maintenance burden.

It also requires a property that I think libosmium does not currently have: the ability for a handler to operate on a sequential run of PrimitiveBlocks.

Thus, adding this would require first adopting some change to make that possible. I'm not very familiar with libosmium, so I can't propose anything concrete there.

In tilemaker, this is addressed by doing a first pass of the PBF that reads the BlobHeader structures. We then distribute work based on that, and each worker is responsible for reading the underlying PrimitiveBlock.

libosmium is a bit different: the coordinator reads the Blob, and then distributes those.

There are pros and cons to each approach. Tilemaker's approach adds some I/O during the initial pass to find the PrimitiveBlocks. On an SSD this is manageable: we issue a lot of IOPS as we jump around, but we only read a small number of bytes, not the whole PBF.

Tilemaker also only concerns itself with PBFs. The idea could be extended to XML extracts that are sorted the same way: divide the file into N parts, seek backwards until you find a top-level tag, and start parsing from there. Unless libosmium already has some facility for that, I don't propose to add it or do anything about it here.

And, now that I've reached the end of this comment, I think I realize that perhaps this is a poor fit for libosmium for a different reason.

In tilemaker, each thread does both reading the PBF and inserting into the store.

In libosmium, I think (? confirmation appreciated, I'm not very familiar with the code) it's more like:

  • 1 thread reads Blobs and distributes them to
  • N threads that decompress and parse the Blobs and pass the results to
  • 1 thread that runs the handler

But ideally, we'd like the store construction code to run on N threads, not 1 thread. If I'm understanding things correctly, this would be a much bigger architectural change--I'm not sure if you'd be interested in that.

Anyway, I've written it all out, so may as well post it and see your thoughts. :)

@cldellow

Reading the code a bit more closely, it looks like libosmium maintains the order of blocks when calling the handler. Since the handler only runs on a single thread, that means I was wrong here:

It also requires a property that I think libosmium does not currently have: the ability for a handler to operate on a sequential run of PrimitiveBlocks.

That's great: I suspect it means this store could be implemented in libosmium as it stands today. It does leave me a bit confused about libosmium's multithreading story, but I'll open a separate issue to ask about that.

@joto (Member, Author) commented Jan 4, 2024

I have implemented a node location store for osm2pgsql (code) that's "better" than those available in libosmium and somewhat similar to yours. "Better", of course, depends on the use case.

I'd love to have more implementations of node location stores in libosmium, either optimized for specific use cases or, ideally, a better generic one that works with input files of all sizes. There are several dimensions to keep in mind here. Some inputs are sorted, some are not. Sometimes you only need to build the index once and then query it afterwards; sometimes you need to allow updates, possibly interleaving queries and updates. And sometimes you need this in memory only, sometimes on disk for a more permanent store. A solution doesn't have to cover all of these, but it has to be really clear about which cases it works for.

And I have been reluctant to add more external dependencies. External libraries come and go, some don't work with C++11, and some are available only on some systems. Of course the situation has changed a lot since 2016 when I created this issue, so maybe some library I didn't want to use back then could be used now. (And we don't have to be stuck at C++11 forever either.) streamvbyte seems to be available in recent Debian/Ubuntu and Homebrew but not for most other Linux distros and not for vcpkg, so it's kind of a mixed bag regarding availability. We'd need to take a closer look at what it brings to the table compared to the varint code from protozero (which we have anyway for the PBF parsing).

@felixguendling

Not sure if it helps, but I am happily using this. I didn't write it, but it has never disappointed me, on both large and small datasets.

@joto (Member, Author) commented Mar 15, 2024

@felixguendling Interesting approach. Can you say something about the size of the resulting data/index?

@felixguendling

I'm currently experimenting with a planet file from a few months ago, and this is what it produced:

$ ls -la
total 40G
-rw-rw-r-- 1 felix felix  40G Mär 15 18:38 dat.bin
-rw-rw-r-- 1 felix felix 113M Mär 15 18:37 idx.bin

@joto (Member, Author) commented Mar 16, 2024

Okay, so about half the size of a flat array-type storage file. That's not bad. And it will be smaller for smaller datasets, which is also good, of course. The problem could be the lookup time. If I read the code correctly, it stores at most 1024 node locations in one "block", which means you have to decode, on average, about 1000 varints before you get to the one location you are interested in. And decoding varints is quite expensive.
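
For illustration, this is the kind of work I mean (generic delta + varint decoding, not the actual code of that library):

```cpp
// Sketch only: with delta + varint encoded locations, reaching entry k in a
// block means decoding every varint before it, which is what makes random
// access into large blocks expensive.
#include <cstddef>
#include <cstdint>

inline uint64_t decode_varint(const uint8_t*& p) {
    uint64_t value = 0;
    unsigned shift = 0;
    while (*p & 0x80u) {
        value |= static_cast<uint64_t>(*p++ & 0x7fu) << shift;
        shift += 7;
    }
    value |= static_cast<uint64_t>(*p++) << shift;
    return value;
}

// Decode the k-th (x, y) pair of a delta/zigzag encoded block: every
// preceding pair has to be decoded first.
inline void location_at(const uint8_t* p, std::size_t k, int64_t& x, int64_t& y) {
    int64_t cur_x = 0;
    int64_t cur_y = 0;
    for (std::size_t i = 0; i <= k; ++i) {
        const uint64_t dx = decode_varint(p);
        const uint64_t dy = decode_varint(p);
        cur_x += static_cast<int64_t>(dx >> 1) ^ -static_cast<int64_t>(dx & 1);
        cur_y += static_cast<int64_t>(dy >> 1) ^ -static_cast<int64_t>(dy & 1);
    }
    x = cur_x;
    y = cur_y;
}
```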

@felixguendling

That's true. It helps that you're often interested in several locations, not just one, and the nodes referenced in a way tend to be stored in the same block (not always, but often). So in practice you usually don't pay the cost of decoding a block for only one location. Still, it makes sense to parallelize the decoding like it's done here. In the end, it's always a trade-off. An example of how hybrid_node_idx changes the performance characteristics can be found here (though the decoding is not yet parallelized there).
