E (95% range scan, 5% insert) performance #2

JohannesLichtenberger · 2020-03-12T19:22:40Z

Hi,

do you have any idea why range scans for 64Bit Integers are faster in Red/Black trees (TreeMap)?

BTW: Regarding the height they improved their trie with HOT, which especially should address the issue to reduce the height. I think they now use tries within the trie, basically.

Kind regards
Johannes

rohansuri · 2020-03-31T18:13:45Z

Thanks for your question. I didn't investigate it before, but now have. Here's my understanding:

TLDR

The reason is subtly stated in another paper.

ART outperforms the other indexes for all workloads and key
types except Scan/Insert, where its iteration requires more memory
access than the OpenBw-Tree.

Longer version

One way to intuitively think about why ART requires more memory accesses is that user keys are only stored in the leaf nodes whereas in RB trees every node stores a key. Which means ART has additional supporting links/nodes to get to those leaf nodes. Hence any traversal that requires you to go through a lot of next-user-keys would require passing through those supporting links/nodes.

I ran some experiments with the dataset of 100,000 random 64 bit integers. RBtree will have 2n links i.e. 200,000 since it has a node for every user key and every node has a link to its left, right child.
For ART it depends on the dataset. In this case the node counts and links are:

Node	#Nodes	#Links	#Nodes per level
Node4	20785	83140	L3=20193 L4=592
Node16	6252	100032	L3=6252
Node48	0	0
Node256	129	33024	L1=1, L2=128
LeafNode	100000		L3=4786, L4=94029, L5=1185

In total 216,196 links. Hence ART has about 16k more links than RB tree. More links means more load instructions and likely more cache misses. Height of ART is 5 whereas RBtree's height is 20. Point queries only require traversing the height hence ART is faster.

Range scan is a ceilingEntry call + a number of successor calls depending on scan length. The ceilingEntry involves a depth traversal just as in get and in some cases if you reach the bottom, a successor call. A scan of length 0 will only require a ceilingEntry call largely involving the depth traversal, in which case ART should be faster because of smaller depth. As we increase the scan length, more of those supporting links need to be traversed. For example if the present user key came from the 2nd child of a Node4, the next user key is simply the next child in the same Node4. Similarly the next child. After this we're required to visit Node4's parent and go through the supporting links. And this seems to be the case for this dataset as most of the leaf nodes are at L4 and have Node4 as parents on L3. Which means scan length of 4 should be the turning point and from the plots it does look like it is.

Plots of perf events.

For better understanding of the traversal, here's a raw dump of the moves taken to get to the next key for a single YCSB scan request of length 50 in both ART, TreeMap.

It'd be nice to somehow derive a theoretical worst case bound for ART for the number of links visited depending on n, span, length and maybe a distribution factor. In case of RBTree it is straightforward as the tree structure is dependent only on n.

Hope that explains why ART is slower for range scans as compared to RBTree. Let me know if you have further questions.

JohannesLichtenberger · 2020-04-01T06:59:51Z

2n links for binary trees? Isn't it n + 1 (with null links)?

Hm, I think you're right. However, in binary trees you also have to go up even more links sometimes during range-scans.

But I think in comparison to a binary tree a cache oblivious B+ tree might be even way better because of the locality and CPU caches. That is, I don't know how a cache oblivious indirect binary tree compares (with just one array and without direct pointers), which is built using the Van Emde Boas layout.

BTW: Any plans to kind of "advance" your ART to a HOT?

https://dbis.uibk.ac.at/sites/default/files/2018-04/hot-height-optimized-author-version.pdf

JohannesLichtenberger · 2020-04-01T07:42:16Z

On another note, do you have any recommendations about the JVM implementations and hardware support (for instance SIMD instructions)? Seems in-memory it's now all about using cache-lines efficiently, but I don't know what's possible with JVM based languages nowadays. Also, I'm not sure if it makes sense to use Azul Zing or the GraalVM (the latter makes sense for me, for instance as I'm using lambdas often times, and small intermediate objects seem to be not that costly).

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

E (95% range scan, 5% insert) performance #2

E (95% range scan, 5% insert) performance #2

JohannesLichtenberger commented Mar 12, 2020

rohansuri commented Mar 31, 2020

JohannesLichtenberger commented Apr 1, 2020 •

edited

Loading

JohannesLichtenberger commented Apr 1, 2020 •

edited

Loading

E (95% range scan, 5% insert) performance #2

E (95% range scan, 5% insert) performance #2

Comments

JohannesLichtenberger commented Mar 12, 2020

rohansuri commented Mar 31, 2020

TLDR

Longer version

JohannesLichtenberger commented Apr 1, 2020 • edited Loading

JohannesLichtenberger commented Apr 1, 2020 • edited Loading

JohannesLichtenberger commented Apr 1, 2020 •

edited

Loading

JohannesLichtenberger commented Apr 1, 2020 •

edited

Loading