
Give PendingRow its BTreeMap back... or don't? 😶 #8788

Draft — wants to merge 1 commit into main
Conversation

teh-cmc (Member) commented Jan 23, 2025

This is a fun one: it's by all accounts a mistake, and therefore a bug... but it's actually pretty nice, so I'm not quite sure I want to fix it 😶.

I was looking into the micro-batcher's code for unrelated reasons, when I stumbled upon this:

let mut hasher = ahash::AHasher::default();
row.components
    .values()
    .for_each(|array| array.data_type().hash(&mut hasher));

Looks legit at first glance, except... row.components is a HashMap. Or an IntMap, rather. And HashMaps have a random iteration order per instance per execution (the order is then fixed for the lifetime of that instance).
The reason it's an IntMap is that I went a bit too far with my search-n-replace during #8207:

So, that raises the question... why on earth is the micro-batcher still working? The reason is that, by virtue of doing neither hashing nor salting, an IntMap ends up with stronger guarantees than a vanilla HashMap. Specifically, all IntMaps that share the same keys have the same iteration order, iff those keys were inserted in the same order.
Put differently, this test always passes:

#[test]
#[allow(clippy::zero_sized_map_values)]
fn intmap_order_is_well_defined() {
    let descriptors = [
        MyPoint::descriptor(),
        MyColor::descriptor(),
        MyLabel::descriptor(),
        MyPoint64::descriptor(),
        MyIndex::descriptor(),
    ];

    let expected: IntMap<ComponentDescriptor, ()> =
        descriptors.iter().cloned().map(|d| (d, ())).collect();
    let expected: Vec<_> = expected.into_keys().collect();

    for _ in 0..1_000 {
        let got: IntMap<ComponentDescriptor, ()> =
            descriptors.clone().into_iter().map(|d| (d, ())).collect();
        let got: Vec<_> = got.into_keys().collect();

        assert_eq!(expected, got);
    }
}

whereas this one will always fail, as you'd expect:

#[test]
#[allow(clippy::zero_sized_map_values)]
fn hashmap_order_is_well_defined() {
    let descriptors = [
        MyPoint::descriptor(),
        MyColor::descriptor(),
        MyLabel::descriptor(),
        MyPoint64::descriptor(),
        MyIndex::descriptor(),
    ];

    let expected: HashMap<ComponentDescriptor, ()> =
        descriptors.iter().cloned().map(|d| (d, ())).collect();
    let expected: Vec<_> = expected.into_keys().collect();

    for _ in 0..1_000 {
        let got: HashMap<ComponentDescriptor, ()> =
            descriptors.clone().into_iter().map(|d| (d, ())).collect();
        let got: Vec<_> = got.into_keys().collect();

        assert_eq!(expected, got);
    }
}
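The difference between the two tests comes down entirely to hasher state: std's HashMap seeds SipHash from a fresh, per-map RandomState, whereas an IntMap carries no random state at all, so bucket layout is a pure function of the keys and their insertion order. A stdlib-only sketch of such an identity-hashing map (the `IdentityHasher` type and the keys here are hypothetical stand-ins for `nohash_hasher::IntMap`):

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

// A no-op "hasher": it passes integer keys through untouched and carries
// no random seed, unlike std's `RandomState`.
#[derive(Default)]
struct IdentityHasher(u64);

impl Hasher for IdentityHasher {
    fn finish(&self) -> u64 {
        self.0
    }
    fn write(&mut self, _bytes: &[u8]) {
        unreachable!("only integer keys are supported");
    }
    fn write_u64(&mut self, n: u64) {
        self.0 = n;
    }
}

// Hypothetical stand-in for `nohash_hasher::IntMap`.
type IntMap<V> = HashMap<u64, V, BuildHasherDefault<IdentityHasher>>;

fn main() {
    let keys = [42_u64, 7, 1_000, 3, 99];

    let expected: Vec<u64> = keys
        .iter()
        .map(|&k| (k, ()))
        .collect::<IntMap<()>>()
        .into_keys()
        .collect();

    for _ in 0..1_000 {
        let got: Vec<u64> = keys
            .iter()
            .map(|&k| (k, ()))
            .collect::<IntMap<()>>()
            .into_keys()
            .collect();
        // Same keys, same insertion order, no random state: same iteration order.
        assert_eq!(expected, got);
    }

    println!("stable iteration order: {expected:?}");
}
```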

And that's why batching still works: the order in which you insert your components into your PendingRow is always the same during the execution of your program.
Not only does it work, it's also 50% faster than the BTree-based approach, as we know. Eh.
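For contrast, the BTreeMap the title proposes giving back iterates in sorted key order, so its iteration order is deterministic unconditionally — no "iff inserted in the same order" caveat. A quick stdlib sketch (example keys and values are hypothetical):

```rust
use std::collections::BTreeMap;

fn main() {
    // Same entries, opposite insertion orders.
    let forward: BTreeMap<u32, &str> = [(1, "points"), (2, "colors"), (3, "labels")]
        .into_iter()
        .collect();
    let backward: BTreeMap<u32, &str> = [(3, "labels"), (2, "colors"), (1, "points")]
        .into_iter()
        .collect();

    // A BTreeMap always iterates in sorted key order, regardless of how
    // the entries were inserted.
    assert!(forward.iter().eq(backward.iter()));
    println!("{:?}", forward.keys().collect::<Vec<_>>()); // → [1, 2, 3]
}
```

That unconditional guarantee is what the slower BTree-based approach buys.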

@teh-cmc teh-cmc added 💬 discussion do-not-merge Do not merge this PR exclude from changelog PRs with this won't show up in CHANGELOG.md 🪵 Log & send APIs Affects the user-facing API for all languages labels Jan 23, 2025
@teh-cmc teh-cmc changed the title Give PendingRow its BTreeMap back Give PendingRow its BTreeMap back... or don't? 😶 Jan 23, 2025

github-actions bot commented Jan 23, 2025

Web viewer built successfully. If applicable, you should also test it:

  • I have tested the web viewer
Result: commit 5f99f4a — https://rerun.io/viewer/pr/8788 (manifest: +nightly, +main)

Note: This comment is updated whenever you push a commit.

emilk (Member) commented Jan 23, 2025

Interesting! Makes me want to enable clippy::iter_over_hash_type.

The purpose of the hashing is to "split the micro batches even further -- one sub-batch per unique set of datatypes."

Doesn't that mean that two PendingRows with the same datatypes, inserted in a different order, will get different hashes, making the rest of the code less efficient at batching?
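The concern is easy to reproduce with plain std types: the batcher folds datatypes into the hasher in iteration order, so any map whose iteration order depends on insertion order can split identical rows into different sub-batches. A minimal sketch (the function name and datatype strings are hypothetical; std's `DefaultHasher` stands in for `ahash`):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Fold a sequence of "datatypes" into one hash, in the order given --
// exactly what iterating a map's values and hashing each one does.
fn hash_in_iteration_order(datatypes: &[&str]) -> u64 {
    let mut hasher = DefaultHasher::new();
    for dt in datatypes {
        dt.hash(&mut hasher);
    }
    hasher.finish()
}

fn main() {
    // Same set of datatypes, two different orders.
    let a = hash_in_iteration_order(&["Float32", "Utf8", "UInt64"]);
    let b = hash_in_iteration_order(&["UInt64", "Utf8", "Float32"]);

    // The fold is order-sensitive, so rows with identical contents but
    // different iteration orders would land in different sub-batches.
    assert_ne!(a, b);
    println!("a = {a:#x}, b = {b:#x}");
}
```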

@teh-cmc teh-cmc marked this pull request as draft January 23, 2025 17:34