
Give PendingRow its BTreeMap back... or don't? 😶 #8788

Draft — wants to merge 1 commit into main
Conversation

teh-cmc (Member) commented Jan 23, 2025

This is a fun one: it's by all accounts a mistake, and therefore a bug... but it's actually pretty nice, so I'm not quite sure I want to fix it 😶.

I was looking into the micro-batcher's code for unrelated reasons, when I stumbled upon this:

let mut hasher = ahash::AHasher::default();
row.components
    .values()
    .for_each(|array| array.data_type().hash(&mut hasher));

Looks legit at first glance, except... row.components is a HashMap. Or an IntMap, rather. And HashMaps have a random iteration order per instance per execution (the order is then fixed for the lifetime of that instance).
The reason it's an IntMap is that I went a bit too far with my search-n-replace during #8207:

So, that raises the question... why on earth is the micro-batcher still working? The reason is that, by virtue of doing neither hashing nor salting, an IntMap ends up with stronger guarantees than a vanilla HashMap. Specifically, all IntMaps that share the same keys have the same iteration order, iff those keys were inserted in the same order.
Put differently, this test always passes:

#[test]
#[allow(clippy::zero_sized_map_values)]
fn intmap_order_is_well_defined() {
    let descriptors = [
        MyPoint::descriptor(),
        MyColor::descriptor(),
        MyLabel::descriptor(),
        MyPoint64::descriptor(),
        MyIndex::descriptor(),
    ];

    let expected: IntMap<ComponentDescriptor, ()> =
        descriptors.iter().cloned().map(|d| (d, ())).collect();
    let expected: Vec<_> = expected.into_keys().collect();

    for _ in 0..1_000 {
        let got: IntMap<ComponentDescriptor, ()> =
            descriptors.clone().into_iter().map(|d| (d, ())).collect();
        let got: Vec<_> = got.into_keys().collect();

        assert_eq!(expected, got);
    }
}

whereas this one will always fail, as you'd expect:

#[test]
#[allow(clippy::zero_sized_map_values)]
fn hashmap_order_is_well_defined() {
    let descriptors = [
        MyPoint::descriptor(),
        MyColor::descriptor(),
        MyLabel::descriptor(),
        MyPoint64::descriptor(),
        MyIndex::descriptor(),
    ];

    let expected: HashMap<ComponentDescriptor, ()> =
        descriptors.iter().cloned().map(|d| (d, ())).collect();
    let expected: Vec<_> = expected.into_keys().collect();

    for _ in 0..1_000 {
        let got: HashMap<ComponentDescriptor, ()> =
            descriptors.clone().into_iter().map(|d| (d, ())).collect();
        let got: Vec<_> = got.into_keys().collect();

        assert_eq!(expected, got);
    }
}
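The difference between the two tests comes down entirely to hasher state: std's HashMap seeds SipHash from a fresh, per-map RandomState, whereas an IntMap carries no random state at all, so bucket layout is a pure function of the keys and their insertion order. A stdlib-only sketch of such an identity-hashing map (the `IdentityHasher` type and the keys here are hypothetical stand-ins for `nohash_hasher::IntMap`):

```rust
use std::collections::HashMap;
use std::hash::{BuildHasherDefault, Hasher};

// A no-op "hasher": it passes integer keys through untouched and carries
// no random seed, unlike std's `RandomState`.
#[derive(Default)]
struct IdentityHasher(u64);

impl Hasher for IdentityHasher {
    fn finish(&self) -> u64 {
        self.0
    }
    fn write(&mut self, _bytes: &[u8]) {
        unreachable!("only integer keys are supported");
    }
    fn write_u64(&mut self, n: u64) {
        self.0 = n;
    }
}

// Hypothetical stand-in for `nohash_hasher::IntMap`.
type IntMap<V> = HashMap<u64, V, BuildHasherDefault<IdentityHasher>>;

fn main() {
    let keys = [42_u64, 7, 1_000, 3, 99];

    let expected: Vec<u64> = keys
        .iter()
        .map(|&k| (k, ()))
        .collect::<IntMap<()>>()
        .into_keys()
        .collect();

    for _ in 0..1_000 {
        let got: Vec<u64> = keys
            .iter()
            .map(|&k| (k, ()))
            .collect::<IntMap<()>>()
            .into_keys()
            .collect();
        // Same keys, same insertion order, no random state: same iteration order.
        assert_eq!(expected, got);
    }

    println!("stable iteration order: {expected:?}");
}
```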

And that's why batching still works: the order in which you insert your components into your PendingRow is always the same during the execution of your program.
Not only does it work, it's also 50% faster than the BTree-based approach, as we know. Eh.
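For contrast, the BTreeMap the title proposes giving back iterates in sorted key order, so its iteration order is deterministic unconditionally — no "iff inserted in the same order" caveat. A quick stdlib sketch (example keys and values are hypothetical):

```rust
use std::collections::BTreeMap;

fn main() {
    // Same entries, opposite insertion orders.
    let forward: BTreeMap<u32, &str> = [(1, "points"), (2, "colors"), (3, "labels")]
        .into_iter()
        .collect();
    let backward: BTreeMap<u32, &str> = [(3, "labels"), (2, "colors"), (1, "points")]
        .into_iter()
        .collect();

    // A BTreeMap always iterates in sorted key order, regardless of how
    // the entries were inserted.
    assert!(forward.iter().eq(backward.iter()));
    println!("{:?}", forward.keys().collect::<Vec<_>>()); // → [1, 2, 3]
}
```

That unconditional guarantee is what the slower BTree-based approach buys.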

@teh-cmc teh-cmc added 💬 discussion do-not-merge Do not merge this PR exclude from changelog PRs with this won't show up in CHANGELOG.md 🪵 Log & send APIs Affects the user-facing API for all languages labels Jan 23, 2025
@teh-cmc teh-cmc changed the title Give PendingRow its BTreeMap back Give PendingRow its BTreeMap back... or don't? 😶 Jan 23, 2025

github-actions bot commented Jan 23, 2025

Web viewer built successfully. If applicable, you should also test it:

  • I have tested the web viewer
Result: commit 5f99f4a — https://rerun.io/viewer/pr/8788 (manifest: +nightly, +main)

Note: This comment is updated whenever you push a commit.

emilk (Member) commented Jan 23, 2025

Interesting! Makes me want to enable clippy::iter_over_hash_type.

The purpose of the hashing is to "split the micro batches even further -- one sub-batch per unique set of datatypes."

Doesn't that mean that two PendingRows with the same datatypes, inserted in a different order, will get different hashes, making the rest of the code less efficient at batching?
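The concern is easy to reproduce with plain std types: the batcher folds datatypes into the hasher in iteration order, so any map whose iteration order depends on insertion order can split identical rows into different sub-batches. A minimal sketch (the function name and datatype strings are hypothetical; std's `DefaultHasher` stands in for `ahash`):

```rust
use std::collections::hash_map::DefaultHasher;
use std::hash::{Hash, Hasher};

// Fold a sequence of "datatypes" into one hash, in the order given --
// exactly what iterating a map's values and hashing each one does.
fn hash_in_iteration_order(datatypes: &[&str]) -> u64 {
    let mut hasher = DefaultHasher::new();
    for dt in datatypes {
        dt.hash(&mut hasher);
    }
    hasher.finish()
}

fn main() {
    // Same set of datatypes, two different orders.
    let a = hash_in_iteration_order(&["Float32", "Utf8", "UInt64"]);
    let b = hash_in_iteration_order(&["UInt64", "Utf8", "Float32"]);

    // The fold is order-sensitive, so rows with identical contents but
    // different iteration orders would land in different sub-batches.
    assert_ne!(a, b);
    println!("a = {a:#x}, b = {b:#x}");
}
```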

@teh-cmc teh-cmc marked this pull request as draft January 23, 2025 17:34