Add a batch iterator for the Vamana Indexes #64

ibhati · 2024-12-13T01:40:55Z

This implements the groundwork for a batch iterator for the low-level Vamana indexes. Conceptually, the batch iterator allows searches to be restarted, returning a new batch of k nearest neighbors that have not yet been yielded.

The batch iterator provides the C++ iterator interfaces begin() and end() over a buffer of new IDs. To yield new neighbors, search is effectively restarted with the search window size and search buffer capacity incremented by k, ensuring that at least k new IDs can be obtained that were not part of previous searches. Filtering is performed post-search to ensure that only unique IDs are available on the next calls to begin() and end().
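The restart-and-filter scheme can be illustrated with a minimal, self-contained sketch. Nothing here is the SVS API: `toy_search`, `ToyBatchIterator`, and the one-dimensional data are hypothetical stand-ins, and a brute-force scan replaces the graph search so the example stays short.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

using Id = std::uint32_t;

// Hypothetical stand-in for an index search: returns the IDs of the `window`
// points closest to `query`, nearest first. A real Vamana index would run a
// graph search with this window size instead of a brute-force scan.
inline std::vector<Id>
toy_search(const std::vector<float>& data, float query, std::size_t window) {
    std::vector<Id> ids(data.size());
    for (std::size_t i = 0; i < ids.size(); ++i) {
        ids[i] = static_cast<Id>(i);
    }
    std::stable_sort(ids.begin(), ids.end(), [&](Id a, Id b) {
        return std::fabs(data[a] - query) < std::fabs(data[b] - query);
    });
    ids.resize(std::min(window, ids.size()));
    return ids;
}

// Hypothetical batch iterator (not the SVS class). `data` must outlive it.
class ToyBatchIterator {
  public:
    ToyBatchIterator(const std::vector<float>& data, float query, std::size_t k)
        : data_{&data}
        , query_{query}
        , k_{k} {
        assert(k > 0);
        next(); // this sketch kick-starts an initial search on construction
    }

    // Restart the search with the window grown by k, then keep only the IDs
    // that no earlier batch has yielded.
    void next() {
        window_ += k_;
        batch_.clear();
        for (Id id : toy_search(*data_, query_, window_)) {
            if (yielded_.insert(id).second) {
                batch_.push_back(id);
            }
        }
    }

    std::vector<Id>::const_iterator begin() const { return batch_.begin(); }
    std::vector<Id>::const_iterator end() const { return batch_.end(); }

  private:
    const std::vector<float>* data_;
    float query_;
    std::size_t k_;
    std::size_t window_ = 0;
    std::unordered_set<Id> yielded_;
    std::vector<Id> batch_;
};
```

In this toy version an empty batch signals that every point has been yielded, since the search always returns as many IDs as the window allows until the data runs out.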

There are some low-hanging performance wins to be had.

The current implementation basically restarts search every single time. Kick-starting search with previously yielded neighbors is probably a good idea, but is currently lacking an API on the index side.
The search scratchspace is currently not cached between runs. This means that calls to next() must refix the query and allocate scratchspace as needed. The scratchspace is currently lacking an API to enable adapting to new search parameters. Once this is in place, caching should be straightforward to implement.
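As a rough illustration of the second point, a scratchspace that can adapt to new search parameters might look like the sketch below. All names (`ToySearchParams`, `ToyNeighbor`, `ToyScratch`, `adapt`) are hypothetical and not part of SVS; the sketch only assumes that the scratchspace owns a neighbor buffer whose allocation can be kept across calls to next().

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Hypothetical search parameters; the field names mirror the PR text.
struct ToySearchParams {
    std::size_t window_size;
    std::size_t buffer_capacity;
};

struct ToyNeighbor {
    std::uint32_t id;
    float distance;
};

// Hypothetical scratchspace that is cached between searches.
class ToyScratch {
  public:
    // Adopt new parameters without discarding the existing allocation:
    // clear() keeps the storage and reserve() only grows it when needed.
    void adapt(const ToySearchParams& params) {
        assert(params.window_size <= params.buffer_capacity);
        params_ = params;
        buffer_.clear();
        buffer_.reserve(params.buffer_capacity);
    }

    std::size_t window_size() const { return params_.window_size; }
    std::size_t allocated() const { return buffer_.capacity(); }
    std::vector<ToyNeighbor>& buffer() { return buffer_; }

  private:
    ToySearchParams params_{0, 0};
    std::vector<ToyNeighbor> buffer_;
};
```

With something along these lines, an iterator could hold one scratchspace for its lifetime and call `adapt` with the enlarged window before each restart, rather than allocating a fresh one.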

Some other thoughts

Single search for the dynamic index is awkward. Single search uses the provided scratchspace, and results are extracted from this scratchspace post-search. However, the dynamic index uses different internal and external IDs with different bit-widths. This makes it impossible to reuse the search buffer to store translated IDs. Currently, the iterator needs to detect whether ID translation is required and perform the translation manually. Options are:

1. Augment the scratchspace with room for translated neighbors. I don't like this because (A) this would not be used for batch searches and (B) it might result in excess moving around of data, which makes me sad.
2. Use some kind of lazy iterator to translate the IDs as they are extracted from the scratchspace. I think this approach is cool, but it requires the scratchspace to contain a reference to the ID translation struct, which does not feel like the move to me.
3. Require single-search to provide a destination span into which results will be written, making our current version of "search" for single queries more internal.

Of these, I like 3 the most as it has nice symmetry with our existing batch search functions.
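A sketch of what the destination-span option could look like is given below. The types and the function `write_translated` are hypothetical, and the ID widths are assumed for illustration only. A pointer-and-length pair stands in for the span so that the example compiles as C++17; with C++20, `std::span<ExternalId>` would be the natural parameter type.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

using InternalId = std::uint32_t; // assumed width, for illustration only
using ExternalId = std::uint64_t; // assumed width, for illustration only

// Hypothetical internal-to-external ID translator.
struct ToyTranslator {
    std::unordered_map<InternalId, ExternalId> internal_to_external;

    ExternalId operator()(InternalId id) const {
        return internal_to_external.at(id);
    }
};

// Hypothetical single-query result extraction: copy up to `capacity` results
// from the search buffer into caller-provided storage, translating each ID on
// the way out. Returns the number of results written.
template <typename Translate>
std::size_t write_translated(
    const std::vector<InternalId>& search_buffer,
    ExternalId* dst,
    std::size_t capacity,
    const Translate& translate
) {
    assert(dst != nullptr || capacity == 0);
    const std::size_t n = std::min(capacity, search_buffer.size());
    for (std::size_t i = 0; i < n; ++i) {
        dst[i] = translate(search_buffer[i]);
    }
    return n;
}
```

Because the destination holds external IDs, the search buffer never has to store the wider type, and the scratchspace does not need a reference to the translator.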

Remaining Tasks

Progressive output generation for benchmarking (saving results as they become available).
Add documentation page for the iterator and schedules.
Add APIs for the iterator to assign a new query and reset the internal state.
Detect when an iterator has exhausted the elements in the index, provide an API for detecting this, and use this information to short-circuit searches that would return no neighbors anyway.
Allow the iterator to be constructed without kick-starting an initial search.
Enable Python bindings support for the batch iterator.
Eager error checking for number of groundtruth elements in benchmarking.

Move batch iterator code to public svs

f5b3693

ibhati requested review from mihaic and aguerreb December 13, 2024 17:43

aguerreb approved these changes Dec 13, 2024

Merge branch 'main' into ib/batch-iterator

210a4da

ibhati merged commit 9401ea9 into main Dec 14, 2024
7 checks passed

ibhati deleted the ib/batch-iterator branch December 14, 2024 02:40
