Add a batch iterator for the Vamana Indexes #64

Merged 2 commits on Dec 14, 2024

@ibhati ibhati commented Dec 13, 2024

This implements the groundwork for a batch iterator for the low-level Vamana indexes. Conceptually, the batch iterator allows searches to be restarted, returning a new batch of k nearest neighbors that have not yet been yielded.

The batch iterator provides the C++ iterator interfaces begin() and end() over a buffer of new IDs. To yield new neighbors, search is effectively restarted with the search window size and search buffer capacity incremented by k, ensuring at least k new IDs can be obtained that were not part of previous searches. Filtering is performed post-search to ensure that only unique IDs are available on the next calls to begin() and end().
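The restart-and-filter scheme described above can be sketched in a few lines. This is a toy illustration, not the code in this PR: `toy_search` is a brute-force stand-in for the Vamana search, and `ToyBatchIterator`, its members, and the 1-D `float` dataset are all hypothetical names chosen for the example.

```cpp
#include <algorithm>
#include <cassert>
#include <cmath>
#include <cstddef>
#include <cstdint>
#include <unordered_set>
#include <vector>

// Hypothetical stand-in for an index search: returns the IDs of the `window`
// points closest to `query`, nearest first. A real Vamana search would walk
// the graph instead of scanning every point.
inline std::vector<std::uint32_t> toy_search(
    const std::vector<float>& points, float query, std::size_t window) {
    std::vector<std::uint32_t> ids(points.size());
    for (std::size_t i = 0; i < ids.size(); ++i) {
        ids[i] = static_cast<std::uint32_t>(i);
    }
    std::stable_sort(ids.begin(), ids.end(), [&](std::uint32_t a, std::uint32_t b) {
        return std::abs(points[a] - query) < std::abs(points[b] - query);
    });
    ids.resize(std::min(window, ids.size()));
    return ids;
}

// Toy batch iterator (assumed design): every call to next() restarts the
// search with the window grown by k, then keeps only IDs not yielded before.
class ToyBatchIterator {
  public:
    ToyBatchIterator(const std::vector<float>& points, float query, std::size_t k)
        : points_(points), query_(query), k_(k) {
        assert(k_ > 0);
        next();  // kick-start the first batch on construction
    }

    void next() {
        window_ += k_;  // a larger window leaves room for up to k unseen IDs
        batch_.clear();
        for (auto id : toy_search(points_, query_, window_)) {
            // Post-search filtering: insert() reports whether the ID is new.
            if (yielded_.insert(id).second) {
                batch_.push_back(id);
            }
        }
    }

    // Iterator interface over the buffer of new IDs.
    std::vector<std::uint32_t>::const_iterator begin() const { return batch_.begin(); }
    std::vector<std::uint32_t>::const_iterator end() const { return batch_.end(); }
    std::size_t size() const { return batch_.size(); }

  private:
    const std::vector<float>& points_;  // caller must keep the dataset alive
    float query_;
    std::size_t k_;
    std::size_t window_ = 0;
    std::vector<std::uint32_t> batch_;
    std::unordered_set<std::uint32_t> yielded_;
};
```

In this sketch an empty batch after next() means the dataset has been exhausted, which is one possible way to surface the exhaustion check listed under the remaining tasks.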

There are some low-hanging performance wins to be had.

  • The current implementation basically restarts search every single time. Kick-starting search with previously yielded neighbors is probably a good idea, but is currently lacking an API on the index side.
  • The search scratchspace is currently not cached between runs. This means that calls to next() must re-fix the query and allocate scratchspace as needed. The scratchspace is currently lacking an API to enable adapting to new search parameters. Once this is in place, caching should be straightforward to implement.

Some other thoughts

Single search for the dynamic index is awkward. Single search uses the provided scratchspace, and results are extracted from this scratchspace post-search. However, the dynamic index uses different internal and external IDs with different bit-widths. This makes it impossible to reuse the search buffer to store translated IDs. Currently, the iterator needs to detect whether ID translation is required and perform the translation manually. Options are:

  1. Augment the scratchspace with room for translated neighbors. I don't like this because (A) this would not be used for batch searches and (B) might result in excess moving around of data which makes me sad.
  2. Use some kind of lazy iterator to translate the IDs as they are extracted from the scratchspace. I think this approach is cool, but requires the scratchspace to contain a reference to the ID translation struct, which does not feel like the move to me.
  3. Require single-search to provide a destination span into which results will be written, and make our current version of "search" for single queries more internal.

Of these, I like 3 the most as it has nice symmetry with our existing batch search functions.
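A rough sketch of what option 3 could look like, under stated assumptions: the caller hands in a destination buffer of external IDs, the existing scratchspace-based search stays internal, and translation happens once while copying results out. Every name here (`IdSpan`, `IdTranslator`, `internal_search`, `search_into`) is hypothetical, the ID widths are picked only to show the mismatch, and `IdSpan` is a minimal stand-in for `std::span<ExternalId>` so the example does not depend on C++20.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <unordered_map>
#include <vector>

using InternalId = std::uint32_t;  // assumed narrower internal ID
using ExternalId = std::uint64_t;  // assumed wider external ID

// Minimal stand-in for std::span<ExternalId>.
struct IdSpan {
    ExternalId* data;
    std::size_t size;
};

// Hypothetical translation table from internal to external IDs.
struct IdTranslator {
    std::unordered_map<InternalId, ExternalId> to_external;
    ExternalId operator()(InternalId id) const { return to_external.at(id); }
};

// Internal search: fills the scratch buffer with internal IDs, nearest first.
// The results are hard-coded here; a real search would traverse the graph.
inline void internal_search(std::vector<InternalId>& scratch) {
    scratch = {2, 0, 1};
}

// Public single-query search: translated results go into the caller's
// destination, so the scratch buffer never has to hold external IDs.
// Returns the number of results written.
inline std::size_t search_into(
    const IdTranslator& translate, std::vector<InternalId>& scratch, IdSpan dst) {
    assert(dst.data != nullptr || dst.size == 0);
    internal_search(scratch);
    const std::size_t n = std::min(scratch.size(), dst.size);
    for (std::size_t i = 0; i < n; ++i) {
        dst.data[i] = translate(scratch[i]);
    }
    return n;
}
```

The shape mirrors a batch search that writes each query's results into a caller-provided output, which is the symmetry that makes this option attractive.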

Remaining Tasks

  • Progressive output generation for benchmarking (saving results as they become available).
  • Add documentation page for the iterator and schedules.
  • Add APIs for the iterator to assign a new query and reset the internal state.
  • Detect when an iterator has exhausted the elements in the index, provide an API for detecting this, and use this information to short-circuit searches that would return no neighbors anyway.
  • Allow the iterator to be constructed without kick-starting an initial search.
  • Enable Python bindings support for the batch iterator.
  • Eager error checking for number of groundtruth elements in benchmarking.

@ibhati ibhati requested review from mihaic and aguerreb December 13, 2024 17:43
@ibhati ibhati merged commit 9401ea9 into main Dec 14, 2024
7 checks passed
@ibhati ibhati deleted the ib/batch-iterator branch December 14, 2024 02:40