Implement order preserving transform for inner product search #25

Merged
dylanrb123 merged 14 commits into main from order-preserving-transform
Oct 6, 2023

Conversation

dylanrb123
Contributor

@dylanrb123 dylanrb123 commented Sep 29, 2023

As described in this paper, this is a mechanism for reducing the maximum inner-product (MIP) search problem to a nearest-neighbor (NN) search problem. It adds an extra dimension to each vector so that the vectors can be compared by Euclidean distance, for which the triangle inequality holds.

Per the paper:

The triangle inequality does not hold between vectors x,
yi, and yj when an inner product compares them, as is the
case in MIP. Many efficient search data structures rely on
the triangle inequality, and if MIP can be transformed to
NN with its Euclidian distance, these data structures would
immediately become applicable. Our first theorem states
that MIP can be reduced to NN by having an Euclidian
metric in one more dimension than the original problem.

This extra dimension is equal to sqrt(max_norm**2 - norm(x)**2) for each vector x in the dataset, where max_norm is the maximum norm of all vectors in the dataset.
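To make the formula concrete, here is a minimal sketch in plain Python. It is not the code from this PR: the helper names (`augment_dataset`, `augment_query`) are hypothetical, and padding the query with a zero in the extra dimension is an assumption taken from the usual form of this reduction rather than from the description above. The sketch checks that the best match by inner product in the original space is also the nearest neighbor in the augmented space.

```python
import math


def norm(v):
    """Euclidean (L2) norm of a vector given as a list of floats."""
    return math.sqrt(sum(c * c for c in v))


def inner(a, b):
    return sum(x * y for x, y in zip(a, b))


def euclidean(a, b):
    return math.sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


def augment_dataset(vectors):
    """Append sqrt(max_norm**2 - norm(x)**2) to every vector x.

    Hypothetical helper for illustration; assumes the whole dataset is known.
    """
    max_norm = max(norm(v) for v in vectors)
    augmented = []
    for v in vectors:
        # max(0.0, ...) guards against tiny negative values caused by rounding.
        extra = math.sqrt(max(0.0, max_norm ** 2 - norm(v) ** 2))
        augmented.append(list(v) + [extra])
    return augmented


def augment_query(q):
    """Assumption: the query gets a zero in the extra dimension."""
    return list(q) + [0.0]


data = [[1.0, 0.0], [0.0, 3.0], [2.0, 2.0]]
query = [1.0, 0.2]

# Best match by inner product in the original space.
by_inner_product = max(range(len(data)), key=lambda i: inner(query, data[i]))

# Nearest neighbor by Euclidean distance in the augmented space.
augmented = augment_dataset(data)
aug_query = augment_query(query)
by_distance = min(range(len(data)), key=lambda i: euclidean(aug_query, augmented[i]))
```

After the transform every stored vector has norm `max_norm`, so the squared distance to the query is `norm(q)**2 + max_norm**2 - 2 * inner(q, x)`, which shrinks exactly as the inner product grows. That is why both rankings pick the same item here.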

One thing to note about the implementation in this library: Voyager is not meant exclusively for batch use and allows new items to be added after the index is initially built, so the maximum norm of the dataset cannot be known at build time. As such, we simply calculate the extra dimension based on the data we have seen so far. This means that if you add a new vector with a larger norm than anything seen so far, the accuracy of the index will suffer. This is similar to the approach taken by Vespa; see their blog post on the matter here. If you have a priori knowledge of your dataset, it is recommended that you insert the item with the largest norm first.
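The effect of insertion order can be illustrated with a small sketch. This is not Voyager's implementation: `RunningTransform` is a hypothetical class, and it assumes that the running maximum norm is simply updated on each insert and that vectors stored earlier are never re-transformed.

```python
import math


class RunningTransform:
    """Hypothetical helper that only knows the largest norm seen so far."""

    def __init__(self):
        self.max_norm = 0.0

    def add(self, v):
        n = math.sqrt(sum(c * c for c in v))
        # Assumption: the running maximum is updated, earlier items are left alone.
        self.max_norm = max(self.max_norm, n)
        extra = math.sqrt(max(0.0, self.max_norm ** 2 - n ** 2))
        return list(v) + [extra]


def stored_norms(vectors):
    """Norm of each augmented vector, in insertion order."""
    transform = RunningTransform()
    return [math.sqrt(sum(c * c for c in transform.add(v))) for v in vectors]


small, large = [1.0, 0.0], [0.0, 3.0]

# Largest norm first: every stored vector ends up with the same norm.
largest_first = stored_norms([large, small])

# Largest norm last: the earlier vector keeps a stale extra dimension.
largest_last = stored_norms([small, large])
```

With the largest item inserted first, both augmented vectors have norm 3, which is what the reduction needs. With the largest item inserted last, the stored norms are 1 and 3, so Euclidean distance in the augmented space no longer ranks items strictly by inner product.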

Addresses #19

@dylanrb123 dylanrb123 marked this pull request as ready for review October 4, 2023 02:45
Review comment on cpp/TypedIndex.h (outdated, resolved)
@dylanrb123 dylanrb123 requested a review from psobot October 5, 2023 21:27
@dylanrb123 dylanrb123 merged commit ef07b9a into main Oct 6, 2023
52 checks passed
@dylanrb123 dylanrb123 deleted the order-preserving-transform branch October 6, 2023 13:37