Implement order preserving transform for inner product search #25
Merged
As described in this paper, this is a mechanism for reducing the maximum inner-product search problem to a nearest-neighbor search problem. It adds an extra dimension to each vector so that the search can run over a distance that preserves the triangle inequality.
Per the paper:
This extra dimension is equal to `sqrt(max_norm**2 - norm(x)**2)` for each vector `x` in the dataset, where `max_norm` is the maximum norm of all vectors in the dataset.

One thing to note about the implementation in this library: Voyager is not meant exclusively for batch use and allows new items to be added after the index is initially built, so the full dataset is unknown at build time and there is no way of knowing what its maximum norm will be. As such, we simply calculate the extra dimension based on the data seen so far. This means that if you add a new vector with a larger norm than anything seen so far, the accuracy of the index will suffer. This is similar to the approach taken by Vespa; see their blog post on the matter here. If you have a priori knowledge of your dataset, it is recommended that you insert the item with the largest norm first.
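For illustration, here is a minimal NumPy sketch of the transform in the batch case, where the whole dataset (and therefore `max_norm`) is known up front. It is not the code in this PR and does not use the Voyager API. The helper names `augment_dataset` and `augment_query` are hypothetical, and padding the query with a zero in the extra dimension is an assumption taken from the usual formulation of this reduction rather than from the description above.

```python
import numpy as np


def augment_dataset(vectors):
    """Append sqrt(max_norm**2 - norm(x)**2) as an extra dimension to each row."""
    norms = np.linalg.norm(vectors, axis=1)
    max_norm = norms.max()
    # Clip at zero to guard against tiny negative values caused by rounding.
    extra = np.sqrt(np.maximum(max_norm**2 - norms**2, 0.0))
    return np.hstack([vectors, extra[:, None]])


def augment_query(query):
    """Assumption: the query gets a zero in the extra dimension."""
    return np.append(query, 0.0)


rng = np.random.default_rng(0)
data = rng.normal(size=(1000, 16))
query = rng.normal(size=16)

# Ground truth: the item with the largest inner product against the query.
best_by_inner_product = int(np.argmax(data @ query))

# After the transform, every item has norm max_norm, so the squared distance
# to the query is norm(q)**2 + max_norm**2 - 2 * dot(q, x), and minimizing
# distance is the same as maximizing the inner product.
augmented = augment_dataset(data)
distances = np.linalg.norm(augmented - augment_query(query), axis=1)
best_by_distance = int(np.argmin(distances))
```

In this sketch `best_by_distance` and `best_by_inner_product` pick the same item because `max_norm` is computed over the full dataset. When items are added incrementally, as described above, a later vector with a larger norm breaks the equal-norm property for the items inserted before it.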
Addresses #19