
Create linear-search-optimized HNSW #12

Open
vadixidav opened this issue Sep 1, 2019 · 0 comments

Comments

@vadixidav
Member

Some optimizations can massively speed up HNSW by using linear search on clusters, and on NSW levels small enough that linear search beats graph traversal. This will also lower memory consumption. In the current architecture, these optimizations require a linear-time operation to:

  • Order every level so that the indices are the same
    • This allows performing incremental linear search on every level until we reach a level where the number of features is too large.
  • Cluster features below a certain level to their nearest neighbor in a higher level to permit linear search in that cluster
    • This allows performing linear search at the bottom of the graph in clusters.
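To make the intuition concrete, here is a minimal sketch (the feature type, distance metric, and function names are illustrative assumptions, not from the crate) of the brute-force scan that would replace graph traversal on a small level or cluster:

```rust
/// Illustrative distance metric: squared Euclidean distance between slices.
fn distance(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Brute-force nearest neighbor over a small slice of features.
/// On a level (or cluster) small enough, this contiguous scan beats
/// pointer-chasing graph traversal because it is branch-predictable
/// and cache-friendly.
fn linear_nearest(features: &[Vec<f32>], query: &[f32]) -> Option<usize> {
    features
        .iter()
        .enumerate()
        .min_by(|(_, a), (_, b)| {
            distance(a, query)
                .partial_cmp(&distance(b, query))
                .unwrap()
        })
        .map(|(i, _)| i)
}

fn main() {
    let features = vec![
        vec![0.0, 0.0],
        vec![1.0, 1.0],
        vec![5.0, 5.0],
    ];
    // The nearest feature to [0.9, 1.1] is features[1].
    let idx = linear_nearest(&features, &[0.9, 1.1]).unwrap();
    println!("{}", idx);
}
```

Because the candidates sit in one contiguous allocation, the scan is SIMD-friendly, which is why it can beat pointer-chasing traversal for small N.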

This will make it so that the "small worlds" in the NSW are clusters that can be searched linearly, which is incredibly fast on modern computers.

It is not possible to efficiently maintain this during operation: every insertion would require a linear-time operation to invalidate all directed edges connected to a node that must be moved (although the node itself can be moved in log(N) time). We could make an alternate version of HNSW that does allow the first optimization in log(N) by utilizing a heap to store the neighbors pointing into a given node. This wouldn't affect the performance of normal searching, but would slow insertion somewhat (although the overall speed improvement may still make insertion faster on net).
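A rough sketch of that per-node bookkeeping (all names are illustrative, not from the codebase): alongside its outgoing neighbor list, each node would keep a heap of the indices of nodes pointing at it, so moving the node only requires patching the edges recorded there rather than scanning the level for stale directed edges.

```rust
use std::cmp::Reverse;
use std::collections::BinaryHeap;

/// Hypothetical node layout for the log(N) variant.
struct Node {
    /// Ordinary outgoing neighbor list, as in HNSW today.
    out_edges: Vec<usize>,
    /// Indices of nodes with edges pointing INTO this node.
    /// `Reverse` turns std's max-heap into a min-heap.
    in_edges: BinaryHeap<Reverse<usize>>,
}

fn main() {
    let mut node = Node {
        out_edges: vec![2, 5],
        in_edges: BinaryHeap::new(),
    };
    // Nodes 7 and 3 created edges pointing at this node; record them.
    node.in_edges.push(Reverse(7));
    node.in_edges.push(Reverse(3));
    // When this node must be moved, visit only its in-neighbors
    // (in index order) and patch their edges, instead of a linear scan.
    while let Some(Reverse(src)) = node.in_edges.pop() {
        println!("patch edge from node {}", src);
    }
    let _ = node.out_edges;
}
```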

For the second optimization, it isn't clear what the semantics of insertion would be if clusters were maintained at the bottom of the HNSW for speed, but it is clear how to convert an existing HNSW into a clustered version. If we could support it, insertion into the clustered version would be game-changing, but it requires knowing how to split a cluster into an NSW once it becomes too big. One idea: store clusters and nodes at each level of the graph, and once a cluster (which is always a leaf of the graph) needs to be broken up, choose one random feature from the cluster to become a new cluster center, then separate all features into two clusters based on which of the two centers they are closest to. The performance characteristics of this approach are unknown and must be benchmarked.
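The split idea could be sketched like this (illustrative names throughout; the "random" choice is made deterministic here, taking the last member, purely to keep the example reproducible):

```rust
/// Illustrative distance metric: squared Euclidean distance between slices.
fn distance(a: &[f32], b: &[f32]) -> f32 {
    a.iter().zip(b).map(|(x, y)| (x - y) * (x - y)).sum()
}

/// Split an oversized leaf cluster in two: promote one member to be a new
/// cluster center, then reassign every feature to the nearer of the old
/// and new centers. Returns the two (center, members) pairs.
fn split_cluster(
    center: Vec<f32>,
    members: Vec<Vec<f32>>,
) -> ((Vec<f32>, Vec<Vec<f32>>), (Vec<f32>, Vec<Vec<f32>>)) {
    // The proposal picks a random member; we take the last for determinism.
    let new_center = members.last().cloned().expect("cluster must be non-empty");
    let (mut old_members, mut new_members) = (Vec::new(), Vec::new());
    for f in members {
        if distance(&f, &center) <= distance(&f, &new_center) {
            old_members.push(f);
        } else {
            new_members.push(f);
        }
    }
    ((center, old_members), (new_center, new_members))
}

fn main() {
    let members = vec![vec![0.1], vec![0.2], vec![10.0], vec![10.5]];
    let ((_, near), (new_center, far)) = split_cluster(vec![0.0], members);
    println!("{} stay, {} move to new center {:?}", near.len(), far.len(), new_center);
}
```

A real implementation would also need to decide how the new center is linked into the level above, which is exactly the open semantics question; the sketch only covers the partition step.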

We should attempt to create a static version of HNSW that is built from a dynamic version in linear time, and also a dynamic version of HNSW that contains both of the above optimizations in their dynamic form. The two resulting data structures should be benchmarked to see whether the static version has a benefit; otherwise the dynamic version should be preferred, since the memory consumption difference should be negligible and performance is then the main factor.
