Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Using pre-computed knn with preserve_neighbors #65

Open
jlmelville opened this issue Sep 17, 2022 · 5 comments
Open

Using pre-computed knn with preserve_neighbors #65

jlmelville opened this issue Sep 17, 2022 · 5 comments

Comments

@jlmelville
Copy link

Hello, is it possible to use preserve_neighbors with pre-computed k-nearest neighbors data in the distance-matrix-like form used by the likes of UMAP and t-SNE? In this format, the indices and distances are stored as separate matrices of shape n_objects, n_neighbors. The element (i, j) in the index matrix is the index of the jth nearest neighbor of i, and the equivalent element in the distance matrix is the distance between those two objects.

I (think) I can work out how to manually convert that into the edges/weights form used by pymde.Graph (various acts of melting and raveling), but I don't get the results I expect compared to passing the data directly. I think that's because when preprocess.generic.k_nearest_neighbors is called, for the graph code path, graph.k_nearest_neighbors has graph_distances=True. So another way to put it is: is there a way to pass a Graph so that it is interpreted as a distance matrix? This would be helpful because calculating the k-nearest neighbors is usually the slowest part of this sort of dimensionality reduction method.

@akshayka
Copy link
Member

@jlmelville, sorry for the late response. Yes, with PyMDE, you can use pre-computed k-nearest neighbor data, with the caveat that you'll need to manually wire up the MDE problem (instead of calling preserve_neighbors).

This notebook has an example that shows how to make a neighborhood preserving embedding from scratch. Start reading from the section called "Embeddings, from scratch", which shows how to create the k-nearest neighbor graph:

https://github.com/cvxgrp/pymde/blob/main/examples/mnist.ipynb

Then, the code for creating a neighborhood-preserving embedding starts in the section called "Including dissimilar pairs".

For more in-depth documentation:

Hope that helps!

@jlmelville
Copy link
Author

@akshayka thanks for pointing me to notebook. I think from looking at the functions called there, if I want to make my own Graph from a precomputed knn, and ignoring any complications due to maximum distances or custom weights, I would do something like the following for k = 15 (all of this stolen shamelessly from the pymde source code):

import numpy as np
import pymde
import pynndescent
import scipy.sparse as sp
from pymde.preprocess.graph import Graph

mnist = pymde.datasets.MNIST()

# assume we use pynndescent for the purposes of this example
# we just need a precomputed n_items x n_neighbors numpy array of indices from somewhere
mnist_nnd = pynndescent.NNDescent(
    mnist.data,
    n_neighbors=16,
    max_candidates=60,
).neighbor_graph

# ignore the "self" neighbors
idx = mnist_nnd[0][:, 1:]

# create the edge list
n_items = idx.shape[0]
n_neighbors = idx.shape[1]
items = np.arange(n_items)
items = np.repeat(items, n_neighbors)
edges = np.stack([items, idx.flatten()], axis=1)

# canonicalize edges
flip_idx = edges[:, 0] > edges[:, 1]
edges[flip_idx] = np.stack([edges[flip_idx][:, 1], edges[flip_idx][:, 0]], axis=1)

# create a sparse matrix
rows = edges[:, 0]
cols = edges[:, 1]
weights = np.ones(edges.shape[0], dtype=np.float32)
graph = sp.coo_matrix((weights, (rows, cols)), shape=(n_items, n_items))

# symmetrize
graph = graph + graph.T
graph = graph.tocsr()

graph = Graph(graph)

then I can use graph.edges to pass to e.g. the edges parameter of pymde.MDE. Does that seem right?

@akshayka
Copy link
Member

Something like that does look right! Were you able to get it working?

This is probably a common enough use case that we should consider including a helper function in the library to automate it.

@jlmelville
Copy link
Author

Yes thanks! I was able to make it work, although on top of a helper function to generate a Graph from nearest neighbors indices, preserve_neighbors itself has some useful functionality around e.g. initialization so refactoring that function to allow a precomputed Graph would make a big difference. Looking at it, it might involve a fair bit of extra logic or extra parameters to account for that, and I can imagine testing for any corner cases might not be worth the effort if no-one else has ever needed it.

@akshayka
Copy link
Member

Great! Thanks for reporting back.

At the very least, I will keep this issue open so others can chime in if they'd like this functionality as well.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants