Also for images, files, etc.? #6
Comments
That's an interesting thought. Chapter Three of Mining Massive Datasets gives some great insights into generalized locality sensitive hashing functions, and provides quick guides on how one can construct such hashes for objects in different metric spaces. For example, instead of composing a minhash for a world in which we use Jaccard similarity as the "distance" measure, we can compose hashes for a world that uses Euclidean distance. The volume doesn't discuss LSH for images if memory serves, but it could help set the mind in that direction. I hadn't seen discussion of LSH in the context of images until you raised this question, but it seems there is work in this area [1] [2]. Features in an LSH for images could evidently be keypoints (such as SIFT provides), values from a downsampled image (such as a perceptual hash provides), or values from a convolutional filter (such as an early layer of a convnet would provide). I'm a little time-strapped at the moment as I'm finishing my dissertation, but if you try out some of these ideas I'd love to see what you come up with!
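To make the Euclidean case a bit more concrete, here is a minimal sketch of the random-projection idea: project each point onto a random line and cut that line into fixed-width buckets, so nearby points tend to land in the same bucket. The function names, the bucket width, and the toy 8-dimensional points are all made up for illustration; none of this comes from this repository, and real image features would be much higher dimensional.

```python
import math
import random


def make_euclidean_hash(dim, bucket_width, rng):
    """Build one hash function: random projection followed by bucketing."""
    # Random unit direction to project points onto.
    direction = [rng.gauss(0.0, 1.0) for _ in range(dim)]
    norm = math.sqrt(sum(d * d for d in direction))
    direction = [d / norm for d in direction]
    # Random shift so bucket boundaries are not always aligned at zero.
    offset = rng.uniform(0.0, bucket_width)

    def h(point):
        projection = sum(p * d for p, d in zip(point, direction))
        return math.floor((projection + offset) / bucket_width)

    return h


def signature(point, hashes):
    """Apply every hash function to a point to get its signature."""
    return tuple(h(point) for h in hashes)


def agreement(sig_a, sig_b):
    """Fraction of hash functions that put both points in the same bucket."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


rng = random.Random(0)  # fixed seed so the sketch is repeatable
hashes = [make_euclidean_hash(dim=8, bucket_width=4.0, rng=rng) for _ in range(200)]

base = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0]
near = [1.1] + base[1:]          # Euclidean distance 0.1 from base
far = [x + 50.0 for x in base]   # Euclidean distance of roughly 141 from base

sig_base = signature(base, hashes)
near_agreement = agreement(sig_base, signature(near, hashes))
far_agreement = agreement(sig_base, signature(far, hashes))
print(near_agreement, far_agreement)
```

The agreement rate plays the role that the fraction of matching minhash values plays for Jaccard similarity: it should be high for close points and low for distant ones, and the bucket width controls how quickly it falls off.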
@JannikZed I wanted to add another quick thought. If you're looking for a quick solution that helps you measure image similarity, you could try using PixPlot, which will give you vectorized representations of each of your input images and then visualize the images in an interactive WebGL scene [example scene]. If you "centered" the values in each of the 2048-dimensional vectors you get from PixPlot so that each dimension has the range [0, 1], then quantized the centered vectors by rounding to e.g. two decimal places, you could treat the resulting vectors as the hash signature that minhash would compose for a given image. This approach is less research oriented, but I wanted to throw it out there in case you just want to try out something ready-to-hand...
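A rough sketch of that rescale-and-round step, in case it helps. The three 4-dimensional vectors are invented stand-ins for the 2048-dimensional PixPlot output, the helper names are hypothetical, and comparing signatures position by position is my own assumption about how one might use them.

```python
def scale_columns(vectors):
    """Rescale each dimension to [0, 1] using its min and max over all vectors."""
    dims = len(vectors[0])
    lows = [min(v[i] for v in vectors) for i in range(dims)]
    highs = [max(v[i] for v in vectors) for i in range(dims)]
    scaled = []
    for v in vectors:
        row = []
        for x, lo, hi in zip(v, lows, highs):
            span = hi - lo
            # A constant dimension carries no information; map it to 0.0.
            row.append((x - lo) / span if span else 0.0)
        scaled.append(row)
    return scaled


def quantize(vectors, places=2):
    """Round every value so that nearly equal values collapse together."""
    return [tuple(round(x, places) for x in v) for v in vectors]


def matching_fraction(sig_a, sig_b):
    """Share of positions where two signatures hold the same value."""
    return sum(a == b for a, b in zip(sig_a, sig_b)) / len(sig_a)


# Invented stand-ins for per-image feature vectors.
vectors = [
    [0.0, 10.0, -1.0, 5.0],
    [1.0, 20.0, 0.0, 5.0],
    [2.0, 30.0, 1.0, 5.0],
]

signatures = quantize(scale_columns(vectors))
for sig in signatures:
    print(sig)
```

Rounding to fewer decimal places makes more positions collide, so the number of places is effectively the knob for how coarse the "hash" is.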
That sounds really good! I also found the papers you cited, along with some others that use LSH specifically for image similarity (this one sounds especially interesting: https://medium.com/@Pinterest_Engineering/detecting-image-similarity-using-spark-lsh-and-tensorflow-618636afc939)
@JannikZed that sounds great. I've been doing related work to identify copyright infringements among visual materials, and have found that much hinges on exactly the criteria by which one identifies two images as being "similar". For some models, two photographs of the same person from different angles count as "similar" images. For others, only the same photograph with two different crops or post-processing filters should be considered similar... If you haven't yet, you might want to check out Siamese neural networks, as lots of folks are doing work detecting similar images with those. I myself have an open question on an optimal non-Siamese neural network architecture for capturing similar images. If you get any good leads on this front, I'd love to hear any insights you might have! I'll also say that the current PixPlot WebGL viewer code on GitHub gets bogged down with large image collections. I've since written some shader code with which I've visualized 1M images at 60fps on a standard 2015 MacBook Pro. If you push your images through the PixPlot processing code and email me, I'd be happy to share the shader code if you want to try it on your huge image collection. I have a feeling you'll need a pretty specialized graphics card to work with 15M images at an acceptable frame rate in a browser though. That's really going big!
Hi,
I was wondering whether minhash could also be used to create a locality sensitive hash for data files such as pictures or any other binary data file.