Replies: 8 comments 4 replies
-
Hello Jaume. Glad gaoya is helping you. The short answer is: yes, it is feasible.
-
Many thanks for your explanation! I will probably explore this possibility.
-
Hi again, I finally implemented a distributed index, storage without signatures, and computing the connected components of duplicates with Union-Find. My code probably breaks the design of the library a little (I'm not well versed in Rust) or goes beyond its scope, but if there is any change you are interested in, I'm happy to contribute.
-
Hi @ZJaume. I am not sure where you committed. Do you have a public fork?
-
Sorry, I don't know why the links were wrong. The commits are in my fork,
-
Hi @ZJaume. I finally found time to look at your branch. You may have noticed I have a clustering folder with a clustering algorithm implemented. In fact, I use a variation of clusterer_parallel.rs with great success. The algorithm proceeds through all points and calls
-
A random idea came to me: to deduplicate a very large dataset using MinHash, there is no need for 1 TB of RAM.
Step 1 - create MinHash signatures and store them in a file in some format
Step 2 - create bands one at a time and store each band on disk
Step 3 - iterate through every band and call the union-find operation on every item
I didn't specify how to store the data on disk, but I don't think it matters much. With serde it is very easy to serialize and deserialize data in virtually any format; bincode works very well. So instead of using a machine with 1 TB of RAM, it would be possible to deduplicate on a machine with a small fraction of the RAM and a bigger disk. As I write this, I realize steps 2 and 3 can be done sequentially, one band at a time.
For a one-time deduplication there is no need to keep all the bands around. Working with disk files would be slower than working with RAM, but it would be much more cost effective.
-
I have just found your project. The original links you gave me were not correct.
-
Hi,
First of all, many thanks for this library, it is helping me a lot. I'm using it to do near-deduplication with MinHash on very large collections of text (tens of TB, or even a hundred TB, compressed size) and I'm constrained by the amount of RAM available, so I'm making some modifications to address this. The first one was to have index objects that store only one of the bands, so I could distribute the index across different machines. But now I'm wondering if it would be possible to avoid storing all the signatures in `id_signatures: HashMap`, therefore storing only ids. As far as I understood from the code, to query a document and return matches only the band is needed, and `id_signatures` would only be needed if returning the similarity is requested or if I need to do queries by id. Am I right? I'm not asking you to implement it, just wanted to double-check that this is feasible.
Thanks in advance,
Jaume