Replies: 8 comments 4 replies
-
Hello Jaume. Glad gaoya is helping you. The short answer is: yes, it is feasible.
-
Many thanks for your explanation! I will probably explore this possibility.
-
Hi again, I finally implemented a distributed index, storage without signatures, and computing the connected components of duplicates with Union-Find. My code probably breaks the design of the library a little (I'm not well versed in Rust) or goes beyond its scope, but if there is any change you are interested in, I'm happy to contribute.
-
Hi @ZJaume. I am not sure where you committed. Do you have a public fork?
-
Sorry, I don't know why the links were wrong. The commits are in my fork,
-
Hi @ZJaume. I finally found time to look at your branch. You may have noticed I have a clustering folder with a clustering algorithm implemented. In fact, I use a variation of clusterer_parallel.rs with great success. The algorithm proceeds through all points and calls
-
A random idea came to me: to deduplicate a very large dataset using MinHash, there is no need for 1 TB of RAM.
Step 1 - create MinHash signatures and store them in a file in some format
Step 2 - create bands one at a time and store each band on disk
Step 3 - iterate through every band and call the union-find operation on every item
I didn't specify how to store the data on disk, but I don't think it matters much. With serde it is very easy to serialize and deserialize data in virtually any format; bincode works very well. So instead of using a machine with 1 TB of RAM, it would be possible to deduplicate on a machine with a small fraction of the RAM and a bigger disk. As I write this, I realize steps 2 and 3 can be done sequentially, one band at a time.
For a one-time deduplication there is no need to keep all the bands around. Working with disk files would be slower than working with RAM, but it would be much more cost effective.
-
I have just found your project. The original links you gave me were not correct.
-
Hi,
First of all, many thanks for this library, it is helping me a lot. I'm using it to do near-deduplication with MinHash on very large collections of text (tens of TB, or even a hundred TB, compressed size) and I'm constrained by the amount of RAM available, so I'm making some modifications to address this. The first one was to have index objects that store only one of the bands, so I could distribute the index across different machines. But now I'm wondering if it would be possible to avoid storing all the signatures in `id_signatures: HashMap`, therefore storing only ids. As far as I understood from the code, to query a document and return matches only the band is needed, and `id_signatures` would only be needed if returning the similarity is requested or if I need to do queries by id. Am I right? I'm not asking you to implement it, just wanted to double-check that this is feasible.
Thanks in advance,
Jaume