optimize modulo calculation #214

Open
wants to merge 8 commits into
base: master

Conversation

chris-ha458
Contributor

changing this order will make it much faster especially for larger arrays

changing this order will make it much faster especially for larger arrays
@chris-ha458
Contributor Author

re : #212

How do you want me to handle xxhash importing?

@chris-ha458
Contributor Author

I'll focus on general minhash improvements here instead of alternative hashing schemes for this pr

@ekzhu ekzhu self-requested a review August 7, 2023 04:27
@ekzhu
Owner

ekzhu commented Aug 7, 2023

changing this order will make it much faster especially for larger arrays

Thanks! Does this change lead to different hash values?

@ekzhu
Owner

ekzhu commented Aug 7, 2023

re : #212

How do you want me to handle xxhash importing?

See my comment in: #212 (comment).

We let users choose their hash function through MinHash's init parameter.

@ekzhu
Owner

ekzhu commented Aug 7, 2023

I'll focus on general minhash improvements here instead of alternative hashing schemes for this pr

Thanks! I think as a next step, we can think about a good abstraction for users to specify the permutation calculation.

@chris-ha458
Contributor Author

I have been working on this on a private branch (as well as another implementation with different project goals), but I am having some difficulty making the changes while respecting the needs and goals of this repo.
My thought was to show you the modified version not as commits to minhash.py but as a separate file, minhashX.py or similar.

Do you prefer that I still make the commits directly on minhash.py, or should I show it to you separately?
(My goal is still to bring most if not all changes into minhash.py eventually.)

@ekzhu
Owner

ekzhu commented Aug 7, 2023

I have been working on this on a private branch (as well as another implementation with different project goals), but I am having some difficulty making the changes while respecting the needs and goals of this repo. My thought was to show you the modified version not as commits to minhash.py but as a separate file, minhashX.py or similar.

Do you prefer that I still make the commits directly on minhash.py, or should I show it to you separately? (My goal is still to bring most if not all changes into minhash.py eventually.)

Thanks for your contribution. I think whichever way works for you is best. Please take my comments as suggestions only.

These shouldn't diverge, but they do.
Although it is meaningful to do the intermediate calculations with wider dtypes, the final hashes should be stored and operated on with the same number of bits as the original hash (32 bits).
@chris-ha458
Contributor Author

I don't know why I've been operating on the assumption that dividend mod Mersenne prime = dividend bitwise_and Mersenne prime.
It looks like it might be right, but at the very least, replacing the mod with a bitwise AND causes tests to fail.
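
A small counterexample makes the failure concrete. This is only a sketch: the modulus 2**61 - 1 is assumed here (it matches the "factor of 8" remark later in the thread), and the input values are picked by hand for illustration.

```python
import numpy as np

# Assumed modulus: the Mersenne prime 2**61 - 1.
M = np.uint64((1 << 61) - 1)

# Three hand-picked inputs: below M, equal to M, and above M.
n = np.array([5, (1 << 61) - 1, (1 << 61) + 4], dtype=np.uint64)

by_mod = np.mod(n, M)          # true remainder: [5, 0, 5]
by_and = np.bitwise_and(n, M)  # low 61 bits only: [5, 2**61 - 1, 4]

# The two agree only while n < M.
print(by_mod, by_and)
```

The bitwise AND keeps the low 61 bits and discards the high bits, whereas the remainder has to add those high bits back in, which is why the two results differ once n reaches M.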

anyway, even without that assumption there are improvements to be made.

Since we start with sha1_hash32, which only has 32 bits, and ensure that the resulting numbers have 32 bits or fewer (np.bitwise_and with maxhash), there is no need for the numbers to stay np.uint64 after the permutations have been done.

I think I can also get away with having the a, b permutation arrays as np.uint64, but I put them into separate commits.

The initial value/mask can definitely be np.uint32 as well.
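
The claim that the masked values fit in 32 bits can be checked directly. The following is a minimal sketch, assuming `_mersenne_prime` is 2**61 - 1 and `_max_hash` is 2**32 - 1; the random `a`, `b` and the stand-in hash value `hv` are illustrative, not taken from the PR.

```python
import numpy as np

_mersenne_prime = np.uint64((1 << 61) - 1)  # assumed value
_max_hash = np.uint64((1 << 32) - 1)        # assumed value

rng = np.random.default_rng(1)
a = rng.integers(1, int(_mersenne_prime), size=128, dtype=np.uint64)
b = rng.integers(0, int(_mersenne_prime), size=128, dtype=np.uint64)
hv = np.uint64(0xDEADBEEF)  # stand-in for a 32-bit hash value

# Intermediate arithmetic stays in uint64 (wrapping on overflow).
phv64 = np.bitwise_and(np.mod(a * hv + b, _mersenne_prime), _max_hash)

# After masking, every value is <= 2**32 - 1, so the narrowing cast is lossless.
phv32 = phv64.astype(np.uint32)
```

Only the stored result is narrowed here; whether `a` and `b` themselves can be narrowed is a separate question that this sketch does not address.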
maxhash appropriate sized datatype
@chris-ha458
Contributor Author

As you can see, this is very rudimentary. There are some assumptions that are practical yet imperfect.
One is the assumption that datatypes will be 32-bit. You have said that you would like the user to be able to choose.
How would you like me to implement that option?

Owner

@ekzhu ekzhu left a comment

Looks good! I am wondering if we can have a "playground" inside test to verify the equivalence of different permutation code, e.g., that the old version and the new one produce the same hash values.
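
One possible shape for such a playground is sketched below. The helper names (`perm_old`, `perm_new`, `same_hashes`) and the constant values are hypothetical; the candidate shown simply replaces the overloaded operators with explicit ufunc calls, one of the minor changes discussed in this thread.

```python
import numpy as np

_mersenne_prime = np.uint64((1 << 61) - 1)  # assumed value
_max_hash = np.uint64((1 << 32) - 1)        # assumed value

def perm_old(hv, a, b):
    # Reference: operators for the arithmetic, then mod and mask.
    return np.bitwise_and(np.mod(a * hv + b, _mersenne_prime), _max_hash)

def perm_new(hv, a, b):
    # Candidate: the same arithmetic through explicit ufunc calls.
    linear = np.add(np.multiply(a, hv), b)
    return np.bitwise_and(np.mod(linear, _mersenne_prime), _max_hash)

def same_hashes(reference, candidate, trials=100, num_perm=128, seed=0):
    """Return True if both functions agree on `trials` random inputs."""
    rng = np.random.default_rng(seed)
    for _ in range(trials):
        a = rng.integers(1, int(_mersenne_prime), size=num_perm, dtype=np.uint64)
        b = rng.integers(0, int(_mersenne_prime), size=num_perm, dtype=np.uint64)
        hv = np.uint64(rng.integers(0, 1 << 32))
        if not np.array_equal(reference(hv, a, b), candidate(hv, a, b)):
            return False
    return True
```

Any new permutation variant can then be passed as `candidate` and compared against `perm_old` before it is considered backward compatible.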

@ekzhu
Owner

ekzhu commented Aug 7, 2023

We can focus on the performance improvement over existing permutation calculation in this PR. We can open a separate PR for customization and decide there.

@chris-ha458
Contributor Author

Focusing on performance improvements sounds nice.

If these changes prove fruitful, we can explore customization options as necessary.

As for the playground, that sounds like a good idea.

The text-dedup repo uses the pinecone dataset to measure precision, recall, F1, and runtime.

It might be a better idea to pause performance improvements until we have a good benchmark to compare with.

That is a much larger effort though, and I don't know how you would like to proceed. Do you have any thoughts or pointers on how to implement a benchmark here?

why use the mersenne prime if we are not going to use the mersenne trick?
@chris-ha458
Contributor Author

so. this needs some explanation.

Mersenne primes are used in the first place because they have a very useful property.
If M = 2^k - 1 is prime (a Mersenne prime), the following holds:
n ≡ (n & M) + (n >> k) (mod M), where & is bitwise AND and >> is a bitwise right shift. The right-hand side can still be M or larger, so it has to be reduced again (by another fold or by subtracting M) until it is below M to give exactly n mod M.
When the "Universal Classes of Hash Functions" paper was written, modulo was orders of magnitude slower than bitwise operations, so computing the left-hand side directly was much slower than computing the right-hand side.

Thanks to modern vectorized instructions and byte-based primitives, this is less true.
Most of my testing showed positive results, but whether that holds in general and under various conditions remains to be seen (the playground / benchmark idea would be useful here).
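
The trick described above can be sketched in NumPy as follows. This is not the code from the PR; the function name `mersenne_mod` and the choice k = 61 are assumptions made for illustration.

```python
import numpy as np

K = 61                       # assumed exponent; 2**61 - 1 is a Mersenne prime
M = np.uint64((1 << K) - 1)

def mersenne_mod(n):
    """Reduce a uint64 array modulo M = 2**K - 1 without a division."""
    n = np.asarray(n, dtype=np.uint64)
    # Fold: low K bits plus the remaining high bits.
    folded = np.bitwise_and(n, M) + np.right_shift(n, np.uint64(K))
    # For 64-bit inputs the high part is at most 7, so folded <= M + 7
    # and a single conditional subtraction finishes the reduction.
    return np.where(folded >= M, folded - M, folded)

edge_cases = np.array([0, (1 << 61) - 2, (1 << 61) - 1, 1 << 61, (1 << 64) - 1],
                      dtype=np.uint64)
print(np.array_equal(mersenne_mod(edge_cases), np.mod(edge_cases, M)))
```

The single conditional subtraction is enough only because the inputs are limited to 64 bits; wider inputs would need repeated folding.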

If you feel that this code is too ugly, unperformant, unmaintainable, etc that is 100% fine.

Then my argument becomes: in that case there is no reason to use Mersenne primes. We can use a larger prime,
(2**64 - 59) for instance. We would get exactly the same performance as before and recover the range lost because the Mersenne prime modulus is smaller than it could be by a factor of 8.

@ekzhu
Owner

ekzhu commented Aug 9, 2023

Focusing on performance improvements sounds nice.

If these changes prove fruitful, we can explore customization options as necessary.

As for the playground, that sounds like a good idea.

The text-dedup repo uses the pinecone dataset to measure precision, recall, F1, and runtime.

It might be a better idea to pause performance improvements until we have a good benchmark to compare with.

That is a much larger effort though, and I don't know how you would like to proceed. Do you have any thoughts or pointers on how to implement a benchmark here?

We have a microbenchmark for MinHash in the benchmark directory, which can be further improved with more realistic workloads -- much like those you have been using. https://github.com/ekzhu/datasketch/blob/master/benchmark/sketches/minhash_benchmark.py

@ekzhu
Owner

ekzhu commented Aug 9, 2023

Then my argument becomes: in that case there is no reason to use Mersenne primes. We can use a larger prime,
(2**64 - 59) for instance. We would get exactly the same performance as before and recover the range lost because the Mersenne prime modulus is smaller than it could be by a factor of 8.

This is an interesting case to make. I think we can benefit from a simple test module in test/test_hashing.py to play around with various hashing and modulo calculation.

As for the code change, I think if using this new trick instead of np.mod helps with performance, then it can be the default. I am mostly worried about backward compatibility.

@chris-ha458
Contributor Author

As for the code change, I think if using this new trick instead of np.mod helps with performance, then it can be the default. I am mostly worried about backward compatibility.

  1. np.uint32 change -> numerically there is complete backward compatibility, but because the type changes, compatibility is broken in practice (especially when loading a previously pickled minhash and using it). The possible exception is changing the type of max_hash, but that is redundant since we recast back to np.uint64 anyway.
  2. Mersenne trick -> if we decide to keep np.uint64, there is no backward compatibility issue. The result is bit-for-bit the same, including the type. The performance gain is marginal, though (this trick was developed before modern CPUs or autovectorization over data structures).
  3. (Not implemented yet) keeping np.mod but using larger primes -> although the datatypes stay compatible, the hash values are not preserved, even when the seed is kept the same. It would not be backward compatible.
  4. Changing minor ops (things like using np.full instead of np.ones, or changing overloaded ops to explicit function calls) -> bit-for-bit the same, with marginal performance benefits (it does skip some Python dispatches, but those are really fast anyway).
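
As an illustration of the fourth item, the two ways of building the initial hash-value array can be compared directly. This is a sketch: the initialisation pattern with `np.ones` and the value of `_max_hash` are assumed here rather than quoted from minhash.py.

```python
import numpy as np

_max_hash = np.uint64((1 << 32) - 1)  # assumed value
num_perm = 128

# Two steps: allocate an array of ones, then multiply every element.
via_ones = np.ones(num_perm, dtype=np.uint64) * _max_hash

# One step: allocate and fill with the target value directly.
via_full = np.full(num_perm, _max_hash, dtype=np.uint64)

print(np.array_equal(via_ones, via_full), via_ones.dtype == via_full.dtype)
```

Both arrays hold the same values with the same dtype, so a change of this kind does not affect stored hashes.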

I have thought about finding a way to preserve backward compatibility and performance through selections, but it seems very difficult. It is not that the technical side is difficult, but the code becomes very difficult to read and understand what is happening.

I think that is why I return to my original proposal: we keep the backward-compatible minhash.py as it is and add something like minhash_faster.py (I don't insist on the name; it could be minhash_incompatible.py or anything else) that aggressively applies accelerations in both calculations and datatypes, without backward compatibility as a goal.

Meanwhile, I will try to implement the test/test_hashing.py you discuss.
It would capture the core computation of the permutation process
phv = np.bitwise_and(np.mod((a * hv + b), _mersenne_prime), _max_hash)
and change each computation and datatypes and compare their numerical results and speed.
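
A minimal harness along these lines might look like the sketch below. The function names, the constant values, and the single `uint32_output` variant are assumptions for illustration; timings depend on the machine, so no figures are claimed.

```python
import timeit
import numpy as np

_mersenne_prime = np.uint64((1 << 61) - 1)  # assumed value
_max_hash = np.uint64((1 << 32) - 1)        # assumed value

def baseline(hv, a, b):
    # The core permutation step quoted above.
    return np.bitwise_and(np.mod((a * hv + b), _mersenne_prime), _max_hash)

def uint32_output(hv, a, b):
    # Variant: same arithmetic, result stored as 32-bit integers.
    return baseline(hv, a, b).astype(np.uint32)

def compare(variants, num_perm=256, repeat=200, seed=0):
    """Map each variant name to (matches baseline, output dtype, seconds)."""
    rng = np.random.default_rng(seed)
    a = rng.integers(1, int(_mersenne_prime), size=num_perm, dtype=np.uint64)
    b = rng.integers(0, int(_mersenne_prime), size=num_perm, dtype=np.uint64)
    hv = np.uint64(rng.integers(0, 1 << 32))
    reference = baseline(hv, a, b)
    report = {}
    for name, fn in variants.items():
        out = fn(hv, a, b)
        same = np.array_equal(reference, out)  # compares values, not dtypes
        seconds = timeit.timeit(lambda: fn(hv, a, b), number=repeat)
        report[name] = (same, out.dtype, seconds)
    return report

report = compare({"baseline": baseline, "uint32 output": uint32_output})
for name, (same, dtype, seconds) in report.items():
    print(f"{name}: same values={same}, dtype={dtype}, {seconds:.4f}s")
```

Reporting value equality and dtype separately keeps the distinction made earlier in the thread between numerical compatibility and type compatibility visible.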

@ekzhu
Owner

ekzhu commented Aug 17, 2023

Thanks for your input. This is very valuable. I think we can offer a simple opt-in option in the MinHash constructor (e.g., fast_perm: bool = False). When it is True, it uses all the advanced changes you proposed, including a larger prime. For the original code path, we simply apply all the backward-compatible changes that improve performance.

This way there are only two code paths: (1) one that aggressively adopts the latest permutation trick without worrying about backward compatibility; and (2) the classic minhash permutation algorithm, backward compatible but performance optimized.

In the documentation we can explain the option as breaking changes for older minhashes -- "use with care".

@ekzhu
Owner

ekzhu commented Nov 29, 2023

@chris-ha458 are you still interested in this PR?


2 participants