We read every piece of feedback, and take your input very seriously.
To see all available qualifiers, see our documentation.
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
functools.lru_cache
Levenshtein distance algorithm can't be vectorized. So the calculation would be very slow in large data.
An idea to speed up is using the cache. Use the accumulate case to show the cache.
accumulate
def accumulate(x): return sum(range(x))
accumulate(100000000) needs 5s no matter if it is the first time running or the second time running in my local without lru_cache.
accumulate(100000000)
lru_cache
from functools import lru_cache @lru_cache def accumulate(x): return sum(range(x))
After adding lru_cache, the first running accumulate(100000000) still needs 5s. But the second time running accumulate(100000000) needs 0s.
closed to seatgeek/thefuzz#42
The text was updated successfully, but these errors were encountered:
For large amounts of data you should not directly call the corresponding scorer:
from rapidfuzz.distance import Levenshtein for choice in choices: Levenshtein.distance(query, choice)
instead you should use the process module:
from rapidfuzz import process from rapidfuzz.distance import Levenshtein process.cdist([query], choices, scorer=Levenshtein.distance)
I do not think duplicates of both query and choice are very common. So chances are they would be evicted from the cache before you reach a duplicate.
query
choice
Sorry, something went wrong.
No branches or pull requests
Levenshtein distance algorithm can't be vectorized.
So the calculation would be very slow in large data.
An idea to speed up is using the cache.
Use the
accumulate
case to show the cache.accumulate(100000000)
needs 5s no matter if it is the first time running or the second time running in my local withoutlru_cache
.After adding
lru_cache
, the first runningaccumulate(100000000)
still needs 5s.But the second time running
accumulate(100000000)
needs 0s.closed to seatgeek/thefuzz#42
The text was updated successfully, but these errors were encountered: