
ENH: use functools.lru_cache to speed up #292

Closed

Zeroto521 opened this issue Nov 22, 2022 · 1 comment

Zeroto521 commented Nov 22, 2022

The Levenshtein distance algorithm can't be vectorized, so computing it over large datasets is very slow.

One way to speed this up is caching. The accumulate function below illustrates the effect of the cache.

def accumulate(x):
    return sum(range(x))

Without lru_cache, accumulate(100000000) takes about 5 s on my machine, whether it is the first call or the second.

from functools import lru_cache

@lru_cache
def accumulate(x):
    return sum(range(x))

After adding lru_cache, the first call to accumulate(100000000) still takes about 5 s, but a second call with the same argument returns almost instantly.
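The same idea could, in principle, be applied to a scorer. A minimal sketch, assuming rapidfuzz's Levenshtein.distance and a hypothetical cached_distance wrapper:

from functools import lru_cache
from rapidfuzz.distance import Levenshtein

@lru_cache(maxsize=None)
def cached_distance(query, choice):
    # Repeated (query, choice) pairs are answered from the cache
    # instead of recomputing the Levenshtein distance.
    return Levenshtein.distance(query, choice)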

Related to seatgeek/thefuzz#42

@maxbachmann (Member)

For large amounts of data you should not call the corresponding scorer directly in a loop:

from rapidfuzz.distance import Levenshtein

for choice in choices:
    Levenshtein.distance(query, choice)

Instead, you should use the process module:

from rapidfuzz import process
from rapidfuzz.distance import Levenshtein

process.cdist([query], choices, scorer=Levenshtein.distance)

I do not think duplicate (query, choice) pairs are very common, so chances are an entry would be evicted from the cache before you ever hit a duplicate.
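As a usage sketch (the data here is illustrative), cdist returns a score matrix with one row per query and one column per choice, computed without a Python-level loop:

from rapidfuzz import process
from rapidfuzz.distance import Levenshtein

choices = ["apple", "apples", "maple"]  # illustrative data
# One row per query, one column per choice.
matrix = process.cdist(["apple"], choices, scorer=Levenshtein.distance)
print(matrix.shape)  # (1, 3)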
