ENH: use `functools.lru_cache` to speed up #292

Zeroto521 · 2022-11-22T01:59:32Z

The Levenshtein distance algorithm can't be vectorized,
so the calculation is very slow on large data.

One idea to speed it up is to use a cache.
The accumulate example below shows the effect of caching.

def accumulate(x):
    return sum(range(x))

Without lru_cache, accumulate(100000000) takes about 5s on my local machine, whether it is the first run or the second.

from functools import lru_cache

@lru_cache
def accumulate(x):
    return sum(range(x))

After adding lru_cache, the first call to accumulate(100000000) still takes about 5s.
But the second call with the same argument takes about 0s, because the result is read from the cache.
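The effect can be checked without a stopwatch by reading the cache counters. The following is a minimal sketch rather than a benchmark: it reuses the accumulate function from above, shrinks the argument so it finishes quickly (the value of n is my own choice), and relies on the cache_info() method that lru_cache attaches to the wrapped function.

```python
import time
from functools import lru_cache


@lru_cache
def accumulate(x):
    return sum(range(x))


# Smaller than the 100000000 used above so the demo finishes quickly.
n = 1_000_000

start = time.perf_counter()
first = accumulate(n)  # computed: counted as a cache miss
first_elapsed = time.perf_counter() - start

start = time.perf_counter()
second = accumulate(n)  # same argument: served from the cache
second_elapsed = time.perf_counter() - start

info = accumulate.cache_info()
print(f"first call:  {first_elapsed:.6f}s")
print(f"second call: {second_elapsed:.6f}s")
print(info)
```

The second call only helps because it repeats exactly the same argument; a call with any new argument is computed from scratch.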

closed to seatgeek/thefuzz#42


maxbachmann · 2022-11-26T18:33:29Z

For large amounts of data you should not directly call the corresponding scorer:

from rapidfuzz.distance import Levenshtein

for choice in choices:
    Levenshtein.distance(query, choice)

Instead, you should use the process module:

from rapidfuzz import process
from rapidfuzz.distance import Levenshtein

process.cdist([query], choices, scorer=Levenshtein.distance)

I do not think duplicates of both query and choice are very common, so chances are an entry would be evicted from the cache before you reach its duplicate.
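The eviction concern can be illustrated with a small sketch. Everything in it is an assumption made for illustration: cached_distance is a plain-Python stand-in for a Levenshtein scorer (not the RapidFuzz implementation), the strings are made up, and maxsize=4 is deliberately tiny so that eviction shows up after a handful of calls.

```python
from functools import lru_cache


@lru_cache(maxsize=4)  # deliberately tiny so eviction is easy to see
def cached_distance(a: str, b: str) -> int:
    """Dynamic-programming Levenshtein distance (illustrative stand-in)."""
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(
                min(
                    previous[j] + 1,         # deletion
                    current[j - 1] + 1,      # insertion
                    previous[j - 1] + cost,  # substitution
                )
            )
        previous = current
    return previous[-1]


query = "apple"
choices = ["apply", "ample", "maple", "angle", "amply", "apple pie"]

# Six distinct (query, choice) pairs: every call is a cache miss.
for choice in choices:
    cached_distance(query, choice)

# The first pair has already been evicted, so repeating it misses again.
cached_distance(query, choices[0])

info = cached_distance.cache_info()
print(info)
```

With mostly unique pairs, the cache adds a lookup and bookkeeping to every call and seldom returns a stored result. A larger maxsize delays eviction at the cost of memory, but it still only pays off when exact (query, choice) pairs repeat.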

maxbachmann closed this as completed Nov 26, 2022
