Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

rewrite Damerau levenshtein #57

Closed
wants to merge 1 commit into from
Closed

Conversation

maxbachmann
Copy link
Member

@maxbachmann maxbachmann commented Dec 27, 2023

This reimplements the damerau levenshtein distance based on the paper Linear space string correction algorithm using the Damerau-Levenshtein distance from Chunchun Zhao and Sartaj Sahni.

Currently this improves the following things:

  • memory usage improved from O(N*M) to O(N+M)
  • binary size is pretty much the same when optimizing for size, but worse when optimizing for runtime
  • runtime in our benchmark reduced from 7500ns to 6600ns

There are a couple of open decisions:

hashmap

Right now this uses a normal Hashmap internally. In Rapidfuzz I use a custom hashmap implementation. This is based on the hashmap implementation used inside cpython which uses open addressing. However it removes a couple of features:

  • can't remove elements -> no tombstones
  • uses a custom hash function which is an identity hash and requires them to have no collisions
  • uses lookup table for ascii keys which are more likely
  • value types need to be defaultable

This has a couple of advantages / disadvantages:
+ significantly faster especially for ascii

  • 1400ns vs 6600ns for ascii
  • 3100ns vs 6600ns for unicode
    not really clear how both of them perform if you try to create more hash collisions.

+ little code -> in a small example it has around 40% smaller binary size when switching the hashmap in damerau levenshtein
- more code to maintain
- has a custom hashfunction which assumes that there can't be hash collisions. This works fine for basic types like integers or chars, but this assumption would break if a user wants to compare e.g. lists of strings.
- hashmap could be smaller if nothing is inlined + it is used in multiple places

interface

we should probably think about how we want the interface for generic functions to be as consistent as possible. Right now a lot of them have different interfaces.

This reimplements the damerau levenshtein distance based on the paper
`Linear space string correction algorithm using the Damerau-Levenshtein distance`
from Chunchun Zhao and Sartaj Sahni.

This improves the memory usage from O(N*M) to O(N+M)
@maxbachmann maxbachmann marked this pull request as draft December 27, 2023 17:18
@dguo
Copy link
Member

dguo commented Dec 31, 2023

The existing improvements are great!

For the hashmap decision, the speedup is certainly compelling. Just throwing this out there: would it make sense to add the custom hashmap but only use it for the string version? And continue to use the normal hashmap for the generic version to avoid having to make the assumption of no hash collisions.

For making the generic functions more consistent, I'm happy to defer to you about the best way to do that.

@maxbachmann
Copy link
Member Author

Having different implementations for types that provide the custom hashing function and types that only implement the normal hashing function could be interesting.

Since this is going to change the interface of the generic functions I want to make another release beforehand anyways, so people can get just the fixes without any breaking changes.

@dguo
Copy link
Member

dguo commented Dec 31, 2023

Sounds good. Should I go ahead and cut a release for the Jaro/Jaro-Winkler fixes, or do you have other changes to make first?

@maxbachmann
Copy link
Member Author

After #58 and #60 is probably a good time for a new release.

@maxbachmann
Copy link
Member Author

#61 supersedes this PR. I will still keep the branch around for when I take another look at this in the future.

@maxbachmann maxbachmann closed this Jan 1, 2024
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants