Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

limit common prefix in jaro-winkler #58

Merged
merged 2 commits into from
Jan 5, 2024
Merged

limit common prefix in jaro-winkler #58

merged 2 commits into from
Jan 5, 2024

Conversation

maxbachmann
Copy link
Member

The paper for Jaro-Winkler only allows the common prefix to be boosted for the first 4 characters. This can be found in Winklers paper about the Jaro-Winkler similarity (https://files.eric.ed.gov/fulltext/ED325505.pdf). More specifically in

The new string comparator metric basically modifies the basic string comparator according to whether the first few
characters in the strings being compared agree. Specifically, for i=1,2,3,4,

Results still differ from the ones I get in RapidFuzz, since I only boost the score if the Jaro similarity is above 0.7. The paper mentions:

Weight adjustments are only performed for values of cD greater than 0.60. Values below 0.60 are generally associated with pairs of strings associated with nonmatches in U.

This suggests a limit of 0.7. However there is a C implementation from Winkler which only boosts them starting at 0.7 here: https://web.archive.org/web/19990822155334/http://www.census.gov/geo/msb/stand/strcmp.c

I think so far I have only seen implementations which either always boost, or boost starting at 0.7.

@maxbachmann
Copy link
Member Author

This fixes #53

@maxbachmann
Copy link
Member Author

maxbachmann commented Dec 31, 2023

With the second commit it now has the same behavior.

@@ -4,9 +4,14 @@ This project attempts to adhere to [Semantic Versioning](http://semver.org).

## [Unreleased]

### Changed

- only boost similarity in Jaro-Winkler once the Jaro similarity exceeds 0.7
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Would it be worth mentioning this in the README and/or the function documentation? Since you said that you've also seen implementations that always boost.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes that probably makes sense.

Copy link
Member

@dguo dguo left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll leave it to you to merge this or #67 first.

@maxbachmann maxbachmann merged commit 20e3bdd into main Jan 5, 2024
7 checks passed
@maxbachmann maxbachmann deleted the jaro_winkler branch January 5, 2024 15:40
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants