replace python-Levenshtein with rapidfuzz #10
Please merge! |
@bigtoast thoughts on this one? Seems legit but I'm no expert on the fuzz 👮🏽.
@josegonzalez @bigtoast any news on this? |
Updated the PR, assuming Python 2.7 will be dropped at some point. It now uses the pure Python fallback of rapidfuzz. This has the following advantages:
|
@kbasnayake-seatgeek @Hi-Tech-SeatGeek @josegonzalez @bigtoast |
@protux what exactly is broken with these versions of Levenshtein? Do you refer to the same issue as #37? In this case the upgrade mechanism of pip does not work properly. You can fix this in your environment using one of the following methods:
|
Any news on this? |
No, it appears seatgeek does not really monitor this repository either. Apparently they only wanted to rename it. Note that this PR converts thefuzz to a very simple compatibility wrapper around rapidfuzz, since there are some very small differences between the two. However, for most projects it would be better to simply replace their dependency on |
As an update regarding this PR: I am the new maintainer of
|
@josegonzalez I rebased this on the latest master now that the other PRs got merged |
self.assertEqual(fuzz.partial_token_sort_ratio(self.s10, self.s10a, full_process=False), 67)
self.assertEqual(fuzz.partial_token_sort_ratio(self.s10a, self.s10, full_process=False), 67)
The original implementation produced different results for these two tests (50 / 67), since it only allowed alignments behind the string, but not in front of it. This is fixed here.
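To illustrate why allowing alignments on both sides makes the score independent of argument order, here is a minimal sketch. It is not the rapidfuzz or thefuzz implementation: it uses `difflib.SequenceMatcher` as a stand-in similarity, and the names `partial_ratio_sketch` and `_best_window_score` are hypothetical. The shorter string is compared against every window of the longer one, including windows that overhang the front or the back.

```python
from difflib import SequenceMatcher


def _best_window_score(needle, haystack):
    """Best score of `needle` against any window of `haystack` (0..100)."""
    if not needle or not haystack:
        return 0.0  # arbitrary choice for this sketch, not a library rule
    m = len(needle)
    best = 0.0
    # start < 0 yields truncated windows at the front of the haystack,
    # start > len(haystack) - m yields truncated windows at the back.
    for start in range(-(m - 1), len(haystack)):
        window = haystack[max(start, 0):start + m]
        score = 100.0 * SequenceMatcher(None, needle, window).ratio()
        best = max(best, score)
    return best


def partial_ratio_sketch(a, b):
    """Order-independent partial similarity (illustrative only)."""
    if len(a) == len(b):
        # Neither string is "the shorter one", so try both roles.
        return max(_best_window_score(a, b), _best_window_score(b, a))
    short, long_ = (a, b) if len(a) < len(b) else (b, a)
    return _best_window_score(short, long_)
```

Because the roles of the two strings are decided by length rather than by argument position, and windows may overhang either end, swapping the arguments cannot change the result in this sketch.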
@maxbachmann Thanks for your input! For the time being, we'll be switching to |
test_thefuzz.py
if scorer in {fuzz.token_set_ratio, fuzz.partial_token_set_ratio, fuzz.WRatio}:
    self.assertEqual(scorer('', ''), 0)
else:
    self.assertEqual(scorer('', ''), 100)
This was already the case in thefuzz beforehand. It is unclear to me whether this was done on purpose, or whether this is a bug.
I'm on vacation but will merge this in and create a major release based off of it when I'm back. |
ping @josegonzalez |
I did now write a compatible wrapper for
both using
Since this is a complete rewrite of the affected functions, the only function not rewritten by this PR is |
As a note, this breaks, since rapidfuzz does not support generators yet. I will add support for this, since it seems generally useful.
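Until generators are supported, a caller-side workaround is to materialize the choices into a list first. The sketch below only illustrates that idea under stated assumptions: `similarity`, `extract_best_from_sequence`, `extract_best` and `team_names` are hypothetical names, and the scorer is a `difflib` stand-in rather than anything from rapidfuzz.

```python
from difflib import SequenceMatcher


def similarity(a, b):
    # Stand-in scorer for this sketch only; not the rapidfuzz scorer.
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def extract_best_from_sequence(query, choices):
    """Hypothetical extractor that needs len() and indexing (sequences only)."""
    scores = [similarity(query, choices[i]) for i in range(len(choices))]
    best_index = max(range(len(scores)), key=scores.__getitem__)
    return choices[best_index], scores[best_index], best_index


def extract_best(query, choices):
    """Accept any iterable, including generators, by copying it into a list."""
    return extract_best_from_sequence(query, list(choices))


def team_names():
    # A generator: it can be iterated once and has no len().
    yield "new york mets"
    yield "atlanta braves"
    yield "new york yankees"


best_choice, best_score, best_index = extract_best("new york mets", team_names())
```

The copy costs memory proportional to the number of choices, which is usually acceptable for the small choice lists typical of fuzzy matching.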
I basically rewrote So from what I can tell there is nothing stopping us from changing the license now. |
You must always close a file after opening it. The current code might not work on Windows + PyPy, since it's assuming the garbage collector will run (which it won't on PyPy) & the file descriptor will be released (which is required on Windows).
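A minimal sketch of the usual fix, using a `with` block so the file is closed deterministically instead of relying on the garbage collector. The helper names `read_text` and `write_then_remove` are hypothetical and not taken from this PR.

```python
import os
import tempfile


def read_text(path):
    """Read a whole text file; the handle is closed when the block exits."""
    with open(path, encoding="utf-8") as handle:
        return handle.read()


def write_then_remove(text):
    """Round-trip `text` through a temporary file and delete the file."""
    fd, path = tempfile.mkstemp(suffix=".txt")
    try:
        # os.fdopen wraps the descriptor; the with block closes it.
        with os.fdopen(fd, "w", encoding="utf-8") as handle:
            handle.write(text)
        return read_text(path)
    finally:
        # On Windows, removing a file fails while a handle to it is still open,
        # so this line only works because both handles were closed above.
        os.remove(path)
```

The same pattern behaves identically on CPython and PyPy, since it does not depend on when unreferenced objects are collected.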
Okay, so why is this still open? :) |
To my understanding, Seatgeek doesn't have the capacity to maintain thefuzz/fuzzywuzzy at this point. I offered @josegonzalez and @bigtoast to take over maintenance of the package, which they wanted to look into. However, I have not heard back about this since then (around 1 month ago).
Sorry I no longer work at SeatGeek, so I cannot help anymore. |
I already thought so, since your seatgeek e-mail address ceased to exist last week ;) I still hope @bigtoast gets back to this at some point (and would still be willing to take over maintenance). At least right now, my recommendation for anyone using fuzzywuzzy/thefuzz would be to replace the usage with rapidfuzz.
For the time being, people can also add to their requirements.txt:
or run:
|
Hi @andrewjkerr 👋 I saw from your bio that you currently work at seatgeek. Could you ask someone from seatgeek's python related devs whether they can do a release of this PR? If there is no capacity, could you give @maxbachmann write permission to https://pypi.org/project/thefuzz? What is your PyPI username @maxbachmann? Then he can either become a maintainer of this repository, or you archive this repo with a notice in the README, and his fork becomes leading? What do you think? |
My PyPI username is maxbachmann as well. I am in contact with Seatgeek in regards to taking over maintenance for both
I would personally prefer taking over the existing repository, since it would allow me to handle existing issues. However I would be fine either way. |
The pull request replaces python-Levenshtein with rapidfuzz, while still keeping the pure python fallback. This change has the following reasons:
python-Levenshtein, which will not work in Python 3.12 anymore
thefuzz and thefuzz[speedup] by always using thefuzz[speedup] and providing a pure Python fallback which has the same results
issues this fixes
benchmarks
original_benchmark.txt
new_benchmark.txt