Add PyUnicode_Equal() function #43
Comments
API shape seems fine, I prefer [...]. I find the performance argument fairly compelling, since ordering requires a lot of calculation (and realistically requires a number of options, including locale, to be correct). Having a fast equals method is likely to prevent people from relying on interning strings and comparing pointers. However, if the motivation really is that [...]. Knowing whether the people using the private function were looking specifically for speed or just using the first function their IDE told them about would help us make an informed decision here. I don't like when the proposal doesn't actually solve the problem given as the motivation (but am happy to change the motivation in this case).
Also, we need to rule out [...]
Not that my opinion matters much here, but I might as well share my thoughts: I think a faster comparison protocol is a great idea (and one I'd be happy to help work on), but probably better suited for [...]. Though, maybe it's going a bit far to add this to the stable ABI -- the main beneficiaries from this are people who weren't using the limited API anyway, so really we're just adding another function to do the same thing, but with less version compatibility than [...]
Both projects use it for handling keyword arguments. I expected either that or attribute names (e.g. implementing descriptor-like behaviour in [...]). I think locale-aware comparisons have no place in the [...]. +1 for [...]
As for naming, [...]
I ran a benchmark on strings of 10 characters:
It returns [...]
On strings that are equal? Or strings that are not equal? The distribution of strings makes a big difference here. If they're all the same length and random, then all three functions will have to compare the first character. If they're different lengths, Equal will be faster, and if they share prefixes the characteristics will change again. That said, performance in an equality function typically matters the most for when the strings are not equal, since that's going to be the more common tight loop (i.e. when you find the one that matches, you stop searching). But if you want to say that performance isn't that compelling, then I'll say they can all switch to one of the other functions. This is an easier argument to make if you say that performance is the main reason they need it ;)
You're thinking of [...]
Victor, compare also [...]. To add confusion, there is also the much more widely used (but in less performance-critical code) [...]
Here are updated benchmarks: python/cpython#124504 (comment) (now with CPU isolation).
_PyUnicode_Equal() is 1.1x faster than _PyUnicode_EQ() (-1.0 ns) ;-)
How can [...]
We should modify [...]
Thank you Victor. I think that adding [...]. But the key point of [...]. Having different requirements and guarantees for [...]
According to you, what should the behavior be if you compare a string to an integer? Return 0? Raise an exception? Fail with a fatal error? What if both arguments are not strings? The important part of the API is: [...]
What is the behavior if you compare a string to NULL? What is the behavior if you compare non-canonicalized strings (for example with kind=UCS4, but all code points < 0x10000)? This is undefined behavior. Very few functions check that the argument is not NULL. Functions usually do not check that strings or integers are in canonical form; they assume that this is true.
@zooba @encukou: @serhiy-storchaka suggests declaring that comparing objects that are not strings is undefined behavior, for best performance. I suggest checking the types and raising a TypeError in this case. What's your call on this question?
No undefined behaviour in the limited API. We must check types. (I would lean towards no undefined behaviour in any public API, including unstable APIs, but I'm prepared to consider exceptional cases. Especially in "unstable". But definitely not in the limited API.)
My benchmark was on equal strings of 10 characters. New benchmark on unequal strings of 10 characters: [...]
Not by enough to make me really excited about it. I guess my gut instinct about perf benefits didn't work this time - perhaps there are other (unavoidable?) overheads.
Sorry :-) Well, there are cases where it's more interesting.
Different string length
str1 = "1" * 10
str2 = "1"
PyUnicode_Equal() is 3.0x faster than PyUnicode_Compare(). It becomes 7.2x faster for strings of 1000 and 1001 characters. In this case, PyUnicode_Equal() is O(1) and PyUnicode_Compare() is O(n).
Different string kinds (UCS-1 and UCS-2)
str1 = "1" * 9 + "$"
str2 = "1" * 9 + "\u20ac"
PyUnicode_Equal() is 3.6x faster than PyUnicode_Compare(). It becomes 171.6x faster for strings of 1000 characters. In this case, PyUnicode_Equal() is O(1) and PyUnicode_Compare() is O(n).
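The O(1) versus O(n) behavior in these two cases can be illustrated with a toy model in plain C. This is only a sketch, not CPython's implementation: `toy_str`, `toy_read`, `toy_equal` and `toy_compare` are hypothetical names, the struct is not the real `str` layout, and the model assumes strings are stored in canonical form (the smallest kind that fits), so that a different kind implies different contents.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Toy stand-in for a Python str object; NOT CPython's real layout. */
typedef struct {
    size_t length;     /* number of code points */
    int kind;          /* bytes per code point: 1, 2 or 4 */
    const void *data;  /* length * kind bytes of character data */
} toy_str;

/* Read code point i, whatever the storage kind is. */
static uint32_t toy_read(const toy_str *s, size_t i)
{
    switch (s->kind) {
    case 1:  return ((const uint8_t *)s->data)[i];
    case 2:  return ((const uint16_t *)s->data)[i];
    default: return ((const uint32_t *)s->data)[i];
    }
}

/* Equality only: a differing length or kind settles the answer in O(1).
 * Assumes canonical storage, so equal strings always share a kind. */
static int toy_equal(const toy_str *a, const toy_str *b)
{
    assert(a != NULL && b != NULL);
    if (a->length != b->length || a->kind != b->kind) {
        return 0;
    }
    return memcmp(a->data, b->data, a->length * (size_t)a->kind) == 0;
}

/* Ordering: must walk the common prefix to find the first difference,
 * so it is O(n) when the strings share a long prefix. */
static int toy_compare(const toy_str *a, const toy_str *b)
{
    assert(a != NULL && b != NULL);
    size_t n = a->length < b->length ? a->length : b->length;
    for (size_t i = 0; i < n; i++) {
        uint32_t ca = toy_read(a, i);
        uint32_t cb = toy_read(b, i);
        if (ca != cb) {
            return ca < cb ? -1 : 1;
        }
    }
    if (a->length == b->length) {
        return 0;
    }
    return a->length < b->length ? -1 : 1;
}
```

In this model, comparing a 1000-character string with a 1001-character string that shares its prefix makes `toy_compare` read 1000 code points, while `toy_equal` returns after the length check.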
Oh yeah, I forgot about that case (and definitely remembered it the first time around). Okay, consider me excited about the perf benefits again.
Under the hood, both [...]. Now, from your comparison of [...]. As for undefined behavior, all functions in the C API have undefined behavior if passed pointers to freed memory. Many (including all the functions mentioned here) have undefined behavior if passed NULL pointers ([...])
I made a small study of the PyUnicode functions in the limited C API. In this study, 1/3 of the functions don't check types, whereas 2/3 do. PyUnicode_GetLength(), which is commonly used, checks its argument type.
Don't check argument types (7): [...]
Check argument types (14): [...]
Compared to PyUnicode_Compare(), the difference with PyUnicode_Equal() is that it is O(1) if the string lengths are different or if the string kinds are different. I'm not sure that 0.4-0.9 ns is a big deal for such an API. It's already faster than PyUnicode_Compare() and PyUnicode_RichCompare() in all tested cases.
@erlend-aasland @zooba: Would you mind voting? @erlend-aasland: What's your opinion on checking the argument types? Do you prefer raising TypeError, at the cost of a little performance overhead, or having undefined behavior?
Voted in favour (still begrudging the name, but I suspect I'm outvoted on adding [...])
Oh right, I just added it to the PR to not forget.
New API should take [...]. We had this conversation around [...]. (If necessary, we can add [...])
Honestly, for less than a nanosecond, I don't think that it's worth it to have two functions.
Well, it was not a principled objection.
✅
I share Steve's view: I prefer safe APIs over UB.
All members voted in favor of PyUnicode_Equal(): the API is approved. Thanks everybody. I close the issue.
I propose to add a public PyUnicode_Equal(a, b) function to the limited C API 3.14 to replace the private _PyUnicode_EQ() function.
API:
int PyUnicode_Equal(PyObject *a, PyObject *b)
Return 1 if a is equal to b.
Return 0 if a is not equal to b.
Set a TypeError exception and return -1 if a or b is not a Python str object.
Python 3.13 moved the private _PyUnicode_EQ() function to the internal C API. mypy and Pyodide are using it.
The existing PyUnicode_Compare() isn't enough and has an issue. PyUnicode_Compare() returns -1 for "less than" but also for the error case. The caller must call PyErr_Occurred(), which is inefficient. It causes an ambiguous return value: capi-workgroup/problems#1
PyUnicode_Equal() has no such ambiguous return value (-1 only means error). Moreover, PyUnicode_Equal() may be a little bit faster than PyUnicode_Compare(), but I'm not sure about that.
Vote to add this API:
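The return-value ambiguity described in this proposal can be sketched with a small plain-C model. Everything here is hypothetical and only mimics the two conventions: `toy_obj`, `toy_error_set`, `model_compare` and `model_equal` are illustrative names, and the global flag merely stands in for the interpreter's "exception set" state that PyErr_Occurred() would report. It is not how CPython implements either function.

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>
#include <string.h>

/* Toy object: is_str plays the role of the "is a Python str" type check. */
typedef struct {
    bool is_str;
    const char *text;  /* only meaningful when is_str is true */
} toy_obj;

/* Hypothetical stand-in for the interpreter's pending-exception state. */
static bool toy_error_set = false;

/* Compare-style convention: -1, 0 or 1 for ordering, but -1 is ALSO
 * returned on error, so the caller has to inspect toy_error_set. */
static int model_compare(const toy_obj *a, const toy_obj *b)
{
    assert(a != NULL && b != NULL);
    if (!a->is_str || !b->is_str) {
        toy_error_set = true;  /* models raising TypeError */
        return -1;
    }
    int c = strcmp(a->text, b->text);
    return (c > 0) - (c < 0);
}

/* Equal-style convention: 1 if equal, 0 if not equal, -1 only on error. */
static int model_equal(const toy_obj *a, const toy_obj *b)
{
    assert(a != NULL && b != NULL);
    if (!a->is_str || !b->is_str) {
        toy_error_set = true;  /* models raising TypeError */
        return -1;
    }
    return strcmp(a->text, b->text) == 0;
}
```

With `model_compare`, a result of -1 for `"abc"` versus `"abd"` looks the same as the -1 returned for a non-string argument, and only the separate error flag tells them apart. With `model_equal`, a negative result can be treated as an error without any second check.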