Remove regex dependency from uri module #739
Is this ready? :)
I saw joshi sharing that the benchmarks were actually worse than what I measured, but from what I can tell they were experimenting with making this faster. I'll fix this tomorrow with the improved method!
Hi!! Here's a diff of the changes I made, based on your This is a 10x performance improvement for JavaScript over the BitArray version and 3 times faster than the previous regex-based one, but is still only half as fast as Erlang's stdlib implementation on that target. Basically I changed the A suspect that it is slower on Erlang because the It could be made "safer" and potentially faster by sacrificing readability and maintainability. The idea would be to rewrite everything into a single
Just a small writeup about performance before this gets merged:
In my previous testing doing lots of slicing could be somewhat slow on JS. Did we try doing this with byte indexes instead of slicing?
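For illustration, the two approaches being compared here might look roughly like this on the JavaScript target. This is a minimal sketch, not the code from this PR: `findBySlicing` and `findByIndex` are hypothetical names, and both simply look for the first occurrence of one code unit.

```javascript
// Approach 1: pop one code unit at a time, creating a new string on every step.
function findBySlicing(input, target) {
  let rest = input;
  let consumed = 0;
  while (rest.length > 0) {
    if (rest.charCodeAt(0) === target) return consumed;
    rest = rest.slice(1); // a fresh slice per code unit
    consumed += 1;
  }
  return -1;
}

// Approach 2: keep the original string and only advance an index.
// A single slice can then be taken once the boundary is known.
function findByIndex(input, target) {
  for (let i = 0; i < input.length; i++) {
    if (input.charCodeAt(i) === target) return i;
  }
  return -1;
}

// 58 is the code unit for ":".
const schemeEnd = findByIndex("https://example.com", 58);
const scheme = "https://example.com".slice(0, schemeEnd);
```

Both functions return the same index; they differ only in how many intermediate strings they create along the way.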
@@ -203,6 +204,9 @@ string_pop_grapheme(String) ->
        _ -> {error, nil}
    end.

string_pop_codeunit(<<Cp/integer, Rest/binary>>) -> {Cp, Rest};
This doesn't pop a codepoint, it pops a byte. Are we sure this is doing the right thing?
It pops a code unit, which means a unit of storage based on the encoding. So it pops a byte on Erlang because strings are UTF-8 encoded, or 16 bits (one "index increment") on JavaScript.
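To make the distinction concrete, here is a small sketch that lists the code units of the same string in both encodings. `popCodeUnit` is a hypothetical JavaScript counterpart of `string_pop_codeunit`, written only for illustration, and it assumes a non-empty string.

```javascript
// "a", "?", "é" (U+00E9) and an emoji (U+1F600).
const text = "a?\u00e9\u{1F600}";

// UTF-8 code units are bytes, which is what the Erlang target sees.
const utf8Units = Array.from(new TextEncoder().encode(text));

// UTF-16 code units are 16-bit values, which is what JavaScript indexing sees.
const utf16Units = Array.from({ length: text.length }, (_, i) => text.charCodeAt(i));

// Hypothetical helper: pops one UTF-16 code unit, not a code point or grapheme.
// Assumes `string` is non-empty.
function popCodeUnit(string) {
  return [string.charCodeAt(0), string.slice(1)];
}

console.log(utf8Units.length);  // 8 bytes: 1 + 1 + 2 + 4
console.log(utf16Units.length); // 5 units: 1 + 1 + 1 + 2 (surrogate pair)
```

The same four characters are 8 code units in UTF-8 and 5 in UTF-16, so "one step" means a different amount of data on each target.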
I believe this is safe to do for these reasons:
- the "interesting" characters we match on are all in the ASCII range (#, ?, :, etc)
- surrogate pairs / utf-8 codepoints > 127 are all encoded using only code units with values > 127, so we will never accidentally match on a code unit when we shouldn't
- any other sequence of code units is just skipped over, so this all means we never break apart code points.
- Additionally, according to rfc3986 only the ASCII range is allowed in URIs, other characters should be escaped (punycode, etc)
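The second point in the list above can be checked with a short sketch. The helper names are hypothetical; the check encodes each non-ASCII code point in both UTF-8 and UTF-16 and confirms that every resulting code unit is above 127, so none of them can be mistaken for an ASCII delimiter.

```javascript
// Code units of a single code point in both encodings.
function codeUnitsOf(char) {
  const utf8 = Array.from(new TextEncoder().encode(char));
  const utf16 = Array.from({ length: char.length }, (_, i) => char.charCodeAt(i));
  return { utf8, utf16 };
}

// True if every code unit of every non-ASCII code point in `text` is > 127.
function everyUnitIsNonAscii(text) {
  for (const char of text) { // for...of iterates by code point
    if (char.codePointAt(0) < 128) continue; // ASCII encodes as itself
    const { utf8, utf16 } = codeUnitsOf(char);
    if (![...utf8, ...utf16].every((unit) => unit > 127)) return false;
  }
  return true;
}

// 2-byte, 3-byte and 4-byte UTF-8 sequences, plus a surrogate pair in UTF-16.
const sample = "\u00e9\u20ac\u65e5\u672c\u{1F600}";
const safe = everyUnitIsNonAscii(sample);
```

This only samples a few characters, but the property follows from the encodings themselves: UTF-8 lead and continuation bytes all have the high bit set, and UTF-16 surrogates lie in the range 0xD800 to 0xDFFF.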
Codepoints are between 1 and 4 bytes in size on utf8. If we say we are popping codepoints then we should be popping codepoints.
> Additionally, according to rfc3986 only the ASCII range is allowed in URIs
Oh wow, no worries at all here then 😁 That's good to know.
Yes, which is why these functions work on code units instead to ensure the byte indices are correct and we can slice in constant time.
> Oh wow, no worries at all here then 😁 That's good to know.
Correction: rfc3986 only defines how URIs using only the ASCII range are represented, and recommends that you limit yourself to this subset or use an encoding. It doesn't forbid other characters in general.
Thank you both!!!
This PR closes #734
This version that's using pop_grapheme is about 10x slower than the implementation that uses a regex, which is unfortunate.
I've managed to make this (thanks to @joshi-monster!!) as fast as the implementation using regexes!!