Remove regex dependency from uri module #739

Merged: 8 commits, Nov 25, 2024

giacomocavalieri
Member

@giacomocavalieri giacomocavalieri commented Nov 16, 2024

This PR closes #734

The initial version using pop_grapheme was about 10x slower than the implementation that uses a regex, which is unfortunate.
Update: I've managed to make this (thanks to @joshi-monster!!) as fast as the implementation using regexes!!

@giacomocavalieri giacomocavalieri marked this pull request as ready for review November 17, 2024 18:01
@lpil
Member

lpil commented Nov 17, 2024

Is this ready? :)

@giacomocavalieri
Member Author

I saw joshi sharing that the benchmarks were actually worse than what I measured, but from what I can tell they were experimenting with making this faster. I'll fix this tomorrow with the improved method!

@joshi-monster
Contributor

joshi-monster commented Nov 18, 2024

Hi!!

Here's a diff of the changes I made, based on your pop_grapheme version: https://github.com/giacomocavalieri/stdlib/compare/fix-734...joshi-monster:stdlib:optimise-uri-parse?expand=1

This is a 10x performance improvement on JavaScript over the BitArray version and 3x faster than the previous regex-based one, but it is still only half as fast as Erlang's stdlib implementation on that target.

Basically, I changed the pop_grapheme calls to string pattern matches and introduced 2 simple but very low-level FFI functions that let me work with strings at the code unit level (a UTF-16 code unit or a UTF-8 byte).
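
For readers unfamiliar with the idea, here is a rough JavaScript sketch of what a code-unit-level helper could look like. It is not the FFI code from the linked diff: the names `popCodeunit` and `takeUntilColon` are hypothetical and chosen only for illustration.

```javascript
// Hypothetical sketch, not the FFI code from the PR.
// Pops the first UTF-16 code unit off a string and returns [codeUnit, rest].
function popCodeunit(str) {
  // charCodeAt returns NaN for an empty string; `| 0` maps that to 0.
  return [str.charCodeAt(0) | 0, str.slice(1)];
}

// Walks the input one code unit at a time until it sees ':' (0x3a).
// Returns [textBeforeColon, textAfterColon]; if there is no colon,
// the whole input is returned as the first element.
function takeUntilColon(input) {
  let rest = input;
  let taken = 0;
  while (rest.length > 0) {
    const [unit, tail] = popCodeunit(rest);
    if (unit === 0x3a) return [input.slice(0, taken), tail];
    taken += 1;
    rest = tail;
  }
  return [input, ""];
}
```

With these definitions, `takeUntilColon("https://example.com")` yields `["https", "//example.com"]`. The loop compares against a single ASCII value and never needs to know where grapheme boundaries are.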

I suspect that it is slower on Erlang because the pop_codeunit function introduces a fully-qualified call that cannot be inlined, and also causes the bit pattern context object to be thrown away.

It could be made "safer" and potentially faster by sacrificing readability and maintainability. The idea would be to rewrite everything into a single fold_codeunits loop that used a state machine to keep track of everything. This would allow us to use indices instead of slicing on JS, and would allow the Erlang compiler to optimise the loop better.
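
A minimal JavaScript sketch of that single-loop idea, under a heavily simplified grammar (scheme, then everything up to `?`, then query, then fragment). The names `foldCodeunits` and `parseSimple` are hypothetical, and a real URI parser would need more states (authority, userinfo, port, and so on). The per-step object copy is there for clarity, not speed.

```javascript
// Hypothetical helper: fold over the UTF-16 code units of a string by index.
function foldCodeunits(str, initial, step) {
  let acc = initial;
  for (let i = 0; i < str.length; i++) {
    acc = step(acc, str.charCodeAt(i), i);
  }
  return acc;
}

const COLON = 0x3a, SLASH = 0x2f, QUESTION = 0x3f, HASH = 0x23;

// Simplified state machine: the loop only records delimiter indices,
// and each component is sliced once at the end.
function parseSimple(input) {
  const marks = foldCodeunits(
    input,
    { state: "scheme", colon: -1, question: -1, hash: -1 },
    (m, unit, i) => {
      if (m.state === "fragment") return m; // nothing after '#' is a delimiter
      if (unit === HASH) return { ...m, state: "fragment", hash: i };
      if (m.state === "scheme" && unit === COLON) return { ...m, state: "rest", colon: i };
      if (m.state === "scheme" && unit === SLASH) return { ...m, state: "rest" }; // no scheme
      if (m.state !== "query" && unit === QUESTION) return { ...m, state: "query", question: i };
      return m;
    }
  );

  const hashAt = marks.hash === -1 ? input.length : marks.hash;
  const queryAt = marks.question === -1 ? hashAt : marks.question;
  return {
    scheme: marks.colon === -1 ? null : input.slice(0, marks.colon),
    rest: input.slice(marks.colon + 1, queryAt),
    query: marks.question === -1 ? null : input.slice(marks.question + 1, hashAt),
    fragment: marks.hash === -1 ? null : input.slice(marks.hash + 1),
  };
}
```

For `"https://example.com/a?x=1#top"` this sketch returns scheme `"https"`, rest `"//example.com/a"`, query `"x=1"` and fragment `"top"`, with four slices in total instead of one per character.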

@giacomocavalieri
Member Author

Just a small writeup about performance before this gets merged:

  • On the JavaScript target, this implementation is as fast as (if not a bit faster than) the original implementation that used regexes
  • On the Erlang target this implementation is still ~2x slower than uri_string:parse coming from Erlang's stdlib. So, for the moment, we've decided to keep that implementation for the Erlang target and we can always try to make ours more efficient later!

Member

@lpil lpil left a comment

In my previous testing, doing lots of slicing could be somewhat slow on JS. Did we try doing this with byte indexes instead of slicing?

@@ -203,6 +204,9 @@ string_pop_grapheme(String) ->
_ -> {error, nil}
end.

string_pop_codeunit(<<Cp/integer, Rest/binary>>) -> {Cp, Rest};
Member

@lpil lpil Nov 19, 2024

This doesn't pop a codepoint, it pops a byte. Are we sure this is doing the right thing?

Contributor

It pops a code unit, which means a unit of storage based on the encoding. So it pops a byte on Erlang because strings are UTF-8 encoded, or a 16-bit unit (one "index increment") on JavaScript.

Contributor

@joshi-monster joshi-monster Nov 19, 2024

I believe this is safe to do for these reasons:

  • The "interesting" characters we match on are all in the ASCII range (#, ?, :, etc.)
  • Surrogate pairs and UTF-8 code points above 127 are all encoded using only code units with values above 127, so we will never accidentally match on a code unit when we shouldn't
  • Any other sequence of code units is just skipped over, so we never break apart code points
  • Additionally, according to RFC 3986 only the ASCII range is allowed in URIs; other characters should be escaped (punycode, etc.)
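
The second point can be checked with a small JavaScript sketch. The helper names `utf8Units` and `utf16Units` are hypothetical and the sample strings are arbitrary; it only illustrates the encoding property and is not code from the PR.

```javascript
// TextEncoder always encodes to UTF-8.
const encoder = new TextEncoder();

// Hypothetical helpers: list the code units of a string in each encoding.
function utf8Units(str) {
  return Array.from(encoder.encode(str));
}

function utf16Units(str) {
  const units = [];
  for (let i = 0; i < str.length; i++) units.push(str.charCodeAt(i));
  return units;
}

// Non-ASCII samples: U+00E9, U+20AC, two CJK characters, and an emoji
// that needs a surrogate pair in UTF-16 and four bytes in UTF-8.
const samples = ["\u00e9", "\u20ac", "\u65e5\u672c", "\u{1F600}"];

// Every code unit of every sample is above 127 in both encodings,
// so none of them can be mistaken for '#', '?', ':' and friends.
const allAbove127 = samples.every(
  (s) => utf8Units(s).every((u) => u > 127) && utf16Units(s).every((u) => u > 127)
);

// Searching for '#' (0x23) at the code unit level finds the real delimiter:
// index 2 in UTF-8 (the first character takes two bytes), index 1 in UTF-16.
const hashInUtf8 = utf8Units("\u00e9#x").indexOf(0x23);
const hashInUtf16 = utf16Units("\u00e9#x").indexOf(0x23);
```

The index differs between the two encodings, but in both cases it points at the actual `#`, which is why slicing at that index cannot split a code point.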

Member

@lpil lpil Nov 19, 2024

Codepoints are between 1 and 4 bytes in size in UTF-8. If we say we are popping codepoints then we should be popping codepoints.

> Additionally, according to RFC 3986 only the ASCII range is allowed in URIs

Oh wow, no worries at all here then 😁 That's good to know.

Contributor

@joshi-monster joshi-monster Nov 19, 2024

Yes, which is why these functions work on code units instead, to ensure the byte indices are correct and we can slice in constant time.

> Oh wow, no worries at all here then 😁 That's good to know.

Correction: RFC 3986 only defines how URIs using only the ASCII range are represented, and recommends that you limit yourself to this subset or use an encoding. It doesn't forbid other characters in general.

Member

@lpil lpil left a comment

Thank you both!!!

@lpil lpil merged commit 55aa559 into gleam-lang:main Nov 25, 2024
7 checks passed
@giacomocavalieri giacomocavalieri deleted the fix-734 branch November 25, 2024 19:03
Linked issue: Remove regex use from gleam/uri