Remove regex dependency from uri module #739

giacomocavalieri · 2024-11-16T17:53:51Z

This PR closes #734

~~This version that's using pop_grapheme is about 10x slower than the implementation that uses a regex, which is unfortunate.~~
I've managed to make this (thanks to @joshi-monster!!) as fast as the implementation using regexes!!

lpil · 2024-11-17T23:49:04Z

Is this ready? :)

giacomocavalieri · 2024-11-17T23:54:09Z

I saw joshi sharing that the benchmarks were actually worse than what I measured, but from what I can tell they were experimenting with making this faster. I'll fix this tomorrow with the improved method!

joshi-monster · 2024-11-18T01:22:25Z

Hi!!

Here's a diff of the changes I made, based on your pop_grapheme version: https://github.com/giacomocavalieri/stdlib/compare/fix-734...joshi-monster:stdlib:optimise-uri-parse?expand=1

This is a 10x performance improvement on JavaScript over the BitArray version, and 3 times faster than the previous regex-based one. On the Erlang target, though, it is still only half as fast as Erlang's stdlib implementation.

Basically, I changed the pop_grapheme calls to string pattern matches, and introduced 2 simple but very low-level FFI functions that let me work with strings at the code unit level (a UTF-16 code unit on JavaScript, a UTF-8 byte on Erlang).
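To make the code-unit idea concrete, here is a minimal JavaScript sketch of what two such helpers could look like. The names `popCodeunit`, `codeunitSlice` and `takeScheme` are hypothetical and are not taken from the PR, and the scheme scan is simplified (a real parser would also stop at `/`, `?` and `#`).

```javascript
// Hypothetical helper names; the PR's actual FFI functions may differ.

// Return the first UTF-16 code unit and the rest of the string,
// or null when the string is empty.
function popCodeunit(str) {
  if (str.length === 0) return null;
  return [str.charCodeAt(0), str.slice(1)];
}

// Slice by code unit offsets rather than by grapheme.
function codeunitSlice(str, from, length) {
  return str.slice(from, from + length);
}

// Simplified: split off everything before the first ":" by walking
// code units and counting how many were consumed.
function takeScheme(input) {
  let rest = input;
  let size = 0;
  for (;;) {
    const popped = popCodeunit(rest);
    if (popped === null) return null; // reached the end without a ":"
    const [unit, tail] = popped;
    if (unit === 0x3a) {
      // 0x3a is ":"
      return [codeunitSlice(input, 0, size), tail];
    }
    rest = tail;
    size += 1;
  }
}
```

With these definitions, `takeScheme("https://example.com")` yields the pair `"https"` and `"//example.com"`, while an input with no colon yields `null`.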

I suspect that it is slower on Erlang because the pop_codeunit function introduces a fully-qualified call that cannot be inlined, and also causes the bit pattern context object to be thrown away.

It could be made "safer" and potentially faster by sacrificing readability and maintainability. The idea would be to rewrite everything into a single fold_codeunits loop that used a state machine to keep track of everything. This would allow us to use indices instead of slicing on JS, and would allow the Erlang compiler to optimise the loop better.
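The single-loop idea could look roughly like the following on the JavaScript target. This is only a sketch under assumptions: it handles just the `path?query#fragment` tail of a URI, and the function name and state names are hypothetical rather than anything from the PR. The point is that the loop records indices only and slices once at the end.

```javascript
// Hypothetical sketch: split "path?query#fragment" in one pass.
// The loop only records indices; slicing happens once at the end.
function splitPathQueryFragment(input) {
  const PATH = 0;
  const QUERY = 1;
  const FRAGMENT = 2;
  let state = PATH;
  let queryStart = -1;
  let fragmentStart = -1;

  for (let i = 0; i < input.length && state !== FRAGMENT; i++) {
    const unit = input.charCodeAt(i);
    if (unit === 0x23) {
      // "#" starts the fragment from either earlier state
      fragmentStart = i + 1;
      state = FRAGMENT;
    } else if (unit === 0x3f && state === PATH) {
      // only the first "?" starts the query
      queryStart = i + 1;
      state = QUERY;
    }
  }

  const queryEnd = fragmentStart !== -1 ? fragmentStart - 1 : input.length;
  const pathEnd = queryStart !== -1 ? queryStart - 1 : queryEnd;

  return {
    path: input.slice(0, pathEnd),
    query: queryStart !== -1 ? input.slice(queryStart, queryEnd) : null,
    fragment: fragmentStart !== -1 ? input.slice(fragmentStart) : null,
  };
}
```

For example, `splitPathQueryFragment("/a/b?x=1#top")` gives a path of `"/a/b"`, a query of `"x=1"` and a fragment of `"top"`.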

src/gleam/uri.gleam

giacomocavalieri · 2024-11-18T12:43:46Z

Just a small writeup about performance before this gets merged:

- On the JavaScript target this implementation is as fast as (if not a bit faster than) the original implementation that was using regexes.
- On the Erlang target this implementation is still ~2x slower than uri_string:parse from Erlang's stdlib. So, for the moment, we've decided to keep that implementation for the Erlang target, and we can always try to make ours more efficient later!
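For anyone who wants to reproduce this kind of comparison on JavaScript, a rough micro-benchmark harness might look like the sketch below. It is not the benchmark used in this PR. The regex is the generic URI-splitting pattern from RFC 3986, Appendix B, which is not necessarily the pattern the old implementation used, and the function names are hypothetical. Only fragment extraction is compared, to keep the example short.

```javascript
// Generic URI-splitting regex from RFC 3986, Appendix B.
const URI_REGEX = /^(([^:\/?#]+):)?(\/\/([^\/?#]*))?([^?#]*)(\?([^#]*))?(#(.*))?/;

// Extract the fragment with the regex (capture group 9).
function fragmentWithRegex(uri) {
  const match = URI_REGEX.exec(uri);
  return match[9] === undefined ? null : match[9];
}

// Extract the fragment by scanning code units for "#" (0x23).
function fragmentWithScan(uri) {
  for (let i = 0; i < uri.length; i++) {
    if (uri.charCodeAt(i) === 0x23) return uri.slice(i + 1);
  }
  return null;
}

// Time `iterations` calls of `fn(input)`. The `sink` value keeps the
// results observable so the work is harder to optimise away.
function timeIt(fn, input, iterations) {
  const start = performance.now();
  let sink = 0;
  for (let i = 0; i < iterations; i++) {
    const result = fn(input);
    sink += result === null ? 0 : result.length;
  }
  return { ms: performance.now() - start, sink };
}
```

Calling `timeIt(fragmentWithRegex, uri, 1_000_000)` and `timeIt(fragmentWithScan, uri, 1_000_000)` on the same input gives two timings to compare. The numbers vary a lot between engines and runs, so treat them as indicative only.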

lpil

In my previous testing doing lots of slicing could be somewhat slow on JS. Did we try doing this with byte indexes instead of slicing?

lpil · 2024-11-19T17:17:18Z

src/gleam_stdlib.erl

@@ -203,6 +204,9 @@ string_pop_grapheme(String) ->
        _ -> {error, nil}
    end.

+string_pop_codeunit(<<Cp/integer, Rest/binary>>) -> {Cp, Rest};


This doesn't pop a codepoint, it pops a byte. Are we sure this is doing the right thing?

It pops a code unit, which means a unit of storage based on the encoding. So it pops a byte on Erlang, because strings are UTF-8 encoded, or 16 bits (one "index increment") on JavaScript.

I believe this is safe to do for these reasons:

- the "interesting" characters we match on are all in the ASCII range (#, ?, :, etc.)
- surrogate pairs / UTF-8 code points > 127 are all encoded using only code units with values > 127, so we will never accidentally match on a code unit when we shouldn't
- any other sequence of code units is just skipped over, so this all means we never break apart code points.

Additionally, according to RFC 3986 only the ASCII range is allowed in URIs, other characters should be escaped (punycode, etc.)
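The code unit argument above can be checked directly. The sketch below is illustrative only, with hypothetical helper names. It confirms that a few non-ASCII characters are encoded purely with code units at or above 0x80, in both UTF-8 (via `TextEncoder`) and UTF-16 (via `charCodeAt`), so a scan for an ASCII delimiter cannot stop in the middle of one.

```javascript
// Some of the ASCII delimiters a URI parser matches on.
const DELIMITERS = ["#", "?", ":", "/", "@"];

// Samples that take 2, 3 and 4 bytes in UTF-8 respectively.
const SAMPLES = ["\u00e9", "\u20ac", "\u{1F601}"];

// True if every UTF-8 byte of `text` is >= 0x80.
function utf8BytesAreHigh(text) {
  const bytes = new TextEncoder().encode(text);
  return Array.from(bytes).every((byte) => byte >= 0x80);
}

// True if every UTF-16 code unit of `text` is >= 0x80.
function utf16UnitsAreHigh(text) {
  for (let i = 0; i < text.length; i++) {
    if (text.charCodeAt(i) < 0x80) return false;
  }
  return true;
}

// True if no code unit of a non-ASCII sample can be mistaken for a delimiter.
function samplesNeverCollide() {
  const delimiterUnits = DELIMITERS.map((d) => d.charCodeAt(0));
  return (
    delimiterUnits.every((unit) => unit < 0x80) &&
    SAMPLES.every((s) => utf8BytesAreHigh(s) && utf16UnitsAreHigh(s))
  );
}
```

This only samples three characters rather than proving the property, but the property itself follows from how UTF-8 and UTF-16 are defined: bytes below 0x80 in UTF-8, and code units below 0x80 in UTF-16, only ever represent ASCII characters.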

Codepoints are between 1 and 4 bytes in size in UTF-8. If we say we are popping codepoints then we should be popping codepoints.

> Additionally, according to RFC 3986 only the ASCII range is allowed in URIs

Oh wow, no worries at all here then 😁 That's good to know.

Yes, which is why these functions work on code units instead: it ensures the byte indices are correct and lets us slice in constant time.

> Oh wow, no worries at all here then 😁 That's good to know.

Correction: RFC 3986 only defines how URIs using only the ASCII range are represented, and recommends that you limit yourself to this subset or use an encoding. It doesn't forbid other characters in general.

lpil

Thank you both!!!

giacomocavalieri force-pushed the fix-734 branch from dc0776e to bee27c5 on November 16, 2024 17:55

Remove regex dependency from uri module

d8fea83

giacomocavalieri force-pushed the fix-734 branch from bee27c5 to d8fea83 on November 16, 2024 17:56

reomve regex depency 2 electric bugaloo

e88f995

giacomocavalieri marked this pull request as ready for review November 17, 2024 18:01

use string patterns and unsafe binary loops

023437d

giacomocavalieri force-pushed the fix-734 branch from 1b4eb73 to 023437d on November 18, 2024 08:59

giacomocavalieri added 2 commits November 18, 2024 10:32

small refactoring

d8883f3

todo comment!

26c4fa5

joshi-monster reviewed Nov 18, 2024

giacomocavalieri added 3 commits November 18, 2024 11:23

avoid iterating over the same part twice

19fc82d

remove intermediate data structure and use consistent naming

096ee04

small fix

4147adb

lpil reviewed Nov 19, 2024

lpil approved these changes Nov 25, 2024

lpil merged commit 55aa559 into gleam-lang:main Nov 25, 2024
7 checks passed

giacomocavalieri deleted the fix-734 branch November 25, 2024 19:03
