Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add fast path skipping UTF8 length counting #2819

Open
wants to merge 5 commits into
base: main
Choose a base branch
from
Open

Conversation

gaearon
Copy link
Contributor

@gaearon gaearon commented Sep 14, 2024

Stacked on #2817

Commits

What

Similar to #2817, I'm trying to avoid calling into TextEncoder().encode(str).byteLength for every string. After this change, I basically don't hit it in the app at all — the fast path always lets me out early.

The fast pass itself is pretty general. The idea is that .length counts UTF-16 code units, and each UTF-16 code unit corresponds to at most 3 bytes in UTF-8 encoding. So we can safely use value.length * 3 as an upper bound on what utf8Len(value) could possibly be. If this upper bound is below the minLength, the same is true for utf8Len. If this upper bound is within maxLength, the same is true for utf8Len.

Why * 3?

  • Codepoints that fit into a single UTF-16 code unit become 1 to 3 bytes in UTF-8. (Worst case is 3x.)
  • Codepoints that need two UTF-16 code units become 4 bytes in UTF-8. (Worst case is 2x.)

So .length * 3 should always give us a valid upper bound. But this needs a look from an expert.

I've added some test cases.

@gaearon gaearon changed the title Add fast path for UTF8 length counting Add fast path skipping UTF8 length counting Sep 14, 2024
@bnewbold
Copy link
Collaborator

this seems reasonable, though I should probably re-read more carefully and maybe cook up more corner-cases. I kind of suspect that it won't be as much of a win as the earlier grapheme cluster and utf8 caching patch though? I guess UTF-16 to UTF-8 does cost something through, and this probably does help with the happy path, and we do a lot of these, hrm.

Copy link
Collaborator

@devinivy devinivy left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Good thinkin! Re: the factor of 3 in here, I am quite sure that checks out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants