bytes_from_utf8: Copy initial invariants as-is
The paradigm used in this commit is already used in several other places
in core.  When dealing with UTF-8, it may well be that the first part of a
string contains only characters that are the same when encoded as UTF-8
as when not (invariants).  There is a function that finds the first
position in a string that is not an invariant.  It works on a whole word
at a time instead of per-byte, effectively speeding things up by a factor
of 8 when words are 8 bytes long.
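
As a rough illustration of the word-at-a-time idea, here is a minimal sketch in plain C.  It is not the function core actually uses: the name first_variant_pos is hypothetical, and it assumes an ASCII platform, where the invariants are exactly the bytes below 0x80.

```c
#include <stddef.h>
#include <stdint.h>
#include <string.h>

/* Hypothetical helper, not Perl's implementation: returns the offset of
 * the first byte >= 0x80 in s[0..len), or len if every byte is ASCII. */
static size_t first_variant_pos(const unsigned char *s, size_t len)
{
    const uint64_t high_bits = UINT64_C(0x8080808080808080);
    size_t i = 0;

    /* Examine 8 bytes per iteration; memcpy avoids unaligned reads. */
    while (len - i >= sizeof(uint64_t)) {
        uint64_t word;
        memcpy(&word, s + i, sizeof word);
        if (word & high_bits)
            break;              /* a variant is somewhere in this word */
        i += sizeof word;
    }

    /* Finish the tail (or pinpoint the variant) one byte at a time. */
    while (i < len && s[i] < 0x80)
        i++;
    return i;
}
```

The single mask test replaces eight per-byte comparisons, which is where the speed-up comes from; the byte loop only runs on the final partial word or on the one word known to contain a variant.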

In this case, calling that function tells us that we can use memcpy() to
do the initial part of our task, before having to switch to looking at
individual bytes.
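
The overall shape of that approach can be sketched in standalone C as follows.  This is an illustration under stated assumptions, not the code in utf8.c: utf8_to_bytes and ascii_prefix_len are hypothetical names, malloc() stands in for Newx(), memcpy() for Copy(), the prefix scan is done per-byte for brevity, and failure is reported by returning NULL, which may differ from what the real function does.

```c
#include <stddef.h>
#include <stdlib.h>
#include <string.h>

/* Stand-in for the fast scanning function: length of the leading run of
 * bytes below 0x80.  (Byte-at-a-time here purely for brevity.) */
static size_t ascii_prefix_len(const unsigned char *s, size_t len)
{
    size_t i = 0;
    while (i < len && s[i] < 0x80)
        i++;
    return i;
}

/* Hypothetical helper: converts UTF-8 in s[0..*lenp) to single bytes
 * (code points 0..255).  Returns a malloc'd, NUL-terminated buffer and
 * updates *lenp; returns NULL if the input is malformed, holds a code
 * point above 0xFF, or allocation fails. */
static unsigned char *utf8_to_bytes(const unsigned char *s, size_t *lenp)
{
    const unsigned char *send = s + *lenp;
    size_t prefix = ascii_prefix_len(s, *lenp);

    unsigned char *out = malloc(*lenp + 1);
    if (!out)
        return NULL;

    /* Invariant prefix: identical in both encodings, so copy in bulk. */
    memcpy(out, s, prefix);
    unsigned char *d = out + prefix;
    s += prefix;

    /* Remainder: decode byte by byte. */
    while (s < send) {
        unsigned char c = *s++;
        if (c >= 0x80) {
            /* Only start bytes C2 and C3 encode U+0080..U+00FF. */
            if ((c != 0xC2 && c != 0xC3) || s >= send
                || (*s & 0xC0) != 0x80) {
                free(out);
                return NULL;
            }
            c = (unsigned char)(((c & 0x03) << 6) | (*s++ & 0x3F));
        }
        *d++ = c;
    }
    *d = '\0';
    *lenp = (size_t)(d - out);
    return out;
}
```

For a string that is entirely ASCII, the per-byte loop never runs and the whole conversion reduces to one scan plus one memcpy().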
khwilliamson committed Oct 21, 2024
1 parent 5a979ea commit 83c2748
Showing 1 changed file with 11 additions and 1 deletion.
12 changes: 11 additions & 1 deletion utf8.c
@@ -2679,12 +2679,22 @@ Perl_bytes_from_utf8_loc(const U8 *s, STRLEN *lenp, bool *is_utf8p, const U8** f
     }

     const U8 * const s0 = s;
-    const U8 * send = s + *lenp;
+    const U8 * const send = s + *lenp;
+    const U8 * first_variant;
+
+    /* The initial portion of 's' that consists of invariants can be Copied
+     * as-is. If it is entirely invariant, the whole thing can be Copied. */
+    if (is_utf8_invariant_string_loc(s, *lenp, &first_variant)) {
+        first_variant = send;
+    }

     U8 *d;
     Newx(d, (*lenp) + 1, U8);
+    Copy(s, d, first_variant - s, U8);

     U8 *converted_start = d;
+    d += first_variant - s;
+    s = first_variant;

     while (s < send) {
         U8 c = *s++;