Avoid per-byte loop in cstring{,Utf8} builders #569

vdukhovni · 2023-01-13T04:14:08Z

Copy chunks of the input to the output buffer with 'memcpy', up to the shorter of the available buffer space and the "null-free" portion of the remaining string. For the UTF8 version, encoded NUL bytes are located via strstr(3).
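For readers skimming the thread, here is a rough pure model of the chunking rule for the ASCII case. It is only a sketch: `copyChunks` and `example` are hypothetical names, `B.splitAt` stands in for the `memcpy`, and the real builder step works on raw pointers and a `BufferRange` rather than on `ByteString` values.

```haskell
import qualified Data.ByteString as B
import qualified Data.ByteString.Char8 as C8
import Data.ByteString (ByteString)

-- | Hypothetical pure model (not the PR's code): the pieces a memcpy-based
-- step would copy, given the free space in each successive output buffer.
-- Each piece is the shorter of the free space and the remaining input.
copyChunks :: [Int] -> ByteString -> [ByteString]
copyChunks _ input | B.null input = []
copyChunks [] _ = []  -- model only: no more buffers were supplied
copyChunks (avail : more) input
  | avail <= 0 = copyChunks more input  -- full buffer: move to the next one
  | otherwise =
      let (piece, rest) = B.splitAt avail input  -- one "memcpy"
      in piece : copyChunks more rest

-- "hello world!" copied into buffers with 5, 0 and 32 free bytes.
example :: [ByteString]
example = copyChunks [5, 0, 32] (C8.pack "hello world!")
```

The point of the model is that the number of copies depends on the number of buffers touched, not on the number of bytes.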

vdukhovni · 2023-01-13T06:36:36Z

The emulated CI build failures are spurious/systemic, not related to the PR.

If I add a couple of new benchmarks that use somewhat longer string literals in builders:

--- a/bench/BenchAll.hs
+++ b/bench/BenchAll.hs
@@ -259,6 +259,8 @@ main = do
         , benchB' "UTF-8 String"  () $ \() -> P.cstringUtf8 "hello world\0"#
         , benchB' "String (naive)" "hello world!" fromString
         , benchB' "String"        () $ \() -> P.cstring "hello world!"#
+        , benchB' "AsciiLit64"   () $ \() -> P.cstring "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"#
+        , benchB' "Utf8Lit64"   () $ \() -> P.cstringUtf8 "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\xc0\x80xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"#
         ]
 
       , bgroup "Encoding wrappers"

The relevant benchmark results (GHC 9.4.5) are:

$ cabal run bytestring-bench -- --baseline baseline-lit-9.4.csv --csv new-lit-9.4.csv -p '/Lit64/'
Up to date
All
  Data.ByteString.Builder
    Small payload
      AsciiLit64: OK (1.43s)
        278  ns ±  19 ns, 66% less than baseline
      Utf8Lit64:  OK (1.72s)
        356  ns ±  23 ns, 58% less than baseline

All 2 tests passed (3.19s)

The baseline master branch run was:

$ cabal run bytestring-bench -- --csv baseline-lit-9.4.csv -p '/Lit64/'
Up to date
All
  Data.ByteString.Builder
    Small payload
      AsciiLit64: OK (1.07s)
        832  ns ±  79 ns
      Utf8Lit64:  OK (1.06s)
        846  ns ±  75 ns

All 2 tests passed (2.16s)

clyring · 2023-01-13T13:05:19Z

Thanks for this. I was also looking into this but hadn't pushed anywhere public because I didn't want to give myself another excuse to delay 0.11.4.0.

I agree the CI failures look spurious. The i386 CI job is currently broken, but I've retried hoping the others will pass.

Your cstring_step does more or less the same thing as byteStringCopyStep in Builder.Internal.

I will take a closer look later.

clyring

The branching logic can potentially be simplified some. Currently we ask:

* Are we done?
* Is there a null to decode?
* Is the output buffer full?
* Are there any non-nulls to copy?

But we can also ask only:

* Is there a null to decode? (If we are done, the answer will be no.)
* Does the decoded string up to and including that null to decode fit in the output buffer? (If not, copy as much as possible and report a full buffer.)

That would mean we perform extra zero-length memcpys in some cases, particularly when there are consecutive (encoded) nulls, so it's not a clear win a priori. But it may be worth investigating.
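A minimal sketch of the two-question variant, as a pure model over `ByteString`. The names `modNul`, `step` and `runSteps` are made up for illustration, `B.breakSubstring` stands in for the C search, and this is not the PR's `modUtf8_step`; it also assumes every buffer has at least one free byte.

```haskell
import qualified Data.ByteString as B
import Data.ByteString (ByteString)

-- | The overlong two-byte encoding of NUL used in modified UTF-8.
modNul :: ByteString
modNul = B.pack [0xC0, 0x80]

-- | One step: returns (bytes written, unconsumed input).
step :: Int -> ByteString -> (ByteString, ByteString)
step avail input
  | B.null input = (B.empty, B.empty)
  | otherwise =
      -- Question 1: is there a null to decode?
      let (before, rest) = B.breakSubstring modNul input
          hasNul  = not (B.null rest)
          segment = if hasNul then B.snoc before 0 else before
      -- Question 2: does the decoded segment fit in the buffer?
      in if B.length segment <= avail
           then let (out, leftover) =
                      step (avail - B.length segment) (B.drop 2 rest)
                in (segment <> out, leftover)
           else let n = min avail (B.length before)  -- copy what fits
                in (B.take n before, B.drop n input) -- report a full buffer

-- | Drive 'step' with fresh buffers of a fixed size (assumed >= 1).
runSteps :: Int -> ByteString -> [ByteString]
runSteps bufSize input
  | B.null input = []
  | otherwise =
      let (out, rest) = step bufSize input
      in out : runSteps bufSize rest
```

With consecutive encoded nulls, `before` is empty on each iteration, which corresponds to the extra zero-length copies mentioned above.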

chessai · 2023-01-15T17:53:12Z

nitpick: could Ptr "\xc0\x80"# be some top-level constant? it's used in two places and is kind of a "magic" string

vdukhovni · 2023-01-15T19:20:28Z

> nitpick: could Ptr "\xc0\x80"# be some top-level constant? it's used in two places and is kind of a "magic" string

Sure. Done. I do hope we won't forget to squash before merging...

vdukhovni · 2023-01-23T06:25:41Z

If there's anything further I need to do, please let me know...

clyring

I've been a bit sidetracked the last few weeks, sorry.

How is performance affected for strings consisting mostly of null characters? If this patch hurts it some, that's probably OK, but I'd like to know roughly by how much.

clyring · 2023-02-08T01:35:14Z

Data/ByteString/Builder/Internal.hs

+            !op' = op0 `plusPtr` (nullFree + 1)
+        nullAt' <- c_strstr ip' modifiedUtf8NUL
+        modUtf8_step ip' len' nullAt' k (BufferRange op' ope)
+    | avail > 0 = do


Same question, but also avail == 0 should be a very rare case.

Bodigrim · 2023-02-08T23:23:32Z

@vdukhovni please rebase to trigger updated CI jobs.

vdukhovni · 2023-02-09T04:45:58Z

> @vdukhovni please rebase to trigger updated CI jobs.

Done.

Bodigrim

LGTM modulo naming nitpicking!

@vdukhovni could you possibly address @clyring's questions?

Bodigrim · 2023-06-12T21:36:58Z

Data/ByteString/Builder/Internal.hs

+-- | GHC represents @NUL@ in string literals via an overlong 2-byte encoding,
+-- which is part of "modified UTF-8" (GHC does not also implement CESU-8).
+modifiedUtf8NUL :: CString
+modifiedUtf8NUL = Ptr "\xc0\x80"#


Suggested change

-modifiedUtf8NUL = Ptr "\xc0\x80"#
+modUtf8NUL = Ptr "\xc0\x80"#

Let's keep the prefix consistent.

clyring · 2023-09-27T01:50:37Z

ping @vdukhovni

Do you plan to come back to this patch? Would you like to pass this off to a maintainer?

vdukhovni · 2023-09-27T02:30:01Z

> ping @vdukhovni
>
> Do you plan to come back to this patch? Would you like to pass this off to a maintainer?

It's basically ready, right? There were just some cosmetic issues that perhaps a maintainer could tweak to suit their preference, and then I can review the result. Does that work?

vdukhovni · 2023-10-11T02:02:18Z

Perhaps I can get this over the line. What remains to be done?

Bodigrim · 2023-10-11T20:40:55Z

Data/ByteString/Builder/Internal.hs

+    -- available buffer space. If the string is long enough, we may have asked
+    -- for less than its full length, filling the buffer with the rest will go
+    -- into the next builder step.
+    | avail > nullFree = do


Could you please check with hpc that tests provide sufficient coverage of all cases here? (Sorry, I'm AFK and cannot check myself)

vdukhovni · 2024-01-21T00:27:10Z

This PR is languishing. Where do we go from here?

clyring · 2024-01-21T00:43:36Z

The main questions I had were the ones I raised in this round of review. I've just started to look into them myself since I'd really like this patch to land eventually.

Another idea that has since occurred to me is that since 0xC0 never occurs in valid UTF-8 (since it is only useful for overlong encodings of ASCII characters), it may be faster to look for the 0xC0 0x80 sequence using memchr instead of strstr.

vdukhovni · 2024-01-21T16:00:29Z

> Another idea that has since occurred to me is that since 0xC0 never occurs in valid UTF-8 (since it is only useful for overlong encodings of ASCII characters), it may be faster to look for the 0xC0 0x80 sequence using memchr instead of strstr.

Perhaps, though one might expect that an optimised C-library strstr already does the equivalent of memchr to find the start of a potential match... And indeed that's what happens in the glibc implementation.
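To make the two search strategies concrete, here is a small pure sketch. `findNulBySubstring` and `findNulByByte` are hypothetical names; `B.breakSubstring` plays the role of strstr and `B.elemIndex` the role of memchr (which, as far as I know, is what backs it in the default build). The single-byte search is only equivalent under the assumption that the input is well-formed modified UTF-8, so that 0xC0 can only start an encoded NUL.

```haskell
import qualified Data.ByteString as B
import Data.ByteString (ByteString)

-- | strstr-style: look for the full two-byte sequence 0xC0 0x80.
findNulBySubstring :: ByteString -> Maybe Int
findNulBySubstring s =
  let (before, rest) = B.breakSubstring (B.pack [0xC0, 0x80]) s
  in if B.null rest then Nothing else Just (B.length before)

-- | memchr-style: look for the lead byte 0xC0 alone.
-- Assumes well-formed modified UTF-8 input.
findNulByByte :: ByteString -> Maybe Int
findNulByByte = B.elemIndex 0xC0
```

On well-formed input both functions return the same offset, so any difference between them would be purely one of speed.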

clyring · 2024-02-11T16:28:13Z

Heads-up: I'll probably push some updates and finishing touches to this branch later today or tomorrow.

clyring · 2024-02-15T04:08:12Z

The small-builder benchmarks were set up in a terrible way that made using them to investigate performance very difficult. My recent push should hopefully fix that.

hsyl20 · 2024-02-15T10:29:58Z

Data/ByteString/Builder/Internal.hs

@@ -84,6 +84,8 @@ module Data.ByteString.Builder.Internal (
  -- , sizedChunksInsert

  , byteStringCopy
+  , asciiLiteralCopy
+  , modUtf8LitCopy


Suggested change

-  , modUtf8LitCopy
+  , modUtf8LiteralCopy

For consistency with asciiLiteralCopy (or we might as well choose to use Lit for both).

clyring · 2024-02-15T14:28:03Z

Here's what the benchmarks currently look like on my machine with ghc-9.8.1:

Baseline (3ce0346):

All
  Data.ByteString.Builder
    Small payload
      mempty:                         OK
        15.7 ns ± 810 ps
      toLazyByteString mempty:        OK
        452  ns ±  24 ns
      empty (10000 times):            OK
        126  μs ± 4.0 μs
      ensureFree 8:                   OK
        16.6 ns ± 624 ps
      intHost 1:                      OK
        25.7 ns ± 1.2 ns
      UTF-8 String (12B, naive):      OK
        101  ns ± 1.6 ns
      UTF-8 String (12B):             OK
        104  ns ±  35 ns
      UTF-8 String (64B, naive):      OK
        311  ns ±  16 ns
      UTF-8 String (64B):             OK
        356  ns ± 9.7 ns
      UTF-8 String (64B, half nulls): OK
        501  ns ±  15 ns
      UTF-8 String (64B, all nulls):  OK
        335  ns ±  12 ns
      String (12B, naive):            OK
        122  ns ± 2.5 ns
      String (12B):                   OK
        86.8 ns ± 3.1 ns
      String (64B, naive):            OK
        327  ns ±  17 ns
      String (64B):                   OK
        279  ns ±  11 ns

Topic (2603009):

All
  Data.ByteString.Builder
    Small payload
      mempty:                         OK
        15.5 ns ± 830 ps,       same as baseline
      toLazyByteString mempty:        OK
        452  ns ±  27 ns,       same as baseline
      empty (10000 times):            OK
        133  μs ± 5.8 μs,  5% more than baseline
      ensureFree 8:                   OK
        16.9 ns ± 844 ps,       same as baseline
      intHost 1:                      OK
        25.5 ns ± 818 ps,       same as baseline
      UTF-8 String (12B, naive):      OK
        489  ns ±  29 ns, 383% more than baseline
      UTF-8 String (12B):             OK
        61.2 ns ± 1.5 ns, 41% less than baseline
      UTF-8 String (64B, naive):      OK
        2.36 μs ±  82 ns, 657% more than baseline
      UTF-8 String (64B):             OK
        61.4 ns ± 3.0 ns, 82% less than baseline
      UTF-8 String (64B, half nulls): OK
        563  ns ±  21 ns, 12% more than baseline
      UTF-8 String (64B, all nulls):  OK
        765  ns ±  24 ns, 128% more than baseline
      String (12B, naive):            OK
        499  ns ±  12 ns, 310% more than baseline
      String (12B):                   OK
        24.7 ns ± 3.3 ns, 71% less than baseline
      String (64B, naive):            OK
        2.37 μs ± 108 ns, 623% more than baseline
      String (64B):                   OK
        23.3 ns ± 1.3 ns, 91% less than baseline

Lots of big changes. Some are expected:

* The ASCII memcpy implementation is much faster except perhaps on trivially small strings: ~70% less run-time on "hello world!" and ~90% less run-time on the 64-byte case.
* For modified UTF-8 strings without many embedded nulls, the new implementation is much faster. But it suffers when there are many embedded nulls, roughly breaking even at half-nulls and taking ~2.3x as long when there are only nulls. (Perhaps it would make sense to directly check if the first byte is 0xC0 before calling the C search function, to reduce this regression's magnitude a little.)
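The first-byte check floated in the parenthetical could be sketched like this. `nextNulOffset` is a hypothetical name and `B.elemIndex` is only a stand-in for whichever C search function ends up being used; nothing here has been benchmarked.

```haskell
import qualified Data.ByteString as B
import Data.ByteString (ByteString)

-- | Offset of the next encoded NUL. Peeks at the first byte so that a run
-- of consecutive encoded NULs never reaches the search call at all.
-- Assumes well-formed modified UTF-8 input.
nextNulOffset :: ByteString -> Maybe Int
nextNulOffset s = case B.uncons s of
  Just (0xC0, _) -> Just 0              -- fast path: already at an encoded NUL
  _              -> B.elemIndex 0xC0 s  -- slow path: search the remainder
```

In the all-nulls benchmark every lookup would take the fast path, which is where the regression is largest.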

But there's also a nasty surprise:

* The benchmarks for the "naive" case (where rewriting to the CString/Addr# versions is not possible) have regressed by a huge amount. I think I know why this happens: Thanks to the new {-# NOINLINE stringUtf8 #-} the primMapListBounded in stringUtf8 only gets one argument and primMapListBounded needs two arguments to inline. Ugh! I'll try reducing the syntactic arity of primMapListBounded and see if that fixes this.

I also wanted to see how the memchr implementation compares with the strstr implementation. It seems they're about the same. Here are the memchr numbers, with strstr as the "baseline":

All
  Data.ByteString.Builder
    Small payload
      mempty:                         OK
        15.4 ns ± 456 ps,       same as baseline
      toLazyByteString mempty:        OK
        445  ns ±  21 ns,       same as baseline
      empty (10000 times):            OK
        124  μs ± 6.3 μs,  6% less than baseline
      ensureFree 8:                   OK
        17.7 ns ± 658 ps,       same as baseline
      intHost 1:                      OK
        25.8 ns ± 484 ps,       same as baseline
      UTF-8 String (12B, naive):      OK
        490  ns ±  11 ns,       same as baseline
      UTF-8 String (12B):             OK
        62.7 ns ± 1.6 ns,       same as baseline
      UTF-8 String (64B, naive):      OK
        2.40 μs ±  84 ns,       same as baseline
      UTF-8 String (64B):             OK
        64.4 ns ± 1.7 ns,       same as baseline
      UTF-8 String (64B, half nulls): OK
        616  ns ±  24 ns,  9% more than baseline
      UTF-8 String (64B, all nulls):  OK
        839  ns ± 349 ns,       same as baseline
      String (12B, naive):            OK
        500  ns ±  16 ns,       same as baseline
      String (12B):                   OK
        26.3 ns ± 356 ps,       same as baseline
      String (64B, naive):            OK
        2.28 μs ±  68 ns,       same as baseline
      String (64B):                   OK
        22.1 ns ± 752 ps,       same as baseline

clyring · 2024-02-15T14:44:22Z

> * The benchmarks for the "naive" case (where rewriting to the `CString`/`Addr#` versions is not possible) have regressed by a huge amount. I think I know why this happens: Thanks to the new `{-# NOINLINE stringUtf8 #-}` the `primMapListBounded` in `stringUtf8` only gets one argument and `primMapListBounded` needs two arguments to inline. Ugh! I'll try reducing the syntactic arity of `primMapListBounded` and see if that fixes this.

I have confirmed that reducing the syntactic arity of primMapListBounded fixes this regression.
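For anyone unfamiliar with the term: GHC only inlines an INLINE function once it is applied to as many arguments as appear syntactically to the left of the `=`. A toy illustration, where `mapTwo` and `mapOne` are made-up stand-ins and not the real `primMapListBounded`:

```haskell
-- | Two syntactic arguments: GHC will only inline a call that supplies
-- both, so a partial application like @mapTwo f@ is left as a call.
mapTwo :: (a -> b) -> [a] -> [b]
mapTwo f xs = map f xs
{-# INLINE mapTwo #-}

-- | One syntactic argument: the same function, but @mapOne f@ already
-- counts as saturated for inlining purposes.
mapOne :: (a -> b) -> [a] -> [b]
mapOne f = \xs -> map f xs
{-# INLINE mapOne #-}

-- Both are used partially applied here; only 'mapOne' is eligible to inline.
incrementAll :: [Int] -> [Int]
incrementAll = mapOne (+ 1)

incrementAll' :: [Int] -> [Int]
incrementAll' = mapTwo (+ 1)
```

The two definitions compute the same thing; the only difference is when the optimiser is allowed to unfold them.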

Bodigrim · 2024-10-15T22:32:13Z

Removing milestone for now.

vdukhovni force-pushed the chunky-cstring-builder branch 2 times, most recently from 96880aa to 266d6da Compare January 13, 2023 04:30

clyring reviewed Jan 14, 2023

vdukhovni force-pushed the chunky-cstring-builder branch 3 times, most recently from 9086b60 to e6cc4a2 Compare January 14, 2023 10:42

clyring reviewed Jan 14, 2023

Bodigrim reviewed Jan 14, 2023

clyring mentioned this pull request Jan 15, 2023

0.12.0.0 release planning #573

Closed

vdukhovni commented Jan 15, 2023

clyring mentioned this pull request Jan 15, 2023

Test that our rewrite rules and list fusion actually work #574

Open

clyring added this to the 0.11.5.0 milestone Jan 19, 2023

clyring reviewed Feb 8, 2023

vdukhovni force-pushed the chunky-cstring-builder branch from 44fdcbc to 0645428 Compare February 9, 2023 04:45

Bodigrim reviewed Jun 12, 2023

clyring modified the milestones: 0.11.5.0, 0.12.1.0 Jul 6, 2023

hs-viktor added 2 commits October 9, 2023 16:57

Avoid per-byte loop in cstring{,Utf8} builders

7f1aca0

Copy chunks of the input to the output buffer with 'memcpy', up to the shorter of the available buffer space and the "null-free" portion of the remaining string. For the UTF8 version, encoded NUL bytes are located via strstr(3).

Cosmetic renames asc -> ascii

01b5f36

vdukhovni force-pushed the chunky-cstring-builder branch from 0645428 to 01b5f36 Compare October 9, 2023 20:57

Bodigrim reviewed Oct 11, 2023

This was referenced Feb 14, 2024

Remove remaining uses of FFI under -fpure-haskell clyring/bytestring#2

Closed

Remove remaining uses of FFI under -fpure-haskell #660

Merged

clyring added 6 commits February 14, 2024 22:14

Improve benchmarks for small Builders

3ce0346

* Do not measure the overhead of allocating destination chunks
* Add several more benchmarks for P.cstring and P.cstringUtf8

Merge branch 'builder-bench-improvements' into chunky-cstring-builder

f58840b

(This won't work with -fpure-haskell yet.)

Add pure-haskell implementation avoiding strstr

d674964

Update "@since" of new functions to 0.12.1.0

b297904

Add deprecated-since info to docstrings

e1aab36

Use Exts.lazy instead of Exts.noinline

2603009

The magic noinline id just isn't available with ghc-8.0...

hsyl20 reviewed Feb 15, 2024

Allow primMapListBounded to inline with one arg

cd02c61

clyring modified the milestones: 0.12.1.0, 0.12.2.0 Feb 15, 2024

Bodigrim approved these changes Feb 15, 2024

clyring mentioned this pull request Jun 5, 2024

Improve benchmarks for small Builders #680

Merged

clyring mentioned this pull request Jun 26, 2024

Fix several bugs around the 'byteString' family of Builders #671

Merged

Bodigrim removed this from the 0.12.2.0 milestone Oct 15, 2024
