Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Avoid per-byte loop in cstring{,Utf8} builders #569

Open
wants to merge 9 commits into
base: master
Choose a base branch
from

Conversation

vdukhovni
Copy link
Contributor

Copy chunks of the input to the output buffer with 'memcpy', up to the shorter of the available buffer space and the "null-free" portion of the remaining string. For the UTF8 version, encoded NUL bytes are located via strstr(3).

@vdukhovni vdukhovni force-pushed the chunky-cstring-builder branch 2 times, most recently from 96880aa to 266d6da Compare January 13, 2023 04:30
@vdukhovni
Copy link
Contributor Author

The emulated CI build failures are spurious/systemic, not related to the PR.

If I add a couple of new benchmarks that use somewhat longer string literals in builders:

--- a/bench/BenchAll.hs
+++ b/bench/BenchAll.hs
@@ -259,6 +259,8 @@ main = do
         , benchB' "UTF-8 String"  () $ \() -> P.cstringUtf8 "hello world\0"#
         , benchB' "String (naive)" "hello world!" fromString
         , benchB' "String"        () $ \() -> P.cstring "hello world!"#
+        , benchB' "AsciiLit64"   () $ \() -> P.cstring "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"#
+        , benchB' "Utf8Lit64"   () $ \() -> P.cstringUtf8 "xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx\xc0\x80xxxxxxxxxxxxxxxxxxxxxxxxxxxxxxx"#
         ]
 
       , bgroup "Encoding wrappers"

The relevant benchmark results (GHC 9.4.5) are:

$ cabal run bytestring-bench -- --baseline baseline-lit-9.4.csv --csv new-lit-9.4.csv -p '/Lit64/'
Up to date
All
  Data.ByteString.Builder
    Small payload
      AsciiLit64: OK (1.43s)
        278  ns ±  19 ns, 66% less than baseline
      Utf8Lit64:  OK (1.72s)
        356  ns ±  23 ns, 58% less than baseline

All 2 tests passed (3.19s)

The baseline master branch run was:

$ cabal run bytestring-bench -- --csv baseline-lit-9.4.csv -p '/Lit64/'
Up to date
All
  Data.ByteString.Builder
    Small payload
      AsciiLit64: OK (1.07s)
        832  ns ±  79 ns
      Utf8Lit64:  OK (1.06s)
        846  ns ±  75 ns

All 2 tests passed (2.16s)

@clyring
Copy link
Member

clyring commented Jan 13, 2023

Thanks for this. I was also looking into this but hadn't pushed anywhere public because I didn't want to give myself another excuse to delay 0.11.4.0.

I agree the CI failures look spurious. The i386 CI job is currently broken, but I've retried hoping the others will pass.

Your cstring_step does more or less the same thing as byteStringCopyStep in Builder.Internal.

I will take a closer look later.

Copy link
Member

@clyring clyring left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

The branching logic can potentially be simplified some. Currently we ask:

  1. Are we done?
  2. Is there a null to decode?
  3. Is the output buffer full?
  4. Are there any non-nulls to copy?

But we can also ask only:

  1. Is there a null to decode? (If we are done, the answer will be no.)
  2. Does the decoded string up to and including that null to decode fit in the output buffer? (If not, copy as much as possible and report a full buffer.)

That would mean we perform extra zero-length memcpys in some cases, particularly when there are consecutive (encoded) nulls, so it's not a clear win a priori. But it may be worth investigating.

Data/ByteString/Internal.hs Outdated Show resolved Hide resolved
Data/ByteString/Builder/Prim.hs Outdated Show resolved Hide resolved
Data/ByteString/Builder/Prim.hs Outdated Show resolved Hide resolved
@vdukhovni vdukhovni force-pushed the chunky-cstring-builder branch 3 times, most recently from 9086b60 to e6cc4a2 Compare January 14, 2023 10:42
@chessai
Copy link
Member

chessai commented Jan 15, 2023

nitpick: could Ptr "\xc0\x80"# be some top-level constant? it's used in two places and is kind of a "magic" string

@vdukhovni
Copy link
Contributor Author

nitpick: could Ptr "\xc0\x80"# be some top-level constant? it's used in two places and is kind of a "magic" string

Sure. Done. I do hope we won't forget to squash before merging...

Data/ByteString/Builder.hs Outdated Show resolved Hide resolved
@vdukhovni
Copy link
Contributor Author

If there's anything further I need to do, please let me know...

Copy link
Member

@clyring clyring left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I've been a bit sidetracked the last few weeks, sorry.

How is performance affected for strings consisting mostly of null characters? If this patch hurts it some, that's probably OK, but I'd like to know roughly by how much.

Data/ByteString/Builder/Internal.hs Show resolved Hide resolved
!op' = op0 `plusPtr` (nullFree + 1)
nullAt' <- c_strstr ip' modifiedUtf8NUL
modUtf8_step ip' len' nullAt' k (BufferRange op' ope)
| avail > 0 = do
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same question, but also avail == 0 should be a very rare case.

@Bodigrim
Copy link
Contributor

Bodigrim commented Feb 8, 2023

@vdukhovni please rebase to trigger updated CI jobs.

@vdukhovni
Copy link
Contributor Author

@vdukhovni please rebase to trigger updated CI jobs.

Done.

Copy link
Contributor

@Bodigrim Bodigrim left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM module naming nitpicking!

@vdukhovni could you possibly address @clyring's questions?

Data/ByteString/Builder/Internal.hs Outdated Show resolved Hide resolved
-- | GHC represents @NUL@ in string literals via an overlong 2-byte encoding,
-- which is part of "modified UTF-8" (GHC does not also implement CESU-8).
modifiedUtf8NUL :: CString
modifiedUtf8NUL = Ptr "\xc0\x80"#
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
modifiedUtf8NUL = Ptr "\xc0\x80"#
modUtf8NUL = Ptr "\xc0\x80"#

Let's keep the prefix consistent.

@clyring clyring modified the milestones: 0.11.5.0, 0.12.1.0 Jul 6, 2023
@clyring
Copy link
Member

clyring commented Sep 27, 2023

ping @vdukhovni

Do you plan to come back to this patch? Would you like to pass this off to a maintainer?

@vdukhovni
Copy link
Contributor Author

ping @vdukhovni

Do you plan to come back to this patch? Would you like to pass this off to a maintainer?

It's basically ready, right. There were just some cosmetic issues that perhaps a maintainer could tweak to suite their preference and I can review the result? Does that work?

Copy chunks of the input to the output buffer with 'memcpy', up to the shorter
of the available buffer space and the "null-free" portion of the remaining
string.  For the UTF8 version, encoded NUL bytes are located via strstr(3).
@vdukhovni
Copy link
Contributor Author

Perhaps I can get this over the line. What remains to be done?

Data/ByteString/Builder/Internal.hs Outdated Show resolved Hide resolved
-- available buffer space. If the string is long enough, we may have asked
-- for less than its full length, filling the buffer with the rest will go
-- into the next builder step.
| avail > nullFree = do
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Could you please check with hpc that tests provide sufficient coverage of all cases here? (Sorry, I'm AFK and cannot check myself)

@vdukhovni
Copy link
Contributor Author

This PR is languishing. Where do we go from here?

@clyring
Copy link
Member

clyring commented Jan 21, 2024

The main questions I had were the ones I raised in this round of review. I've just started to look into them myself since I'd really like this patch to land eventually.

Another idea that has since occurred to me is that since 0xC0 never occurs in valid UTF-8 (since it is only useful for overlong encodings of ASCII characters), it may be faster to look for the 0xC0 0x80 sequence using memchr instead of strstr.

@vdukhovni
Copy link
Contributor Author

Another idea that has since occurred to me is that since 0xC0 never occurs in valid UTF-8 (since it is only useful for overlong encodings of ASCII characters), it may be faster to look for the 0xC0 0x80 sequence using memchr instead of strstr.

Perhaps, though one might expect that an optimised C-library strstr already does the equivalent of memchr to find the start of a potential match... And indeed that's what happens in the glibc implementation

@clyring
Copy link
Member

clyring commented Feb 11, 2024

Heads-up: I'll probably push some updates and finishing touches to this branch later today or tomorrow.

@clyring
Copy link
Member

clyring commented Feb 15, 2024

The small-builder benchmarks were set up in a terrible way that made using them to investigate performance very difficult. My recent push should hopefully fix that.

The magic noinline id just isn't available with ghc-8.0...
@@ -84,6 +84,8 @@ module Data.ByteString.Builder.Internal (
-- , sizedChunksInsert

, byteStringCopy
, asciiLiteralCopy
, modUtf8LitCopy
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
, modUtf8LitCopy
, modUtf8LiteralCopy

For consistency with asciiLiteralCopy (or we might as well chose to use Lit for both)

@clyring
Copy link
Member

clyring commented Feb 15, 2024

Here's what the benchmarks currently look like on my machine with ghc-9.8.1:

Baseline (3ce0346):

All
  Data.ByteString.Builder
    Small payload
      mempty:                         OK
        15.7 ns ± 810 ps
      toLazyByteString mempty:        OK
        452  ns ±  24 ns
      empty (10000 times):            OK
        126  μs ± 4.0 μs
      ensureFree 8:                   OK
        16.6 ns ± 624 ps
      intHost 1:                      OK
        25.7 ns ± 1.2 ns
      UTF-8 String (12B, naive):      OK
        101  ns ± 1.6 ns
      UTF-8 String (12B):             OK
        104  ns ±  35 ns
      UTF-8 String (64B, naive):      OK
        311  ns ±  16 ns
      UTF-8 String (64B):             OK
        356  ns ± 9.7 ns
      UTF-8 String (64B, half nulls): OK
        501  ns ±  15 ns
      UTF-8 String (64B, all nulls):  OK
        335  ns ±  12 ns
      String (12B, naive):            OK
        122  ns ± 2.5 ns
      String (12B):                   OK
        86.8 ns ± 3.1 ns
      String (64B, naive):            OK
        327  ns ±  17 ns
      String (64B):                   OK
        279  ns ±  11 ns

Topic (2603009):

All
  Data.ByteString.Builder
    Small payload
      mempty:                         OK
        15.5 ns ± 830 ps,       same as baseline
      toLazyByteString mempty:        OK
        452  ns ±  27 ns,       same as baseline
      empty (10000 times):            OK
        133  μs ± 5.8 μs,  5% more than baseline
      ensureFree 8:                   OK
        16.9 ns ± 844 ps,       same as baseline
      intHost 1:                      OK
        25.5 ns ± 818 ps,       same as baseline
      UTF-8 String (12B, naive):      OK
        489  ns ±  29 ns, 383% more than baseline
      UTF-8 String (12B):             OK
        61.2 ns ± 1.5 ns, 41% less than baseline
      UTF-8 String (64B, naive):      OK
        2.36 μs ±  82 ns, 657% more than baseline
      UTF-8 String (64B):             OK
        61.4 ns ± 3.0 ns, 82% less than baseline
      UTF-8 String (64B, half nulls): OK
        563  ns ±  21 ns, 12% more than baseline
      UTF-8 String (64B, all nulls):  OK
        765  ns ±  24 ns, 128% more than baseline
      String (12B, naive):            OK
        499  ns ±  12 ns, 310% more than baseline
      String (12B):                   OK
        24.7 ns ± 3.3 ns, 71% less than baseline
      String (64B, naive):            OK
        2.37 μs ± 108 ns, 623% more than baseline
      String (64B):                   OK
        23.3 ns ± 1.3 ns, 91% less than baseline

Lots of big changes. Some are expected:

  • The ASCII memcpy implementation is much faster except perhaps on trivially small strings: ~70% less run-time on "hello world!" and ~90% less run-time on the 64-byte case.
  • For modified UTF-8 strings without many embedded nulls, the new implementation is much faster. But it suffers when there are many embedded nulls, roughly breaking even at half-nulls and taking ~2.3x as long when there are only nulls. (Perhaps it would make sense to directly check if the first byte is 0xC0 before calling the C search function, to reduce this regression's magnitude a little.)

But there's also a nasty surprise:

  • The benchmarks for the "naive" case (where rewriting to the CString/Add# versions is not possible) have regressed by a huge amount. I think I know why this happens: Thanks to the new {-# NOINLINE stringUtf8 #-} the primMapListBounded in stringUtf8 only gets one argument and primMapListBounded needs two arguments to inline. Ugh! I'll try reducing the syntactic arity of primMapListBounded and see if that fixes this.

I also wanted to see how the memchr implementation compares with the strstr implementation. It seems they're about the same. Here are the memchr numbers, with strstr as the "baseline":

All
  Data.ByteString.Builder
    Small payload
      mempty:                         OK
        15.4 ns ± 456 ps,       same as baseline
      toLazyByteString mempty:        OK
        445  ns ±  21 ns,       same as baseline
      empty (10000 times):            OK
        124  μs ± 6.3 μs,  6% less than baseline
      ensureFree 8:                   OK
        17.7 ns ± 658 ps,       same as baseline
      intHost 1:                      OK
        25.8 ns ± 484 ps,       same as baseline
      UTF-8 String (12B, naive):      OK
        490  ns ±  11 ns,       same as baseline
      UTF-8 String (12B):             OK
        62.7 ns ± 1.6 ns,       same as baseline
      UTF-8 String (64B, naive):      OK
        2.40 μs ±  84 ns,       same as baseline
      UTF-8 String (64B):             OK
        64.4 ns ± 1.7 ns,       same as baseline
      UTF-8 String (64B, half nulls): OK
        616  ns ±  24 ns,  9% more than baseline
      UTF-8 String (64B, all nulls):  OK
        839  ns ± 349 ns,       same as baseline
      String (12B, naive):            OK
        500  ns ±  16 ns,       same as baseline
      String (12B):                   OK
        26.3 ns ± 356 ps,       same as baseline
      String (64B, naive):            OK
        2.28 μs ±  68 ns,       same as baseline
      String (64B):                   OK
        22.1 ns ± 752 ps,       same as baseline

@clyring
Copy link
Member

clyring commented Feb 15, 2024

* The benchmarks for the "naive" case (where rewriting to the `CString`/`Add#` versions is not possible) have regressed by a huge amount. I think I know why this happens: Thanks to the new `{-# NOINLINE stringUtf8 #-}` the `primMapListBounded` in `stringUtf8` only gets one argument and `primMapListBounded` needs two arguments to inline. Ugh! I'll try reducing the syntactic arity of `primMapListBounded` and see if that fixes this.

I have confirmed that reducing the syntactic arity of primMapListBounded fixes this regression.

@clyring clyring modified the milestones: 0.12.1.0, 0.12.2.0 Feb 15, 2024
@Bodigrim
Copy link
Contributor

Removing milestone for now.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

6 participants