WIP ARM Neon/AVX2 SIMD implementation. #730

Draft
wants to merge 11 commits into master

Conversation


@samyron samyron commented Jan 17, 2025

WORK IN PROGRESS

Initial ARM Neon (and now AVX2) SIMD implementation of the convert_UTF8_to_JSON* functions.

There is still more work to be done on convert_UTF8_to_ASCII_only_JSON. Right now it only uses SIMD to skip the prefix of characters that do not need escaping; I will fix that in the coming days.

I started the implementation of x86_64 support. Currently using __m256i seems significantly slower on my machine. I plan to make it configurable.

Additionally, the algorithm for ARM Neon and x86_64 (AVX2) is the same. This may not be ideal, as there may be better instructions on one platform or the other that would be more efficient.
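
For context, the core of the approach is roughly the following (a simplified sketch pieced together from the snippets discussed below, not the exact code in this PR):

const uint8x16_t lower_bound = vdupq_n_u8(0x20);
const uint8x16_t backslash   = vdupq_n_u8('\\');
const uint8x16_t dblquote    = vdupq_n_u8('"');

while (pos + 16 < len) {
    /* Load 16 bytes and mark every byte that is a control character,
     * a backslash, or a double quote. */
    uint8x16_t chunk = vld1q_u8((const uint8_t *)&ptr[pos]);
    uint8x16_t needs_escape = vorrq_u8(vcltq_u8(chunk, lower_bound),
                                       vorrq_u8(vceqq_u8(chunk, backslash),
                                                vceqq_u8(chunk, dblquote)));
    if (vmaxvq_u8(needs_escape) == 0) {
        /* Nothing to escape in this chunk, skip it entirely. */
        pos += 16;
        continue;
    }
    /* Otherwise fall back to the scalar escape_table path for this chunk. */
    break;
}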

Benchmarks

My machine is an M1 MacBook Air with 16 GB of RAM.

% ruby -v
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]

I do not have YJIT built.

Baseline: code in master (at the time the fork was created)

% ruby -Ilib:ext benchmark/encoder-simple.rb 
== Encoding long string (124001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
                json     1.465k i/100ms
                  oj     2.089k i/100ms
Calculating -------------------------------------
                json     14.440k (± 2.5%) i/s   (69.25 μs/i) -     73.250k in   5.075975s
                  oj     20.504k (± 1.9%) i/s   (48.77 μs/i) -    104.450k in   5.096055s

Comparison:
                json:    14439.8 i/s
                  oj:    20503.7 i/s - 1.42x  faster

With SIMD

Benchmark encoding a long string.

% ruby -Ilib:ext benchmark/encoder-simple.rb
== Encoding long string (124001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
                json     3.721k i/100ms
                  oj     2.006k i/100ms
Calculating -------------------------------------
                json     39.497k (± 3.9%) i/s   (25.32 μs/i) -    200.934k in   5.094877s
                  oj     20.486k (± 2.6%) i/s   (48.81 μs/i) -    104.312k in   5.095360s

Comparison:
                json:    39497.0 i/s
                  oj:    20486.2 i/s - 1.93x  slower

Before and After (using the before & after script)

== Encoding long string (124001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after     4.386k i/100ms
Calculating -------------------------------------
               after     42.103k (± 4.4%) i/s   (23.75 μs/i) -    210.528k in   5.010512s

Comparison:
              before:    14901.3 i/s
               after:    42103.4 i/s - 2.83x  faster

Existing benchmarks

== Encoding small mixed (34 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
                json   431.030k i/100ms
                  oj   425.858k i/100ms
Calculating -------------------------------------
                json      4.443M (± 0.8%) i/s  (225.05 ns/i) -     22.414M in   5.044440s
                  oj      4.240M (± 1.2%) i/s  (235.86 ns/i) -     21.293M in   5.022778s

Comparison:
                json:  4443492.7 i/s
                  oj:  4239867.2 i/s - 1.05x  slower


== Encoding small nested array (121 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
                json   202.481k i/100ms
                  oj   173.531k i/100ms
Calculating -------------------------------------
                json      2.060M (± 3.4%) i/s  (485.48 ns/i) -     10.327M in   5.020528s
                  oj      1.740M (± 1.7%) i/s  (574.79 ns/i) -      8.850M in   5.088416s

Comparison:
                json:  2059819.6 i/s
                  oj:  1739779.8 i/s - 1.18x  slower


== Encoding small hash (65 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
                json   440.390k i/100ms
                  oj   485.263k i/100ms
Calculating -------------------------------------
                json      4.432M (± 0.9%) i/s  (225.61 ns/i) -     22.460M in   5.067690s
                  oj      4.834M (± 1.0%) i/s  (206.86 ns/i) -     24.263M in   5.019614s

Comparison:
                json:  4432356.6 i/s
                  oj:  4834123.7 i/s - 1.09x  faster


== Encoding mixed utf8 (5003001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
                json    79.000 i/100ms
                  oj    36.000 i/100ms
Calculating -------------------------------------
                json    671.096 (±11.8%) i/s    (1.49 ms/i) -      3.318k in   5.004342s
                  oj    350.462 (± 4.3%) i/s    (2.85 ms/i) -      1.764k in   5.042945s

Comparison:
                json:      671.1 i/s
                  oj:      350.5 i/s - 1.91x  slower


== Encoding mostly utf8 (5001001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
                json    58.000 i/100ms
                  oj    35.000 i/100ms
Calculating -------------------------------------
                json    657.830 (±18.7%) i/s    (1.52 ms/i) -      3.190k in   5.014119s
                  oj    346.778 (± 3.5%) i/s    (2.88 ms/i) -      1.750k in   5.052975s

Comparison:
                json:      657.8 i/s
                  oj:      346.8 i/s - 1.90x  slower


== Encoding integers (8009 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
                json     7.817k i/100ms
                  oj     7.297k i/100ms
Calculating -------------------------------------
                json     76.183k (± 6.1%) i/s   (13.13 μs/i) -    383.033k in   5.050410s
                  oj     72.630k (± 2.4%) i/s   (13.77 μs/i) -    364.850k in   5.026326s

Comparison:
                json:    76183.3 i/s
                  oj:    72630.4 i/s - same-ish: difference falls within error


== Encoding activitypub.json (52595 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
                json     2.292k i/100ms
                  oj     1.506k i/100ms
Calculating -------------------------------------
                json     22.532k (± 2.9%) i/s   (44.38 μs/i) -    114.600k in   5.090538s
                  oj     15.440k (± 2.2%) i/s   (64.77 μs/i) -     78.312k in   5.074521s

Comparison:
                json:    22531.7 i/s
                  oj:    15440.0 i/s - 1.46x  slower


== Encoding citm_catalog.json (500298 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
                json   109.000 i/100ms
                  oj    91.000 i/100ms
Calculating -------------------------------------
                json      1.102k (± 1.8%) i/s  (907.47 μs/i) -      5.559k in   5.046287s
                  oj    927.507 (± 1.7%) i/s    (1.08 ms/i) -      4.641k in   5.005234s

Comparison:
                json:     1102.0 i/s
                  oj:      927.5 i/s - 1.19x  slower


== Encoding twitter.json (466906 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
                json   220.000 i/100ms
                  oj   185.000 i/100ms
Calculating -------------------------------------
                json      2.229k (± 2.5%) i/s  (448.54 μs/i) -     11.220k in   5.035750s
                  oj      1.842k (± 3.4%) i/s  (542.90 μs/i) -      9.250k in   5.027607s

Comparison:
                json:     2229.5 i/s
                  oj:     1842.0 i/s - 1.21x  slower


== Encoding canada.json (2090234 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
                json     1.000 i/100ms
                  oj     1.000 i/100ms
Calculating -------------------------------------
                json     11.450 (± 0.0%) i/s   (87.33 ms/i) -     58.000 in   5.066321s
                  oj     11.135 (± 0.0%) i/s   (89.81 ms/i) -     56.000 in   5.029813s

Comparison:
                json:       11.5 i/s
                  oj:       11.1 i/s - 1.03x  slower


== Encoding many #to_json calls (2701 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
                json     2.434k i/100ms
                  oj     2.105k i/100ms
Calculating -------------------------------------
                json     24.448k (± 0.8%) i/s   (40.90 μs/i) -    124.134k in   5.077829s
                  oj     20.934k (± 0.9%) i/s   (47.77 μs/i) -    105.250k in   5.028205s

Comparison:
                json:    24447.9 i/s
                  oj:    20933.5 i/s - 1.17x  slower

@byroot
Member

byroot commented Jan 17, 2025

Seems quite a big deal on some benchmarks:

== Encoding integers (8009 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +YJIT +PRISM [arm64-darwin23]
Warming up --------------------------------------
               after     9.553k i/100ms
Calculating -------------------------------------
               after     96.345k (± 0.5%) i/s   (10.38 μs/i) -    487.203k in   5.056999s

Comparison:
              before:    97143.7 i/s
               after:    96344.5 i/s - same-ish: difference falls within error


== Encoding activitypub.json (52595 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +YJIT +PRISM [arm64-darwin23]
Warming up --------------------------------------
               after     2.797k i/100ms
Calculating -------------------------------------
               after     28.500k (± 1.4%) i/s   (35.09 μs/i) -    142.647k in   5.006212s

Comparison:
              before:    21655.6 i/s
               after:    28499.8 i/s - 1.32x  faster


== Encoding citm_catalog.json (500298 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +YJIT +PRISM [arm64-darwin23]
Warming up --------------------------------------
               after   134.000 i/100ms
Calculating -------------------------------------
               after      1.356k (± 0.7%) i/s  (737.44 μs/i) -      6.834k in   5.039867s

Comparison:
              before:     1369.0 i/s
               after:     1356.0 i/s - same-ish: difference falls within error


== Encoding twitter.json (466906 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +YJIT +PRISM [arm64-darwin23]
Warming up --------------------------------------
               after   277.000 i/100ms
Calculating -------------------------------------
               after      2.773k (± 0.5%) i/s  (360.57 μs/i) -     14.127k in   5.093951s

Comparison:
              before:     2407.9 i/s
               after:     2773.4 i/s - 1.15x  faster

I was initially not very enthusiastic about SIMD because we need runtime detection for it to be viable, and that seemed like a lot of work, but I very recently found https://github.com/abetlen/simdinfo, which is simple enough, so it could be worth trying.

Comment on lines 246 to 248
const uint8x16_t lower_bound = vdupq_n_u8(32);
const uint8x16_t backslash = vdupq_n_u8(92);
const uint8x16_t dblquote = vdupq_n_u8(34);
Member

Suggested change
-const uint8x16_t lower_bound = vdupq_n_u8(32);
-const uint8x16_t backslash = vdupq_n_u8(92);
-const uint8x16_t dblquote = vdupq_n_u8(34);
+const uint8x16_t lower_bound = vdupq_n_u8(' ');
+const uint8x16_t backslash = vdupq_n_u8('\\');
+const uint8x16_t dblquote = vdupq_n_u8('\"');

Presumably we can use characters here.

const uint8x16_t dblquote = vdupq_n_u8(34);

while (pos+16 < len) {
uint8x16_t chunk = vld1q_u8((const uint8_t*)&ptr[pos]);
Member

Do you think you could add a few comments?

I'm usually of the opinion that only the why should be commented, not the how, but since SIMD is not a super well known API, I think this is a good exception.

Author

Absolutely! I've been iterating pretty quickly on various versions. I'll add some documentation, as I think I'm relatively happy with the code at the moment.


invalid = vorrq_u8(invalid, has_escaped_char);

if (vmaxvq_u8(invalid) == 0) {
Member

Presumably we could check the number of leading zeros to find the exact position of the first character we need to escape, no?
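
For reference, one common way to get that position on AArch64 (a sketch of the general technique, not code from this PR) is to collapse the 0x00/0xFF byte mask into a 64-bit value, one nibble per input byte, and count trailing zeros:

static inline int first_escape_index(uint8x16_t needs_escape)
{
    /* Narrow the 128-bit mask to 64 bits: each input byte contributes one nibble. */
    uint8x8_t nibble_mask = vshrn_n_u16(vreinterpretq_u16_u8(needs_escape), 4);
    uint64_t bits = vget_lane_u64(vreinterpret_u64_u8(nibble_mask), 0);
    if (bits == 0) return -1;           /* nothing to escape in this chunk */
    return __builtin_ctzll(bits) >> 2;  /* 4 bits per byte, so divide by 4 */
}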

tmp = vorrq_u8(tmp, vandq_u8(has_dblquote, vdupq_n_u8(0x4)));

uint8_t arr[16];
vst1q_u8(arr, tmp);
Member

Ah actually, I suppose that's what I was thinking about. It tells you the position of the backslashes and quotes.

@byroot
Member

byroot commented Jan 17, 2025

Note to self: before merging any of that we should set up ARM64 CI: https://github.blog/changelog/2024-09-03-github-actions-arm64-linux-and-windows-runners-are-now-generally-available/

@byroot
Member

byroot commented Jan 17, 2025

I very recently found https://github.com/abetlen/simdinfo, which is simple enough

Another possibility is https://gcc.gnu.org/wiki/FunctionMultiVersioning, but I'm not too sure what the support is like in various compilers. Since ruby/json is included in Ruby, it's tested against some pretty old and exotic compilers.

@byroot
Member

byroot commented Jan 17, 2025

Another possibility is https://gcc.gnu.org/wiki/FunctionMultiVersioning,

Nevermind, it's limited to C++ on i386 arch, so basically unusable.

Comment on lines 267 to 272
uint8x16_t tmp = vandq_u8(too_low, vdupq_n_u8(0x1));
tmp = vorrq_u8(tmp, vandq_u8(has_backslash, vdupq_n_u8(0x2)));
tmp = vorrq_u8(tmp, vandq_u8(has_dblquote, vdupq_n_u8(0x4)));

uint8_t arr[16];
vst1q_u8(arr, tmp);
Member

I tried removing this part and simply falling back to the lookup table for the 16B chunk, and got pretty much the same performance, which isn't that surprising: the biggest gain is from being able to quickly scan the happy path.

Author

Turns out this was unnecessary anyway. It didn't matter which case was hit if the character needed to be escaped. I removed the chain of vorrq_u8's this morning.


if (vmaxvq_u8(invalid) == 0) {
pos += 16;
continue;
Member

I think we could get a lot of performance here by directly writing the NEON register into the destination.

Because ultimately we'll call MEMCPY, which will load the same bytes into a NEON register too, so it would be way more efficient to do it immediately.

I'm not too sure which intrinsic is used for that; it should be the inverse of vld1q_u8.

Author

vst1q_u8 is the opposite of vld1q_u8. If we know the capacity of the buffer is large enough, this should be a win.

Member

I was trying that a few minutes ago:

diff --git a/ext/json/ext/generator/generator.c b/ext/json/ext/generator/generator.c
index 3112e89..fd7b948 100644
--- a/ext/json/ext/generator/generator.c
+++ b/ext/json/ext/generator/generator.c
@@ -260,7 +260,12 @@ static void convert_UTF8_to_JSON(FBuffer *out_buffer, VALUE str)
         invalid = vorrq_u8(invalid, has_escaped_char);
 
         if (vmaxvq_u8(invalid) == 0) {
+            FLUSH_POS(0);
+            fbuffer_inc_capa(out_buffer, 16);
+            vst1q_u8((uint8_t*)(out_buffer->ptr + out_buffer->len), chunk);
+            out_buffer->len += 16;
             pos += 16;
+            beg = pos;
             continue;
         }

But if anything it is very slightly slower, which is surprising, but I may be doing something wrong. I'm on very spotty airport wifi, so it's not easy to look at docs, etc.

That said, I also see there are ways to load/store up to 4 128-bit registers, so trying to exploit that too may help,
and perhaps that's how MEMCPY is still competitive.
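
For reference, the multi-register loads/stores being referred to look roughly like this (a sketch; it assumes the vld1q_u8_x4/vst1q_u8_x4 intrinsics, which need a reasonably recent compiler):

/* Copy 64 clean bytes per iteration instead of 16. */
uint8x16x4_t chunks = vld1q_u8_x4((const uint8_t *)&ptr[pos]);
/* ... check all four registers for bytes that need escaping ... */
vst1q_u8_x4((uint8_t *)(out_buffer->ptr + out_buffer->len), chunks);
out_buffer->len += 64;
pos += 64;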

…8_to_JSON. It doesn't matter which case is hit when a byte needs to be escaped. In that case, remove the vorr_q chain and simply use the combined 'needs_escape' vector.
} else { \
pos++; \
}

static void convert_UTF8_to_ASCII_only_JSON(FBuffer *out_buffer, VALUE str, const unsigned char escape_table[256])
Member

FYI: you could entirely ignore convert_UTF8_to_ASCII_only_JSON; it's a very niche feature, not worth the extra complexity.

@samyron
Author

samyron commented Jan 18, 2025

Apologies about the missing extconf.h; I forgot to stash that change. I'm currently trying to use extconf.rb to detect NEON (with other implementations coming soon). Additionally, I'm trying to add configure flags so we can enable/disable this feature.

@byroot
Member

byroot commented Jan 18, 2025

No worries.

Comment on lines 295 to 297
uint8x16_t too_low = vcltq_u8(chunk, lower_bound);
uint8x16_t has_backslash = vceqq_u8(chunk, backslash);
uint8x16_t has_dblquote = vceqq_u8(chunk, dblquote);
Member

I think it might be possible to check for both too_low and has_doublequote with a single vqtbl4q_u8, because we're searching for < 32, 34 and 92.

The first two can be handled with a lookup table (covering 0 -> 64), and for the backslash we can just do a vceqq_u8.

Assuming vqtbl4q_u8 and vceqq_u8 have similar-ish performance (no idea), that would save two instructions.
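
For what it's worth, a sketch of that idea (my own illustration, not code from this PR). vqtbl4q_u8 returns 0 for out-of-range indices, so bytes >= 64 automatically map to "no escape":

/* 64-entry table: 0xFF for indices 0..31 (control characters) and 34 ('"'),
 * 0 everywhere else. The range initializer is a GCC/clang extension. */
static const uint8_t escape_lut[64] = {
    [0 ... 31] = 0xFF,
    [34]       = 0xFF,
};

uint8x16x4_t table;
table.val[0] = vld1q_u8(escape_lut +  0);
table.val[1] = vld1q_u8(escape_lut + 16);
table.val[2] = vld1q_u8(escape_lut + 32);
table.val[3] = vld1q_u8(escape_lut + 48);

/* Bytes 0..63 index the table; bytes >= 64 are out of range and yield 0. */
uint8x16_t needs_escape = vorrq_u8(vqtbl4q_u8(table, chunk),
                                   vceqq_u8(chunk, vdupq_n_u8('\\')));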

Author

That is on the list of things I'd like to try. I can't find the reference now, but somewhere in the Neon reference (I believe) it hinted that table lookups can be more expensive than a series of comparison instructions. It's definitely worth a test though.

…d flags. These can be set with the JSON_GENERATOR_CONFIGURE_OPTS environment variable prior to running rake. Additionally, set the stage for different SIMD implementations.
@@ -86,7 +86,7 @@ end

file EXT_GENERATOR_DL => EXT_GENERATOR_SRC do
cd EXT_GENERATOR_DIR do
-ruby 'extconf.rb'
+ruby "extconf.rb #{ENV['JSON_GENERATOR_CONFIGURE_OPTS']}"
Author

There is probably a better way to accomplish this.

Member

Yes, also I cleaned up the Rakefile the other day, so this will cause a merge conflict.

@@ -0,0 +1,8 @@
#ifndef EXTCONF_H
Author

This file gets generated by mkmf. It probably shouldn't be checked in...

Member

Yes it shouldn't be committed.


#define SIMD_VEC_STRIDE 16

#define simd_vec_type uint8x16_t
Author

Hopefully these help readability a bit.

Member

Honestly I prefer uint8x16_t.

Author

That's fair. I did this to be able to support different platforms in a generic way. This likely comes at the cost of efficiency, as there may be better platform-specific algorithms based on the provided ISA.

I can remove this and implement the multiple paths in the generator if you'd prefer.
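
For readers, the kind of abstraction being discussed looks roughly like this (the NEON values match the PR; the AVX2 branch is my own illustrative assumption):

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
  #include <arm_neon.h>
  #define SIMD_VEC_STRIDE 16
  #define simd_vec_type uint8x16_t
#elif defined(__AVX2__)
  #include <immintrin.h>
  #define SIMD_VEC_STRIDE 32
  #define simd_vec_type __m256i
#endif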

}
SRC
$defs.push("-DENABLE_SIMD")
append_cflags('-mavx2')
Author

This is likely not a safe assumption.

Member

Yes. ruby/json is shipped with Ruby, which means it will likely be distributed as a precompiled binary.

What we should try to do is to use the various __target__("arch=...") attributes to compile SIMD enabled versions of some functions, and then use something like https://github.com/abetlen/simdinfo to dispatch at runtime.

That's really the tricky part and blocker.

Member

https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-target-function-attribute

int sse3_func (void) __attribute__ ((__target__ ("sse3")));

Seems to be the syntax, and I think clang supports it.
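
Roughly, the pattern being discussed would look like this (a sketch with hypothetical function names; __builtin_cpu_supports is GCC/clang and x86-only):

/* Compiled with AVX2 codegen regardless of the global -m flags. */
__attribute__((target("avx2")))
static void convert_UTF8_to_JSON_avx2(FBuffer *out_buffer, VALUE str);

/* Plain portable fallback. */
static void convert_UTF8_to_JSON_scalar(FBuffer *out_buffer, VALUE str);

static void (*convert_UTF8_to_JSON_impl)(FBuffer *, VALUE);

static void init_simd_dispatch(void)
{
    /* Pick the implementation once, based on the CPU we are running on. */
    convert_UTF8_to_JSON_impl = __builtin_cpu_supports("avx2")
        ? convert_UTF8_to_JSON_avx2
        : convert_UTF8_to_JSON_scalar;
}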

Author

Looks like both clang and gcc support this idea, differently, of course...

However, I'm working on samyron#1 to refactor to use runtime SIMD detection. I also specialized the SSE version a bit. See the link for a few benchmarks.


#ifdef HAVE_X86INTRIN_H
#include <x86intrin.h>

Author

Everything in this section is likely not optimal. SSE/AVX/AVX2 do not seem to have great support for unsigned bytes.
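
For reference, the usual workaround (a sketch, assuming SSE2): there is no unsigned byte compare, but _mm_min_epu8 plus _mm_cmpeq_epi8 gives an unsigned "less than or equal" test:

#include <stdint.h>
#include <emmintrin.h>

/* Returns a 16-bit mask; bit i is set when byte i needs escaping. */
static inline int needs_escape_mask_sse2(const uint8_t *p)
{
    __m128i chunk     = _mm_loadu_si128((const __m128i *)p);
    /* min(chunk, 0x1F) == chunk  <=>  chunk <= 0x1F (unsigned). */
    __m128i too_low   = _mm_cmpeq_epi8(_mm_min_epu8(chunk, _mm_set1_epi8(0x1F)), chunk);
    __m128i backslash = _mm_cmpeq_epi8(chunk, _mm_set1_epi8('\\'));
    __m128i dblquote  = _mm_cmpeq_epi8(chunk, _mm_set1_epi8('"'));
    __m128i bad       = _mm_or_si128(too_low, _mm_or_si128(backslash, dblquote));
    return _mm_movemask_epi8(bad);
}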

@samyron samyron changed the title WIP ARM Neon SIMD implementation. WIP ARM Neon/AVX2 SIMD implementation. Jan 19, 2025
* Initial support of runtime SIMD detection.

* Work in progress implementation of dynamic dispatch.

* More clean up.

* Added support for AVX2 and quite a bit of refactoring.

* __GNU_C__ => __GNUC__

* Major refactoring to reduce the amount of duplicate code between all of the SIMD implementations. Also added a few tests with more characters to ensure the SIMD implementations are tested.

* Moved the definition of convert_UTF8_to_JSON_simd.
@samyron
Author

samyron commented Jan 24, 2025

Apologies for going dark for a few days. I've been experimenting with a lookup-table-based approach. I have this implemented in SSE4.2 and have benchmarks included on that PR. I would expect an ARM Neon implementation to be fairly straightforward, as Neon provides table lookup instructions directly in its ISA.

The current state of this PR:

  1. Implements SIMD kernels: ARM Neon, AVX2, and SSE4.2 (I need to check the instruction tables again, as it might work for SSE2).
  2. Determines which kernel to use at runtime for GCC and clang.

* shuffle, gather, blend, etc. instructions that may perform significantly
* better than what is implemented here.
*/
while (pos+32 < len) {
Member

Something I wonder is whether we actually have to make sure not to read past the end of the string.

I can't find a clear source on it, but I somewhat remember that it's actually legal to read past the end in some conditions?

Looking at Ruby's search_nonascii, which is some form of SIMD, it doesn't seem to concern itself with the end of the string; however, it does concern itself with alignment: https://github.com/ruby/ruby/blob/96a5da67864a15eea7b79e552c7684ddd182f76c/string.c#L671-L748

Author

I had some alignment-checking code a few iterations ago. I think I removed it when I switched the code to using unaligned memory access instructions. However, I think adding it back and using aligned memory operations might be faster (not sure it'll be measurable, but still likely worth doing).
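
As a sketch of what that could look like (my assumption, not code from the PR): handle bytes one at a time until the read pointer is 16-byte aligned, after which the main loop can use aligned loads (on x86 that would be _mm_load_si128 instead of _mm_loadu_si128; AArch64 has no separate aligned-load intrinsic, but aligned access can still be cheaper):

/* Scalar prologue: advance until ptr + pos is 16-byte aligned. */
while (pos < len && ((uintptr_t)(ptr + pos) & 15) != 0) {
    /* existing escape_table handling for ptr[pos] goes here */
    pos++;
}
/* Main SIMD loop starts here with an aligned pointer. */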

Comment on lines +620 to +626
switch(ch_len) {
case 0x1:
case 0x2:
case 0x4:
case 0x8:
ch_len = 9;
break;
Member

I don't get why we bother having a different ch_len for each condition we found when in the end we change them all to 9.

Might as well have just a boolean, which should allow merging all 4 registers into one with some ORs.
It would also allow packing them back into a uint32_t and using count-leading (or trailing) zero instructions to find the position.

Author

I really should benchmark this code specifically. This is all to try to save a memory access of script_safe_escape_table[ch]. It might not save enough (or anything).

@byroot
Member

byroot commented Jan 24, 2025

Thanks for all your work on this, I'll try to share my thoughts a bit:

I think I'll try to take baby steps. At first I'd like to release a version that makes very small use of SIMD, to validate that users don't run into compilation issues or other runtime issues; basically, validate that the compilation and runtime detection facilities work. (I'm not asking you to simplify your PR, just saying that I will probably end up cherry-picking things from it at first.)

Since ruby/json is mirrored into Ruby, it will be compiled with tons of different compilers, on lots of weird architectures and OS. So we probably should expect some nasty bug reports.

Then I need to figure out some proper X86 development environment for myself, otherwise it's not really viable. I have a beefy x86 Windows machine, so I can probably run some VM on it.

Once a version is out using some small amount of SIMD and it has been ironed out, I think we'll progressively be able to use it in more places.

@samyron
Author

samyron commented Jan 24, 2025

You're welcome. This has been a fun challenge. No pressure to ship any of this. Please feel free to take as much or as little as will help.

I will continue to make the code more robust. I want to add a bunch of longer test cases to ensure all of the SIMD code is hit. Additionally, I want to see if I can clean up all of the #ifdefs.

@samyron
Author

samyron commented Jan 24, 2025

After working on the lookup-table-based approach last night, I had a thought: "What if I unrolled the convert_UTF8_to_JSON loop and checked if any values were non-zero?" This is basically what the SSE4.2 lookup table is doing...

I tried unrolling by 4 and by 8, and both had similar benchmark results.

while (pos+8 < len) {
        uint8_t c1 = ptr[pos   ];
        uint8_t c2 = ptr[pos + 1];
        uint8_t c3 = ptr[pos + 2];
        uint8_t c4 = ptr[pos + 3];
        uint8_t c5 = ptr[pos + 4];
        uint8_t c6 = ptr[pos + 5];
        uint8_t c7 = ptr[pos + 6];
        uint8_t c8 = ptr[pos + 7];

        unsigned char ch_len1 = escape_table[c1];
        unsigned char ch_len2 = escape_table[c2];
        unsigned char ch_len3 = escape_table[c3];
        unsigned char ch_len4 = escape_table[c4];
        unsigned char ch_len5 = escape_table[c5];
        unsigned char ch_len6 = escape_table[c6];
        unsigned char ch_len7 = escape_table[c7];
        unsigned char ch_len8 = escape_table[c8];

        if ((ch_len1 | ch_len2 | ch_len3 | ch_len4 | ch_len5 | ch_len6 | ch_len7 | ch_len8) == 0) {
            pos += 8;
            continue;
        }

        unsigned char r[8] = {ch_len1, ch_len2, ch_len3, ch_len4, ch_len5, ch_len6, ch_len7, ch_len8};

        for (int i = 0; i < 8; ) {
            unsigned long start = pos;
            unsigned char ch = ptr[pos];
            unsigned char ch_len = r[i];
            // <snip> The switch statement processing each byte.
                i += (pos - start);
            } else {
                pos++;
                i++;
            }
        }
    }

Code in master

== Encoding mixed utf8 (5003001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [x86_64-linux]
Warming up --------------------------------------
                json    34.000 i/100ms
          json_coder    34.000 i/100ms
                  oj    20.000 i/100ms
Calculating -------------------------------------
                json    342.357 (± 4.1%) i/s    (2.92 ms/i) -      1.734k in   5.075924s
          json_coder    342.794 (± 2.3%) i/s    (2.92 ms/i) -      1.734k in   5.061579s
                  oj    204.881 (± 2.4%) i/s    (4.88 ms/i) -      1.040k in   5.078798s

Comparison:
                json:      342.4 i/s
          json_coder:      342.8 i/s - same-ish: difference falls within error
                  oj:      204.9 i/s - 1.67x  slower


== Encoding mostly utf8 (5001001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [x86_64-linux]
Warming up --------------------------------------
                json    34.000 i/100ms
          json_coder    32.000 i/100ms
                  oj    19.000 i/100ms
Calculating -------------------------------------
                json    321.234 (± 5.0%) i/s    (3.11 ms/i) -      1.632k in   5.093962s
          json_coder    331.010 (± 3.6%) i/s    (3.02 ms/i) -      1.664k in   5.033434s
                  oj    182.694 (± 6.0%) i/s    (5.47 ms/i) -    912.000 in   5.010977s

Comparison:
                json:      321.2 i/s
          json_coder:      331.0 i/s - same-ish: difference falls within error
                  oj:      182.7 i/s - 1.76x  slower

Unrolled (code above)

== Encoding mixed utf8 (5003001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [x86_64-linux]
Warming up --------------------------------------
                json    54.000 i/100ms
          json_coder    51.000 i/100ms
                  oj    20.000 i/100ms
Calculating -------------------------------------
                json    530.726 (± 4.9%) i/s    (1.88 ms/i) -      2.646k in   4.999979s
          json_coder    524.238 (± 3.8%) i/s    (1.91 ms/i) -      2.652k in   5.066895s
                  oj    194.875 (± 5.1%) i/s    (5.13 ms/i) -    980.000 in   5.042017s

Comparison:
                json:      530.7 i/s
          json_coder:      524.2 i/s - same-ish: difference falls within error
                  oj:      194.9 i/s - 2.72x  slower


== Encoding mostly utf8 (5001001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [x86_64-linux]
Warming up --------------------------------------
                json    52.000 i/100ms
          json_coder    52.000 i/100ms
                  oj    18.000 i/100ms
Calculating -------------------------------------
                json    544.152 (± 2.8%) i/s    (1.84 ms/i) -      2.756k in   5.068995s
          json_coder    547.908 (± 2.9%) i/s    (1.83 ms/i) -      2.756k in   5.034590s
                  oj    194.882 (± 2.1%) i/s    (5.13 ms/i) -    990.000 in   5.082430s

Comparison:
                json:      544.2 i/s
          json_coder:      547.9 i/s - same-ish: difference falls within error
                  oj:      194.9 i/s - 2.79x  slower

These are probably the best-case benchmarks, though. They're both really long strings being copied from one buffer to another.

@samyron
Author

samyron commented Jan 27, 2025

I've created a PR in my fork which uses bit twiddling for some significant speedups. Again, this sort of re-implements the rules of JSON parsing instead of using lookup tables to determine which action to take. This effectively inlines two memchr calls and a < 0x20 test. The speedups are impressive on both my M1 Macbook Air and Intel-based Laptop.

Just another option to consider.
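
For readers, the kind of bit twiddling being referred to is the classic word-at-a-time byte test (a generic sketch of the technique, not the exact code in that PR):

#include <stdint.h>
#include <string.h>

#define SWAR_ONES  0x0101010101010101ULL
#define SWAR_HIGHS 0x8080808080808080ULL

/* Nonzero if any byte of x is zero. */
#define HAS_ZERO(x)     (((x) - SWAR_ONES) & ~(x) & SWAR_HIGHS)
/* Nonzero if any byte of x equals n. */
#define HAS_VALUE(x, n) HAS_ZERO((x) ^ (SWAR_ONES * (n)))
/* Nonzero if any byte of x is less than n (valid for n <= 128). */
#define HAS_LESS(x, n)  (((x) - SWAR_ONES * (n)) & ~(x) & SWAR_HIGHS)

/* Returns nonzero if any of the 8 bytes at p needs JSON escaping. */
static inline uint64_t needs_escape_word(const unsigned char *p)
{
    uint64_t w;
    memcpy(&w, p, sizeof(w));     /* safe unaligned 8-byte load */
    return HAS_LESS(w, 0x20) | HAS_VALUE(w, '"') | HAS_VALUE(w, '\\');
}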
