WIP ARM Neon/AVX2 SIMD implementation. #730

Draft
wants to merge 11 commits into master

Conversation


@samyron samyron commented Jan 17, 2025

WORK IN PROGRESS

Initial ARM Neon (and now AVX2) SIMD implementation of the convert_UTF8_to_JSON* functions.

There is still more work to be done on convert_UTF8_to_ASCII_only_JSON. Right now it only uses SIMD to skip the prefix of characters that do not need escaping; I will fix that in the coming days.

I started the implementation of x86_64 support. Currently using __m256i seems significantly slower on my machine. I plan to make it configurable.

Additionally, the algorithm for ARM Neon and x86_64 (AVX2) is the same. This may not be ideal, as there may be better instructions on one platform or the other that would be more efficient.
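
For context, the core of the approach is roughly the following (a simplified sketch pieced together from the snippets discussed below, not the exact code in this PR):

const uint8x16_t lower_bound = vdupq_n_u8(0x20);
const uint8x16_t backslash   = vdupq_n_u8('\\');
const uint8x16_t dblquote    = vdupq_n_u8('"');

while (pos + 16 < len) {
    /* Load 16 bytes and mark every byte that is a control character,
     * a backslash, or a double quote. */
    uint8x16_t chunk = vld1q_u8((const uint8_t *)&ptr[pos]);
    uint8x16_t needs_escape = vorrq_u8(vcltq_u8(chunk, lower_bound),
                                       vorrq_u8(vceqq_u8(chunk, backslash),
                                                vceqq_u8(chunk, dblquote)));
    if (vmaxvq_u8(needs_escape) == 0) {
        /* Nothing to escape in this chunk, skip it entirely. */
        pos += 16;
        continue;
    }
    /* Otherwise fall back to the scalar escape_table path for this chunk. */
    break;
}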

Benchmarks

My machine is an M1 MacBook Air with 16 GB of RAM.

% ruby -v
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]

I do not have YJIT built.

Baseline: code in master (at the time the fork was created)

% ruby -Ilib:ext benchmark/encoder-simple.rb 
== Encoding long string (124001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
                json     1.465k i/100ms
                  oj     2.089k i/100ms
Calculating -------------------------------------
                json     14.440k (± 2.5%) i/s   (69.25 μs/i) -     73.250k in   5.075975s
                  oj     20.504k (± 1.9%) i/s   (48.77 μs/i) -    104.450k in   5.096055s

Comparison:
                json:    14439.8 i/s
                  oj:    20503.7 i/s - 1.42x  faster

With SIMD

Benchmark encoding a long string.

% ruby -Ilib:ext benchmark/encoder-simple.rb
== Encoding long string (124001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
                json     3.721k i/100ms
                  oj     2.006k i/100ms
Calculating -------------------------------------
                json     39.497k (± 3.9%) i/s   (25.32 μs/i) -    200.934k in   5.094877s
                  oj     20.486k (± 2.6%) i/s   (48.81 μs/i) -    104.312k in   5.095360s

Comparison:
                json:    39497.0 i/s
                  oj:    20486.2 i/s - 1.93x  slower

Before and After (using the before & after script)

== Encoding long string (124001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
               after     4.386k i/100ms
Calculating -------------------------------------
               after     42.103k (± 4.4%) i/s   (23.75 μs/i) -    210.528k in   5.010512s

Comparison:
              before:    14901.3 i/s
               after:    42103.4 i/s - 2.83x  faster

Existing benchmarks

== Encoding small mixed (34 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
                json   431.030k i/100ms
                  oj   425.858k i/100ms
Calculating -------------------------------------
                json      4.443M (± 0.8%) i/s  (225.05 ns/i) -     22.414M in   5.044440s
                  oj      4.240M (± 1.2%) i/s  (235.86 ns/i) -     21.293M in   5.022778s

Comparison:
                json:  4443492.7 i/s
                  oj:  4239867.2 i/s - 1.05x  slower


== Encoding small nested array (121 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
                json   202.481k i/100ms
                  oj   173.531k i/100ms
Calculating -------------------------------------
                json      2.060M (± 3.4%) i/s  (485.48 ns/i) -     10.327M in   5.020528s
                  oj      1.740M (± 1.7%) i/s  (574.79 ns/i) -      8.850M in   5.088416s

Comparison:
                json:  2059819.6 i/s
                  oj:  1739779.8 i/s - 1.18x  slower


== Encoding small hash (65 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
                json   440.390k i/100ms
                  oj   485.263k i/100ms
Calculating -------------------------------------
                json      4.432M (± 0.9%) i/s  (225.61 ns/i) -     22.460M in   5.067690s
                  oj      4.834M (± 1.0%) i/s  (206.86 ns/i) -     24.263M in   5.019614s

Comparison:
                json:  4432356.6 i/s
                  oj:  4834123.7 i/s - 1.09x  faster


== Encoding mixed utf8 (5003001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
                json    79.000 i/100ms
                  oj    36.000 i/100ms
Calculating -------------------------------------
                json    671.096 (±11.8%) i/s    (1.49 ms/i) -      3.318k in   5.004342s
                  oj    350.462 (± 4.3%) i/s    (2.85 ms/i) -      1.764k in   5.042945s

Comparison:
                json:      671.1 i/s
                  oj:      350.5 i/s - 1.91x  slower


== Encoding mostly utf8 (5001001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
                json    58.000 i/100ms
                  oj    35.000 i/100ms
Calculating -------------------------------------
                json    657.830 (±18.7%) i/s    (1.52 ms/i) -      3.190k in   5.014119s
                  oj    346.778 (± 3.5%) i/s    (2.88 ms/i) -      1.750k in   5.052975s

Comparison:
                json:      657.8 i/s
                  oj:      346.8 i/s - 1.90x  slower


== Encoding integers (8009 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
                json     7.817k i/100ms
                  oj     7.297k i/100ms
Calculating -------------------------------------
                json     76.183k (± 6.1%) i/s   (13.13 μs/i) -    383.033k in   5.050410s
                  oj     72.630k (± 2.4%) i/s   (13.77 μs/i) -    364.850k in   5.026326s

Comparison:
                json:    76183.3 i/s
                  oj:    72630.4 i/s - same-ish: difference falls within error


== Encoding activitypub.json (52595 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
                json     2.292k i/100ms
                  oj     1.506k i/100ms
Calculating -------------------------------------
                json     22.532k (± 2.9%) i/s   (44.38 μs/i) -    114.600k in   5.090538s
                  oj     15.440k (± 2.2%) i/s   (64.77 μs/i) -     78.312k in   5.074521s

Comparison:
                json:    22531.7 i/s
                  oj:    15440.0 i/s - 1.46x  slower


== Encoding citm_catalog.json (500298 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
                json   109.000 i/100ms
                  oj    91.000 i/100ms
Calculating -------------------------------------
                json      1.102k (± 1.8%) i/s  (907.47 μs/i) -      5.559k in   5.046287s
                  oj    927.507 (± 1.7%) i/s    (1.08 ms/i) -      4.641k in   5.005234s

Comparison:
                json:     1102.0 i/s
                  oj:      927.5 i/s - 1.19x  slower


== Encoding twitter.json (466906 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
                json   220.000 i/100ms
                  oj   185.000 i/100ms
Calculating -------------------------------------
                json      2.229k (± 2.5%) i/s  (448.54 μs/i) -     11.220k in   5.035750s
                  oj      1.842k (± 3.4%) i/s  (542.90 μs/i) -      9.250k in   5.027607s

Comparison:
                json:     2229.5 i/s
                  oj:     1842.0 i/s - 1.21x  slower


== Encoding canada.json (2090234 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
                json     1.000 i/100ms
                  oj     1.000 i/100ms
Calculating -------------------------------------
                json     11.450 (± 0.0%) i/s   (87.33 ms/i) -     58.000 in   5.066321s
                  oj     11.135 (± 0.0%) i/s   (89.81 ms/i) -     56.000 in   5.029813s

Comparison:
                json:       11.5 i/s
                  oj:       11.1 i/s - 1.03x  slower


== Encoding many #to_json calls (2701 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [arm64-darwin24]
Warming up --------------------------------------
                json     2.434k i/100ms
                  oj     2.105k i/100ms
Calculating -------------------------------------
                json     24.448k (± 0.8%) i/s   (40.90 μs/i) -    124.134k in   5.077829s
                  oj     20.934k (± 0.9%) i/s   (47.77 μs/i) -    105.250k in   5.028205s

Comparison:
                json:    24447.9 i/s
                  oj:    20933.5 i/s - 1.17x  slower

@byroot
Member

byroot commented Jan 17, 2025

Seems quite a big deal on some benchmarks:

== Encoding integers (8009 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +YJIT +PRISM [arm64-darwin23]
Warming up --------------------------------------
               after     9.553k i/100ms
Calculating -------------------------------------
               after     96.345k (± 0.5%) i/s   (10.38 μs/i) -    487.203k in   5.056999s

Comparison:
              before:    97143.7 i/s
               after:    96344.5 i/s - same-ish: difference falls within error


== Encoding activitypub.json (52595 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +YJIT +PRISM [arm64-darwin23]
Warming up --------------------------------------
               after     2.797k i/100ms
Calculating -------------------------------------
               after     28.500k (± 1.4%) i/s   (35.09 μs/i) -    142.647k in   5.006212s

Comparison:
              before:    21655.6 i/s
               after:    28499.8 i/s - 1.32x  faster


== Encoding citm_catalog.json (500298 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +YJIT +PRISM [arm64-darwin23]
Warming up --------------------------------------
               after   134.000 i/100ms
Calculating -------------------------------------
               after      1.356k (± 0.7%) i/s  (737.44 μs/i) -      6.834k in   5.039867s

Comparison:
              before:     1369.0 i/s
               after:     1356.0 i/s - same-ish: difference falls within error


== Encoding twitter.json (466906 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +YJIT +PRISM [arm64-darwin23]
Warming up --------------------------------------
               after   277.000 i/100ms
Calculating -------------------------------------
               after      2.773k (± 0.5%) i/s  (360.57 μs/i) -     14.127k in   5.093951s

Comparison:
              before:     2407.9 i/s
               after:     2773.4 i/s - 1.15x  faster

I was initially not very enthusiastic about SIMD because we need runtime detection for it to be viable, and that seemed like a lot of work, but I very recently found https://github.com/abetlen/simdinfo, which is simple enough, so it could be worth trying.

Comment on lines 246 to 248
const uint8x16_t lower_bound = vdupq_n_u8(32);
const uint8x16_t backslash = vdupq_n_u8(92);
const uint8x16_t dblquote = vdupq_n_u8(34);
Member

Suggested change
-const uint8x16_t lower_bound = vdupq_n_u8(32);
-const uint8x16_t backslash = vdupq_n_u8(92);
-const uint8x16_t dblquote = vdupq_n_u8(34);
+const uint8x16_t lower_bound = vdupq_n_u8(' ');
+const uint8x16_t backslash = vdupq_n_u8('\\');
+const uint8x16_t dblquote = vdupq_n_u8('\"');

Presumably we can use characters here.

const uint8x16_t dblquote = vdupq_n_u8(34);

while (pos+16 < len) {
uint8x16_t chunk = vld1q_u8((const uint8_t*)&ptr[pos]);
Member

Do you think you could add a few comments?

I'm usually of the opinion that only the why should be commented, not the how, but since SIMD is not a super well known API, I think this is a good exception.

Author

Absolutely! I've been iterating pretty quickly on various versions. I'll add some documentation, as I think I'm relatively happy with the code at the moment.


invalid = vorrq_u8(invalid, has_escaped_char);

if (vmaxvq_u8(invalid) == 0) {
Member

Presumably we could check the number of leading zeros to find the exact position of the first character we need to escape, no?
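
For reference, one common way to get that position on AArch64 (a sketch of the general technique, not code from this PR) is to collapse the 0x00/0xFF byte mask into a 64-bit value, one nibble per input byte, and count trailing zeros:

static inline int first_escape_index(uint8x16_t needs_escape)
{
    /* Narrow the 128-bit mask to 64 bits: each input byte contributes one nibble. */
    uint8x8_t nibble_mask = vshrn_n_u16(vreinterpretq_u16_u8(needs_escape), 4);
    uint64_t bits = vget_lane_u64(vreinterpret_u64_u8(nibble_mask), 0);
    if (bits == 0) return -1;           /* nothing to escape in this chunk */
    return __builtin_ctzll(bits) >> 2;  /* 4 bits per byte, so divide by 4 */
}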

tmp = vorrq_u8(tmp, vandq_u8(has_dblquote, vdupq_n_u8(0x4)));

uint8_t arr[16];
vst1q_u8(arr, tmp);
Member

Ah actually, I suppose that's what I was thinking about. It tells you the position of the backslashes and quotes.

@byroot
Member

byroot commented Jan 17, 2025

Note to self: before merging any of that we should set up ARM64 CI: https://github.blog/changelog/2024-09-03-github-actions-arm64-linux-and-windows-runners-are-now-generally-available/

@byroot
Member

byroot commented Jan 17, 2025

I very recently found https://github.com/abetlen/simdinfo, which is simple enough

Another possibility is https://gcc.gnu.org/wiki/FunctionMultiVersioning, but I'm not too sure what the support is like in various compilers. Since ruby/json is included in Ruby, it's tested against some pretty old and exotic compilers.

@byroot
Member

byroot commented Jan 17, 2025

Another possibility is https://gcc.gnu.org/wiki/FunctionMultiVersioning,

Nevermind, it's limited to C++ on i386 arch, so basically unusable.

Comment on lines 267 to 272
uint8x16_t tmp = vandq_u8(too_low, vdupq_n_u8(0x1));
tmp = vorrq_u8(tmp, vandq_u8(has_backslash, vdupq_n_u8(0x2)));
tmp = vorrq_u8(tmp, vandq_u8(has_dblquote, vdupq_n_u8(0x4)));

uint8_t arr[16];
vst1q_u8(arr, tmp);
Member

I tried removing this part and simply falling back to the lookup table for the 16B chunk, and got pretty much the same performance, which isn't that surprising: the biggest gain is from being able to quickly scan the happy path.

Author

Turns out this was unnecessary anyway. It didn't matter which case was hit if the character needed to be escaped. I removed the chain of vorrq_u8's this morning.


if (vmaxvq_u8(invalid) == 0) {
pos += 16;
continue;
Member

I think we could get a lot of performance here by directly writing the NEON register into the destination.

Because ultimately we'll call MEMCPY, which will load the same bytes into a NEON register too, so it would be way more efficient to do it immediately.

I'm not too sure which intrinsic is used for that; it should be the inverse of vld1q_u8.

Author

vst1q_u8 is the opposite of vld1q_u8. If we know the capacity of the buffer is large enough, this should be a win.

Member

I was trying that a few minutes ago:

diff --git a/ext/json/ext/generator/generator.c b/ext/json/ext/generator/generator.c
index 3112e89..fd7b948 100644
--- a/ext/json/ext/generator/generator.c
+++ b/ext/json/ext/generator/generator.c
@@ -260,7 +260,12 @@ static void convert_UTF8_to_JSON(FBuffer *out_buffer, VALUE str)
         invalid = vorrq_u8(invalid, has_escaped_char);
 
         if (vmaxvq_u8(invalid) == 0) {
+            FLUSH_POS(0);
+            fbuffer_inc_capa(out_buffer, 16);
+            vst1q_u8((uint8_t*)(out_buffer->ptr + out_buffer->len), chunk);
+            out_buffer->len += 16;
             pos += 16;
+            beg = pos;
             continue;
         }

But if anything it is very slightly slower, which is surprising, but I may be doing something wrong. I'm on very spotty airport wifi, so it's not easy to look at docs, etc.

That said, I also see there are ways to load/store up to 4 128-bit registers, so trying to exploit that too may help,
and perhaps that's how MEMCPY is still competitive.
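
For reference, the multi-register loads/stores being referred to look roughly like this (a sketch; it assumes the vld1q_u8_x4/vst1q_u8_x4 intrinsics, which need a reasonably recent compiler):

/* Copy 64 clean bytes per iteration instead of 16. */
uint8x16x4_t chunks = vld1q_u8_x4((const uint8_t *)&ptr[pos]);
/* ... check all four registers for bytes that need escaping ... */
vst1q_u8_x4((uint8_t *)(out_buffer->ptr + out_buffer->len), chunks);
out_buffer->len += 64;
pos += 64;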

…8_to_JSON. It doesn't matter which case is hit when a byte needs to be escaped. In that case, remove the vorr_q chain and simply use the combined 'needs_escape' vector.
} else { \
pos++; \
}

static void convert_UTF8_to_ASCII_only_JSON(FBuffer *out_buffer, VALUE str, const unsigned char escape_table[256])
Member

FYI: you could entirely ignore convert_UTF8_to_ASCII_only_JSON; it's a very niche feature, not worth the extra complexity.

@samyron
Author

samyron commented Jan 18, 2025

Apologies about the missing extconf.h; I forgot to stash that change. I'm currently trying to use extconf.rb to detect NEON (with other implementations coming soon). Additionally, I'm trying to add configure flags so we can enable/disable this feature.

@byroot
Member

byroot commented Jan 18, 2025

No worries.

Comment on lines 295 to 297
uint8x16_t too_low = vcltq_u8(chunk, lower_bound);
uint8x16_t has_backslash = vceqq_u8(chunk, backslash);
uint8x16_t has_dblquote = vceqq_u8(chunk, dblquote);
Member

I think it might be possible to check for both too_low and has_doublequote with a single vqtbl4q_u8, because we're searching for < 32, 34 and 92.

The first two can be handled with a lookup table (covering 0 -> 64), and for the backslash we can just do a vceqq_u8.

Assuming vqtbl4q_u8 and vceqq_u8 have similar-ish performance (no idea), that would save two instructions.
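
For what it's worth, a sketch of that idea (my own illustration, not code from this PR). vqtbl4q_u8 returns 0 for out-of-range indices, so bytes >= 64 automatically map to "no escape":

/* 64-entry table: 0xFF for indices 0..31 (control characters) and 34 ('"'),
 * 0 everywhere else. The range initializer is a GCC/clang extension. */
static const uint8_t escape_lut[64] = {
    [0 ... 31] = 0xFF,
    [34]       = 0xFF,
};

uint8x16x4_t table;
table.val[0] = vld1q_u8(escape_lut +  0);
table.val[1] = vld1q_u8(escape_lut + 16);
table.val[2] = vld1q_u8(escape_lut + 32);
table.val[3] = vld1q_u8(escape_lut + 48);

/* Bytes 0..63 index the table; bytes >= 64 are out of range and yield 0. */
uint8x16_t needs_escape = vorrq_u8(vqtbl4q_u8(table, chunk),
                                   vceqq_u8(chunk, vdupq_n_u8('\\')));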

Author

That is on the list of things I'd like to try. I can't find the reference now, but somewhere in the Neon reference (I believe) it hinted that table lookups can be more expensive than a series of comparison instructions. It's definitely worth a test though.

…d flags. These can be set with the JSON_GENERATOR_CONFIGURE_OPTS environment variable prior to running rake. Additionally, set the stage for different SIMD implementations.
@@ -86,7 +86,7 @@ end

file EXT_GENERATOR_DL => EXT_GENERATOR_SRC do
cd EXT_GENERATOR_DIR do
-ruby 'extconf.rb'
+ruby "extconf.rb #{ENV['JSON_GENERATOR_CONFIGURE_OPTS']}"
Author

There is probably a better way to accomplish this.

Member

Yes, also I cleaned up the Rakefile the other day, so this will cause a merge conflict.

@@ -0,0 +1,8 @@
#ifndef EXTCONF_H
Author

This file gets generated by mkmf. It probably shouldn't be checked in...

Member

Yes it shouldn't be committed.


#define SIMD_VEC_STRIDE 16

#define simd_vec_type uint8x16_t
Author

Hopefully these help readability a bit.

Member

Honestly I prefer uint8x16_t.

Author

That's fair. I did this to be able to support different platforms in a generic way. This likely comes at the cost of efficiency, as there may be better platform-specific algorithms based on the provided ISA.

I can remove this and implement the multiple paths in the generator if you'd prefer.
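
For readers, the kind of abstraction being discussed looks roughly like this (the NEON values match the PR; the AVX2 branch is my own illustrative assumption):

#if defined(__ARM_NEON) || defined(__ARM_NEON__)
  #include <arm_neon.h>
  #define SIMD_VEC_STRIDE 16
  #define simd_vec_type uint8x16_t
#elif defined(__AVX2__)
  #include <immintrin.h>
  #define SIMD_VEC_STRIDE 32
  #define simd_vec_type __m256i
#endif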

}
SRC
$defs.push("-DENABLE_SIMD")
append_cflags('-mavx2')
Author

This is likely not a safe assumption.

Member

Yes. ruby/json is shipped with Ruby, which means it will likely be distributed as a precompiled binary.

What we should try to do is to use the various __target__("arch=...") attributes to compile SIMD enabled versions of some functions, and then use something like https://github.com/abetlen/simdinfo to dispatch at runtime.

That's really the tricky part and blocker.

Member

https://gcc.gnu.org/onlinedocs/gcc/Common-Function-Attributes.html#index-target-function-attribute

int sse3_func (void) __attribute__ ((__target__ ("sse3")));

Seems to be the syntax, and I think clang supports it.
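
Roughly, the pattern being discussed would look like this (a sketch with hypothetical function names; __builtin_cpu_supports is GCC/clang and x86-only):

/* Compiled with AVX2 codegen regardless of the global -m flags. */
__attribute__((target("avx2")))
static void convert_UTF8_to_JSON_avx2(FBuffer *out_buffer, VALUE str);

/* Plain portable fallback. */
static void convert_UTF8_to_JSON_scalar(FBuffer *out_buffer, VALUE str);

static void (*convert_UTF8_to_JSON_impl)(FBuffer *, VALUE);

static void init_simd_dispatch(void)
{
    /* Pick the implementation once, based on the CPU we are running on. */
    convert_UTF8_to_JSON_impl = __builtin_cpu_supports("avx2")
        ? convert_UTF8_to_JSON_avx2
        : convert_UTF8_to_JSON_scalar;
}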

Author

Looks like both clang and gcc support this idea, differently, of course...

However, I'm working on samyron#1 to refactor to use runtime SIMD detection. I also specialized the SSE version a bit. See the link for a few benchmarks.


#ifdef HAVE_X86INTRIN_H
#include <x86intrin.h>

Author

Everything in this section is likely not optimal. SSE/AVX/AVX2 do not seem to have great support for unsigned bytes.
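
For reference, the usual workaround (a sketch, assuming SSE2): there is no unsigned byte compare, but _mm_min_epu8 plus _mm_cmpeq_epi8 gives an unsigned "less than or equal" test:

#include <stdint.h>
#include <emmintrin.h>

/* Returns a 16-bit mask; bit i is set when byte i needs escaping. */
static inline int needs_escape_mask_sse2(const uint8_t *p)
{
    __m128i chunk     = _mm_loadu_si128((const __m128i *)p);
    /* min(chunk, 0x1F) == chunk  <=>  chunk <= 0x1F (unsigned). */
    __m128i too_low   = _mm_cmpeq_epi8(_mm_min_epu8(chunk, _mm_set1_epi8(0x1F)), chunk);
    __m128i backslash = _mm_cmpeq_epi8(chunk, _mm_set1_epi8('\\'));
    __m128i dblquote  = _mm_cmpeq_epi8(chunk, _mm_set1_epi8('"'));
    __m128i bad       = _mm_or_si128(too_low, _mm_or_si128(backslash, dblquote));
    return _mm_movemask_epi8(bad);
}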

@samyron samyron changed the title WIP ARM Neon SIMD implementation. WIP ARM Neon/AVX2 SIMD implementation. Jan 19, 2025
* Initial support of runtime SIMD detection.

* Work in progress implementation of dynamic dispatch.

* More clean up.

* Added support for AVX2 and quite a bit of refactoring.

* __GNU_C__ => __GNUC__

* Major refactoring to reduce the amount of duplicate code between all of the SIMD implementations. Also added a few tests with more characters to ensure the SIMD implementations are tested.

* Moved the definition of convert_UTF8_to_JSON_simd.
@samyron
Author

samyron commented Jan 24, 2025

Apologies for going dark for a few days. I've been experimenting with a lookup-table-based approach. I have this implemented in SSE4.2 and have benchmarks included on that PR. I would expect an ARM Neon implementation to be fairly straightforward, as Neon provides table lookup instructions directly in its ISA.

The current state of this PR:

  1. Implements SIMD kernels: ARM Neon, AVX2, and SSE4.2 (I need to check the instruction tables again, as it might work for SSE2).
  2. Determines which kernel to use at runtime for GCC and clang.

* shuffle, gather, blend, etc. instructions that may perform significantly
* better than what is implemented here.
*/
while (pos+32 < len) {
Member

Something I wonder is whether we actually have to make sure not to read past the end of the string.

I can't find a clear source on it, but I somewhat remember that it's actually legal to read past the end in some conditions?

Looking at Ruby's search_nonascii, which is some form of SIMD, it doesn't seem to concern itself with the end of the string; however, it does concern itself with alignment: https://github.com/ruby/ruby/blob/96a5da67864a15eea7b79e552c7684ddd182f76c/string.c#L671-L748

Author

I had some alignment-checking code a few iterations ago. I think I removed it when I switched the code to using unaligned memory access instructions. However, I think adding it back and using aligned memory operations might be faster (not sure it'll be measurable, but still likely worth doing).
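
As a sketch of what that could look like (my assumption, not code from the PR): handle bytes one at a time until the read pointer is 16-byte aligned, after which the main loop can use aligned loads (on x86 that would be _mm_load_si128 instead of _mm_loadu_si128; AArch64 has no separate aligned-load intrinsic, but aligned access can still be cheaper):

/* Scalar prologue: advance until ptr + pos is 16-byte aligned. */
while (pos < len && ((uintptr_t)(ptr + pos) & 15) != 0) {
    /* existing escape_table handling for ptr[pos] goes here */
    pos++;
}
/* Main SIMD loop starts here with an aligned pointer. */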

Comment on lines +620 to +626
switch(ch_len) {
case 0x1:
case 0x2:
case 0x4:
case 0x8:
ch_len = 9;
break;
Member

I don't get why we bother having a different ch_len for each condition we found when in the end we change them all to 9.

Might as well have just a boolean, which should allow merging all 4 registers into one with some ORs.
It would also allow packing them back into a uint32_t and using count-leading (or trailing) zero instructions to find the position.

Author

I really should benchmark this code specifically. This is all to try to save a memory access of script_safe_escape_table[ch]. It might not save enough (or anything).

@byroot
Member

byroot commented Jan 24, 2025

Thanks for all your work on this, I'll try to share my thoughts a bit:

I think I'll try to take baby steps. At first I'd like to release a version that makes very small use of SIMD, to validate that users don't run into compilation issues or other runtime issues; basically, validate that the compilation and runtime detection facilities work. (I'm not asking you to simplify your PR, just saying that I will probably end up cherry-picking things from it at first.)

Since ruby/json is mirrored into Ruby, it will be compiled with tons of different compilers, on lots of weird architectures and OS. So we probably should expect some nasty bug reports.

Then I need to figure out some proper X86 development environment for myself, otherwise it's not really viable. I have a beefy x86 Windows machine, so I can probably run some VM on it.

Once a version is out using some small amount of SIMD and it has been ironed out, I think we'll progressively be able to use it in more places.

@samyron
Author

samyron commented Jan 24, 2025

You're welcome. This has been a fun challenge. No pressure to ship any of this. Please feel free to take as much or as little as will help.

I will continue to make the code more robust. I want to add a bunch of longer test cases to ensure all of the SIMD code is hit. Additionally, I want to see if I can clean up all of the #ifdefs.

@samyron
Author

samyron commented Jan 24, 2025

After working on the lookup-table-based approach last night, I had a thought: "What if I unrolled the convert_UTF8_to_JSON loop and checked if any values were non-zero?" This is basically what the SSE4.2 lookup table is doing...

I tried unrolling by 4 and by 8, and both had similar benchmark results.

while (pos+8 < len) {
        uint8_t c1 = ptr[pos   ];
        uint8_t c2 = ptr[pos + 1];
        uint8_t c3 = ptr[pos + 2];
        uint8_t c4 = ptr[pos + 3];
        uint8_t c5 = ptr[pos + 4];
        uint8_t c6 = ptr[pos + 5];
        uint8_t c7 = ptr[pos + 6];
        uint8_t c8 = ptr[pos + 7];

        unsigned char ch_len1 = escape_table[c1];
        unsigned char ch_len2 = escape_table[c2];
        unsigned char ch_len3 = escape_table[c3];
        unsigned char ch_len4 = escape_table[c4];
        unsigned char ch_len5 = escape_table[c5];
        unsigned char ch_len6 = escape_table[c6];
        unsigned char ch_len7 = escape_table[c7];
        unsigned char ch_len8 = escape_table[c8];

        if ((ch_len1 | ch_len2 | ch_len3 | ch_len4 | ch_len5 | ch_len6 | ch_len7 | ch_len8) == 0) {
            pos += 8;
            continue;
        }

        unsigned char r[8] = {ch_len1, ch_len2, ch_len3, ch_len4, ch_len5, ch_len6, ch_len7, ch_len8};

        for (int i = 0; i < 8; ) {
            unsigned long start = pos;
            unsigned char ch = ptr[pos];
            unsigned char ch_len = r[i];
            // <snip> The switch statement processing each byte.
                i += (pos - start);
            } else {
                pos++;
                i++;
            }
        }
    }

Code in master

== Encoding mixed utf8 (5003001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [x86_64-linux]
Warming up --------------------------------------
                json    34.000 i/100ms
          json_coder    34.000 i/100ms
                  oj    20.000 i/100ms
Calculating -------------------------------------
                json    342.357 (± 4.1%) i/s    (2.92 ms/i) -      1.734k in   5.075924s
          json_coder    342.794 (± 2.3%) i/s    (2.92 ms/i) -      1.734k in   5.061579s
                  oj    204.881 (± 2.4%) i/s    (4.88 ms/i) -      1.040k in   5.078798s

Comparison:
                json:      342.4 i/s
          json_coder:      342.8 i/s - same-ish: difference falls within error
                  oj:      204.9 i/s - 1.67x  slower


== Encoding mostly utf8 (5001001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [x86_64-linux]
Warming up --------------------------------------
                json    34.000 i/100ms
          json_coder    32.000 i/100ms
                  oj    19.000 i/100ms
Calculating -------------------------------------
                json    321.234 (± 5.0%) i/s    (3.11 ms/i) -      1.632k in   5.093962s
          json_coder    331.010 (± 3.6%) i/s    (3.02 ms/i) -      1.664k in   5.033434s
                  oj    182.694 (± 6.0%) i/s    (5.47 ms/i) -    912.000 in   5.010977s

Comparison:
                json:      321.2 i/s
          json_coder:      331.0 i/s - same-ish: difference falls within error
                  oj:      182.7 i/s - 1.76x  slower

Unrolled (code above)

== Encoding mixed utf8 (5003001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [x86_64-linux]
Warming up --------------------------------------
                json    54.000 i/100ms
          json_coder    51.000 i/100ms
                  oj    20.000 i/100ms
Calculating -------------------------------------
                json    530.726 (± 4.9%) i/s    (1.88 ms/i) -      2.646k in   4.999979s
          json_coder    524.238 (± 3.8%) i/s    (1.91 ms/i) -      2.652k in   5.066895s
                  oj    194.875 (± 5.1%) i/s    (5.13 ms/i) -    980.000 in   5.042017s

Comparison:
                json:      530.7 i/s
          json_coder:      524.2 i/s - same-ish: difference falls within error
                  oj:      194.9 i/s - 2.72x  slower


== Encoding mostly utf8 (5001001 bytes)
ruby 3.4.1 (2024-12-25 revision 48d4efcb85) +PRISM [x86_64-linux]
Warming up --------------------------------------
                json    52.000 i/100ms
          json_coder    52.000 i/100ms
                  oj    18.000 i/100ms
Calculating -------------------------------------
                json    544.152 (± 2.8%) i/s    (1.84 ms/i) -      2.756k in   5.068995s
          json_coder    547.908 (± 2.9%) i/s    (1.83 ms/i) -      2.756k in   5.034590s
                  oj    194.882 (± 2.1%) i/s    (5.13 ms/i) -    990.000 in   5.082430s

Comparison:
                json:      544.2 i/s
          json_coder:      547.9 i/s - same-ish: difference falls within error
                  oj:      194.9 i/s - 2.79x  slower

These are probably the best-case benchmarks, though. They're both really long strings being copied from one buffer to another.

@samyron
Author

samyron commented Jan 27, 2025

I've created a PR in my fork which uses bit twiddling for some significant speedups. Again, this sort of re-implements the rules of JSON parsing instead of using lookup tables to determine which action to take. This effectively inlines two memchr calls and a < 0x20 test. The speedups are impressive on both my M1 Macbook Air and Intel-based Laptop.

Just another option to consider.
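
For readers, the kind of bit twiddling being referred to is the classic word-at-a-time byte test (a generic sketch of the technique, not the exact code in that PR):

#include <stdint.h>
#include <string.h>

#define SWAR_ONES  0x0101010101010101ULL
#define SWAR_HIGHS 0x8080808080808080ULL

/* Nonzero if any byte of x is zero. */
#define HAS_ZERO(x)     (((x) - SWAR_ONES) & ~(x) & SWAR_HIGHS)
/* Nonzero if any byte of x equals n. */
#define HAS_VALUE(x, n) HAS_ZERO((x) ^ (SWAR_ONES * (n)))
/* Nonzero if any byte of x is less than n (valid for n <= 128). */
#define HAS_LESS(x, n)  (((x) - SWAR_ONES * (n)) & ~(x) & SWAR_HIGHS)

/* Returns nonzero if any of the 8 bytes at p needs JSON escaping. */
static inline uint64_t needs_escape_word(const unsigned char *p)
{
    uint64_t w;
    memcpy(&w, p, sizeof(w));     /* safe unaligned 8-byte load */
    return HAS_LESS(w, 0x20) | HAS_VALUE(w, '"') | HAS_VALUE(w, '\\');
}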
