Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Added branchless Math.Min calculation #24

Open
wants to merge 1 commit into
base: main
Choose a base branch
from
Open

Conversation

Turnerj
Copy link
Owner

@Turnerj Turnerj commented Jun 19, 2020

Based on this tweet: https://mobile.twitter.com/badamczewski01/status/1273759277707132928

Results

|                 Method |       Runtime | NumberOfCharacters |               Mean |               Error |             StdDev | Ratio | Speedup | Worthiness | RatioSD | Code Size |      Gen 0 |      Gen 1 |     Gen 2 |   Allocated |
|----------------------- |-------------- |------------------- |-------------------:|--------------------:|-------------------:|------:|--------:|-----------:|--------:|----------:|-----------:|-----------:|----------:|------------:|
|               Baseline |    .NET 4.7.2 |                  0 |         220.189 ns |          24.1378 ns |          1.3231 ns | 1.000 |    1.00 |       1.00 |    0.00 |    2557 B |     0.1018 |          - |         - |       321 B |
|          Quickenshtein |    .NET 4.7.2 |                  0 |           2.583 ns |           0.1834 ns |          0.0101 ns | 0.012 |   85.24 |   1,095.27 |    0.00 |     199 B |          - |          - |         - |           - |
| Quickenshtein_Threaded |    .NET 4.7.2 |                  0 |           1.790 ns |           0.5424 ns |          0.0297 ns | 0.008 |  123.05 |   1,757.70 |    0.00 |     179 B |          - |          - |         - |           - |
|           Fastenshtein |    .NET 4.7.2 |                  0 |           2.489 ns |           1.4295 ns |          0.0784 ns | 0.011 |   88.51 |     689.97 |    0.00 |     328 B |          - |          - |         - |           - |
|               Baseline | .NET Core 3.0 |                  0 |         112.910 ns |           4.6300 ns |          0.2538 ns | 0.513 |    1.95 |       2.57 |    0.00 |    1938 B |     0.0764 |          - |         - |       240 B |
|          Quickenshtein | .NET Core 3.0 |                  0 |           4.067 ns |           0.1389 ns |          0.0076 ns | 0.018 |   54.14 |     538.69 |    0.00 |     257 B |          - |          - |         - |           - |
| Quickenshtein_Threaded | .NET Core 3.0 |                  0 |           2.127 ns |           0.7267 ns |          0.0398 ns | 0.010 |  103.53 |   1,423.33 |    0.00 |     186 B |          - |          - |         - |           - |
|           Fastenshtein | .NET Core 3.0 |                  0 |           2.396 ns |           0.6282 ns |          0.0344 ns | 0.011 |   91.92 |     714.39 |    0.00 |     329 B |          - |          - |         - |           - |
|                        |               |                    |                    |                     |                    |       |         |            |         |           |            |            |           |             |
|               Baseline |    .NET 4.7.2 |                 10 |       1,252.135 ns |         120.7927 ns |          6.6211 ns |  1.00 |    1.00 |       1.00 |    0.00 |    2557 B |     0.4463 |          - |         - |      1404 B |
|          Quickenshtein |    .NET 4.7.2 |                 10 |         260.836 ns |          16.0352 ns |          0.8789 ns |  0.21 |    4.80 |       2.23 |    0.00 |    5501 B |          - |          - |         - |           - |
| Quickenshtein_Threaded |    .NET 4.7.2 |                 10 |         262.059 ns |          14.9552 ns |          0.8197 ns |  0.21 |    4.78 |       2.08 |    0.00 |    5878 B |          - |          - |         - |           - |
|           Fastenshtein |    .NET 4.7.2 |                 10 |         258.354 ns |          38.7140 ns |          2.1220 ns |  0.21 |    4.85 |      37.79 |    0.00 |     328 B |     0.0200 |          - |         - |        64 B |
|               Baseline | .NET Core 3.0 |                 10 |         749.766 ns |          34.1184 ns |          1.8701 ns |  0.60 |    1.67 |       2.20 |    0.00 |    1938 B |     0.3443 |          - |         - |      1080 B |
|          Quickenshtein | .NET Core 3.0 |                 10 |         369.429 ns |          22.4171 ns |          1.2288 ns |  0.30 |    3.39 |       1.91 |    0.00 |    4536 B |          - |          - |         - |           - |
| Quickenshtein_Threaded | .NET Core 3.0 |                 10 |         352.946 ns |          12.0771 ns |          0.6620 ns |  0.28 |    3.55 |       1.70 |    0.00 |    5335 B |          - |          - |         - |           - |
|           Fastenshtein | .NET Core 3.0 |                 10 |         237.126 ns |          23.6289 ns |          1.2952 ns |  0.19 |    5.28 |      41.04 |    0.00 |     329 B |     0.0200 |          - |         - |        64 B |
|                        |               |                    |                    |                     |                    |       |         |            |         |           |            |            |           |             |
|               Baseline |    .NET 4.7.2 |                400 |     889,130.599 ns |      66,393.8252 ns |      3,639.2696 ns |  1.00 |    1.00 |       1.00 |    0.00 |    2557 B |   142.5781 |    59.5703 |         - |    668278 B |
|          Quickenshtein |    .NET 4.7.2 |                400 |     250,343.913 ns |      21,465.3674 ns |      1,176.5892 ns |  0.28 |    3.55 |       1.65 |    0.00 |    5501 B |          - |          - |         - |           - |
| Quickenshtein_Threaded |    .NET 4.7.2 |                400 |     250,369.938 ns |       8,883.9636 ns |        486.9600 ns |  0.28 |    3.55 |       1.54 |    0.00 |    5878 B |          - |          - |         - |           - |
|           Fastenshtein |    .NET 4.7.2 |                400 |     438,383.122 ns |      39,211.2402 ns |      2,149.3004 ns |  0.49 |    2.03 |      15.81 |    0.00 |     328 B |     0.4883 |          - |         - |      1633 B |
|               Baseline | .NET Core 3.0 |                400 |     865,893.945 ns |      72,297.3258 ns |      3,962.8604 ns |  0.97 |    1.03 |       1.35 |    0.00 |    1938 B |   121.0938 |    60.5469 |         - |    657840 B |
|          Quickenshtein | .NET Core 3.0 |                400 |      78,144.973 ns |       6,586.3546 ns |        361.0203 ns |  0.09 |   11.38 |       6.41 |    0.00 |    4536 B |          - |          - |         - |           - |
| Quickenshtein_Threaded | .NET Core 3.0 |                400 |      78,146.643 ns |      11,238.2866 ns |        616.0084 ns |  0.09 |   11.38 |       5.06 |    0.00 |    5754 B |          - |          - |         - |           - |
|           Fastenshtein | .NET Core 3.0 |                400 |     357,707.454 ns |      18,138.3426 ns |        994.2238 ns |  0.40 |    2.49 |      19.32 |    0.00 |     329 B |     0.4883 |          - |         - |      1624 B |
|                        |               |                    |                    |                     |                    |       |         |            |         |           |            |            |           |             |
|               Baseline |    .NET 4.7.2 |               8000 | 393,884,266.667 ns |  49,006,820.6381 ns |  2,686,229.2053 ns |  1.00 |    1.00 |       1.00 |    0.00 |    2557 B | 44000.0000 | 23000.0000 | 4000.0000 | 256683568 B |
|          Quickenshtein |    .NET 4.7.2 |               8000 | 100,370,120.000 ns |  13,370,648.7190 ns |    732,890.3735 ns |  0.25 |    3.92 |       1.82 |    0.00 |    5501 B |          - |          - |         - |           - |
| Quickenshtein_Threaded |    .NET 4.7.2 |               8000 |  26,717,654.167 ns |  36,141,920.5933 ns |  1,981,060.6232 ns |  0.07 |   14.80 |       6.83 |    0.01 |    5538 B |          - |          - |         - |      1024 B |
|           Fastenshtein |    .NET 4.7.2 |               8000 | 159,648,433.333 ns |  28,111,703.1594 ns |  1,540,897.3089 ns |  0.41 |    2.47 |      19.23 |    0.00 |     328 B |          - |          - |         - |     32048 B |
|               Baseline | .NET Core 3.0 |               8000 | 433,869,533.333 ns | 248,847,936.6442 ns | 13,640,195.1072 ns |  1.10 |    0.91 |       1.28 |    0.04 |    1812 B | 44000.0000 | 23000.0000 | 4000.0000 | 256352240 B |
|          Quickenshtein | .NET Core 3.0 |               8000 |  24,698,026.042 ns |   2,775,654.4912 ns |    152,142.9887 ns |  0.06 |   15.95 |       8.99 |    0.00 |    4536 B |          - |          - |         - |        14 B |
| Quickenshtein_Threaded | .NET Core 3.0 |               8000 |  24,912,576.042 ns |   2,509,390.7247 ns |    137,548.1732 ns |  0.06 |   15.81 |       7.05 |    0.00 |    5731 B |          - |          - |         - |           - |
|           Fastenshtein | .NET Core 3.0 |               8000 | 137,326,183.333 ns |  18,542,076.7650 ns |  1,016,353.7949 ns |  0.35 |    2.87 |      22.29 |    0.00 |     329 B |          - |          - |         - |     32024 B |

Compared to Master
.NET Framework - 10 Characters: 20% slower
.NET Framework - 400 Characters: 2.7x faster
.NET Framework - 8000 Characters: 1.27x faster
.NET Framework (Threaded) - 8000 Characters: 1.53x faster

Note: When running the "Huge Text Benchmark", performance tanks - not sure why and may not even relate to the change:

|                      Method |        Job |       Runtime | NumberOfCharacters |        Mean |     Error |    StdDev | Ratio | Speedup | Worthiness | RatioSD | Code Size | Gen 0 | Gen 1 | Gen 2 | Allocated |
|---------------------------- |----------- |-------------- |------------------- |------------:|----------:|----------:|------:|--------:|-----------:|--------:|----------:|------:|------:|------:|----------:|
|   Quickenshtein_NoThreading | Job-VQPBMC |    .NET 4.7.2 |               8192 |   132.62 ms |  0.713 ms |  0.596 ms |  1.00 |    1.00 |       1.00 |    0.00 |    5501 B |     - |     - |     - |         - |
|   Quickenshtein_NoThreading | Job-LKMQCA | .NET Core 3.0 |               8192 |    22.46 ms |  0.168 ms |  0.157 ms |  0.17 |    5.90 |       7.15 |    0.00 |    4536 B |     - |     - |     - |         - |
|                             |            |               |                    |             |           |           |       |         |            |         |           |       |       |       |           |
| Quickenshtein_WithThreading | Job-VQPBMC |    .NET 4.7.2 |               8192 |    31.17 ms |  0.621 ms |  1.428 ms |  1.00 |    1.00 |       1.00 |    0.00 |    5523 B |     - |     - |     - |    1024 B |
| Quickenshtein_WithThreading | Job-LKMQCA | .NET Core 3.0 |               8192 |    17.73 ms |  0.208 ms |  0.194 ms |  0.58 |    1.73 |       0.90 |    0.02 |   10611 B |     - |     - |     - |     800 B |
|                             |            |               |                    |             |           |           |       |         |            |         |           |       |       |       |           |
|                Fastenshtein | Job-VQPBMC |    .NET 4.7.2 |               8192 |   167.37 ms |  1.858 ms |  1.552 ms |  1.00 |    1.00 |       1.00 |    0.00 |     328 B |     - |     - |     - |   32816 B |
|                Fastenshtein | Job-LKMQCA | .NET Core 3.0 |               8192 |   166.87 ms |  0.518 ms |  0.433 ms |  1.00 |    1.00 |       1.00 |    0.01 |     329 B |     - |     - |     - |   33126 B |
|                             |            |               |                    |             |           |           |       |         |            |         |           |       |       |       |           |
|   Quickenshtein_NoThreading | Job-VQPBMC |    .NET 4.7.2 |              32768 | 2,146.58 ms | 10.799 ms |  9.573 ms |  1.00 |    1.00 |       1.00 |    0.00 |    5501 B |     - |     - |     - |         - |
|   Quickenshtein_NoThreading | Job-LKMQCA | .NET Core 3.0 |              32768 |   365.22 ms |  3.426 ms |  3.205 ms |  0.17 |    5.89 |      11.04 |    0.00 |    2933 B |     - |     - |     - |    1344 B |
|                             |            |               |                    |             |           |           |       |         |            |         |           |       |       |       |           |
| Quickenshtein_WithThreading | Job-VQPBMC |    .NET 4.7.2 |              32768 |   449.54 ms |  3.217 ms |  2.512 ms |  1.00 |    1.00 |       1.00 |    0.00 |    5523 B |     - |     - |     - |    8192 B |
| Quickenshtein_WithThreading | Job-LKMQCA | .NET Core 3.0 |              32768 |   273.89 ms |  1.678 ms |  1.487 ms |  0.61 |    1.64 |       0.86 |    0.00 |   10577 B |     - |     - |     - |     800 B |
|                             |            |               |                    |             |           |           |       |         |            |         |           |       |       |       |           |
|                Fastenshtein | Job-VQPBMC |    .NET 4.7.2 |              32768 | 2,703.89 ms | 27.841 ms | 24.680 ms |  1.00 |    1.00 |       1.00 |    0.00 |     328 B |     - |     - |     - |  131096 B |
|                Fastenshtein | Job-LKMQCA | .NET Core 3.0 |              32768 | 2,478.46 ms | 35.907 ms | 33.588 ms |  0.92 |    1.09 |       1.05 |    0.02 |     340 B |     - |     - |     - |  131096 B |

@Turnerj Turnerj added the enhancement New feature or request label Jun 19, 2020
@Turnerj
Copy link
Owner Author

Turnerj commented Nov 30, 2021

Giving this another crack now...

.NET Framework Baseline

BenchmarkDotNet=v0.13.1, OS=Windows 10.0.19043.1348 (21H1/May2021Update)
Intel Core i5-6600K CPU 3.50GHz (Skylake), 1 CPU, 4 logical and 4 physical cores
.NET SDK=6.0.100
  [Host]    : .NET 6.0.0 (6.0.21.52210), X64 RyuJIT
  Framework : .NET Framework 4.8 (4.8.4420.0), X64 RyuJIT

Job=Framework  Runtime=.NET Framework 4.7.2
Method NumberOfCharacters Mean Error StdDev Op/s CacheMisses/Op BranchMispredictions/Op Code Size Allocated
Quickenshtein 10 190.9 ns 0.28 ns 0.26 ns 5,237,715.05 0 0 4,211 B -
Quickenshtein 400 473,919.8 ns 743.11 ns 695.10 ns 2,110.06 235 70,856 4,211 B -
Quickenshtein 8000 73,353,386.7 ns 872,388.31 ns 816,032.56 ns 13.63 39,673 4,212,053 4,211 B -

@Turnerj
Copy link
Owner Author

Turnerj commented Nov 30, 2021

With Branchless Min

Method NumberOfCharacters Mean Error StdDev Op/s CacheMisses/Op BranchMispredictions/Op Code Size Allocated
Quickenshtein 10 219.2 ns 0.23 ns 0.20 ns 4,562,704.02 0 0 6,007 B -
Quickenshtein 400 159,471.9 ns 2,843.29 ns 2,659.62 ns 6,270.70 72 1,251 6,007 B -
Quickenshtein 8000 62,462,987.4 ns 279,845.42 ns 261,767.58 ns 16.01 31,858 202,191 6,007 B -

@Turnerj
Copy link
Owner Author

Turnerj commented Nov 30, 2021

Branchless Min but without Outer Loop Unrolling

Method NumberOfCharacters Mean Error StdDev Op/s BranchMispredictions/Op CacheMisses/Op Code Size Allocated
Quickenshtein 10 212.6 ns 0.28 ns 0.26 ns 4,704,467.28 0 0 2,432 B -
Quickenshtein 400 178,371.5 ns 216.77 ns 202.77 ns 5,606.28 717 64 2,432 B -
Quickenshtein 8000 70,299,671.7 ns 119,200.72 ns 111,500.43 ns 14.22 29,559 29,559 2,432 B -

@Turnerj
Copy link
Owner Author

Turnerj commented Dec 2, 2021

Using SIMD in .NET Framework via Vectors

In defense of the code - it doesn't have ShiftRightLogical128BitLane so my hacked implementation isn't likely optimal.

Method NumberOfCharacters Mean Error StdDev Op/s CacheMisses/Op BranchMispredictions/Op Code Size Allocated
Quickenshtein 10 591.7 ns 0.63 ns 0.59 ns 1,689,997.326 0 0 2,846 B -
Quickenshtein 400 745,208.6 ns 1,060.84 ns 992.31 ns 1,341.906 309 800 2,846 B -
Quickenshtein 8000 288,599,473.3 ns 380,012.54 ns 355,463.96 ns 3.465 132,710 60,894 2,846 B -

@Turnerj
Copy link
Owner Author

Turnerj commented Dec 2, 2021

I tested the branchless code again for huge text and there really isn't any performance benefit which I find kinda strange.
While there are a lot less branch mispredictions, the time taken to process has stayed the same.

Before

Method NumberOfCharacters Mean Error StdDev Op/s Code Size CacheMisses/Op BranchMispredictions/Op Allocated
Quickenshtein 16000 252.2 ms 2.83 ms 2.65 ms 3.965 4,211 B 196,881 8,602,283 -
Quickenshtein 32000 934.9 ms 11.87 ms 11.10 ms 1.070 4,211 B 709,427 19,651,243 -

After

Method NumberOfCharacters Mean Error StdDev Op/s Code Size CacheMisses/Op BranchMispredictions/Op Allocated
Quickenshtein 16000 248.5 ms 0.34 ms 0.32 ms 4.025 6,007 B 177,948 484,147 -
Quickenshtein 32000 991.4 ms 1.49 ms 1.39 ms 1.009 6,007 B 776,602 1,241,361 -

@Turnerj
Copy link
Owner Author

Turnerj commented Dec 2, 2021

For my own curiousity, I wanted to see the scale of how the performance degraded for larger strings:

Method NumberOfCharacters Mean Error StdDev Op/s Code Size BranchMispredictions/Op CacheMisses/Op Allocated
Normal 2000 7.459 ms 0.0457 ms 0.0428 ms 134.073 4,211 B 846,460 3,979 -
Normal 4000 22.967 ms 0.1766 ms 0.1652 ms 43.541 4,211 B 1,983,300 12,143 -
Normal 8000 72.782 ms 0.9586 ms 0.8967 ms 13.740 4,211 B 4,094,440 43,769 -
Normal 12000 149.794 ms 2.5139 ms 2.2285 ms 6.676 4,211 B 6,540,698 92,092 -
Normal 16000 251.899 ms 3.3323 ms 3.1170 ms 3.970 4,211 B 8,565,009 174,399 -
Method NumberOfCharacters Mean Error StdDev Op/s Code Size BranchMispredictions/Op CacheMisses/Op Allocated
PR 2000 4.018 ms 0.0777 ms 0.0983 ms 248.874 6,007 B 39,919 1,970 -
PR 4000 15.846 ms 0.2252 ms 0.2106 ms 63.108 6,007 B 103,287 7,415 -
PR 8000 62.540 ms 0.3767 ms 0.3524 ms 15.990 6,007 B 227,908 32,119 -
PR 12000 140.117 ms 0.4266 ms 0.3782 ms 7.137 6,007 B 362,223 80,896 -
PR 16000 248.454 ms 0.4744 ms 0.4438 ms 4.025 6,007 B 493,522 135,441 -

This translates to the PR being...
2000 = 1.86x faster
4000 = 1.45x faster
8000 = 1.16x faster
12000 = 1.07x faster
16000 = 1.01x faster

Not exactly ideal!

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging this pull request may close these issues.

1 participant