[SYSTEMDS-???] SchemaApply Performance Tests #1869

Baunsgaard · 2023-07-24T13:50:10Z

This PR adds code to compare schema compression with our standard compression.
Early results indicate that applying a schema is much faster for the currently supported schemas.

In the following tables, Update Scheme takes the compression scheme from one of the previously compressed blocks and updates it so that the newly given blocks can be compressed.
Apply Scheme takes the scheme and applies it to incoming uncompressed MatrixBlocks.
Update & Apply Scheme does both.
From Empty Update & Apply Scheme starts from an empty DDC single-column scheme (one group per column), materializes a new scheme for each given block, and then applies it to that MatrixBlock, returning a compressed block.
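
The four variants described above can be illustrated with a minimal single-column sketch in Python. This is not the SystemDS implementation or its API: `ColumnScheme`, `update`, `apply`, `update_and_apply`, and `from_empty_update_and_apply` are hypothetical names, and the scheme is assumed to be nothing more than a value-to-code dictionary in the spirit of dense dictionary coding (DDC).

```python
from typing import Dict, List


class ColumnScheme:
    """Hypothetical single-column dictionary scheme (illustration only)."""

    def __init__(self) -> None:
        # Maps each distinct value seen so far to a small integer code.
        self.codes: Dict[float, int] = {}

    def update(self, column: List[float]) -> "ColumnScheme":
        # Update: extend the dictionary so every value in the block is encodable.
        for value in column:
            if value not in self.codes:
                self.codes[value] = len(self.codes)
        return self

    def apply(self, column: List[float]) -> List[int]:
        # Apply: encode an uncompressed block with the existing dictionary.
        # Raises KeyError if the block holds a value the scheme has not seen.
        return [self.codes[value] for value in column]

    def update_and_apply(self, column: List[float]) -> List[int]:
        # Update & Apply: make the scheme fit the block, then encode it.
        return self.update(column).apply(column)


def from_empty_update_and_apply(column: List[float]) -> List[int]:
    # From Empty: build a fresh scheme for this block, then encode it.
    return ColumnScheme().update_and_apply(column)


scheme = ColumnScheme().update([1.0, 2.0, 1.0])
print(scheme.apply([2.0, 1.0]))             # prints [1, 0]
print(scheme.update_and_apply([3.0, 1.0]))  # prints [2, 0]
```

In this sketch, applying alone is a pure lookup, while updating has to inspect every value for novelty. That is one plausible reason the two steps are timed separately in the tables.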

Scaling the number of unique values:

          SchemaTest  Repetitions: 100 rand(1000, 1000, 1, 1.0) Seed: 42
                 Sum Task -- Warmup,               [3.615, 3.789, 3.868, 4.063, 10.875],           
          Compress Normal 10 blocks,      [315.406, 320.236, 322.712, 325.224, 576.060],           
                      Update Scheme,               [0.641, 0.685, 0.727, 0.839, 28.160],           
                       Apply Scheme,                [0.001, 0.002, 0.002, 0.016, 0.123],           
              Update & Apply Scheme,                [0.651, 0.687, 0.711, 0.790, 2.026],           
   From Empty Update & Apply Scheme,           [16.994, 18.789, 19.555, 20.463, 36.124],           
          SchemaTest  Repetitions: 100 rand(1000, 1000, 2, 1.0) Seed: 42
                 Sum Task -- Warmup,                [3.602, 3.740, 3.806, 3.896, 6.162],           
          Compress Normal 10 blocks,          [56.112, 57.636, 57.688, 58.840, 259.717],           
                      Update Scheme,           [12.895, 13.360, 13.807, 14.775, 42.019],           
                       Apply Scheme,           [18.653, 19.696, 20.242, 20.479, 61.259],           
              Update & Apply Scheme,           [30.568, 32.189, 33.011, 34.520, 67.393],           
   From Empty Update & Apply Scheme,           [30.954, 32.023, 32.419, 33.195, 66.499],           
          SchemaTest  Repetitions: 100 rand(1000, 1000, 4, 1.0) Seed: 42
                 Sum Task -- Warmup,               [3.603, 3.789, 3.855, 3.898, 10.970],           
          Compress Normal 10 blocks,          [45.412, 47.698, 49.687, 50.207, 301.089],           
                      Update Scheme,           [17.774, 18.715, 19.329, 20.105, 54.371],           
                       Apply Scheme,           [18.380, 19.649, 20.470, 22.015, 58.429],           
              Update & Apply Scheme,          [32.330, 34.054, 35.110, 36.817, 109.912],           
   From Empty Update & Apply Scheme,           [21.964, 23.331, 23.737, 24.239, 47.941],           
          SchemaTest  Repetitions: 100 rand(1000, 1000, 8, 1.0) Seed: 42
                 Sum Task -- Warmup,               [3.566, 3.833, 3.903, 4.052, 13.631],           
          Compress Normal 10 blocks,          [57.973, 59.841, 65.434, 68.550, 231.100],           
                      Update Scheme,           [17.687, 19.650, 20.650, 21.923, 42.305],           
                       Apply Scheme,           [18.694, 20.381, 21.318, 22.833, 36.688],           
              Update & Apply Scheme,           [33.496, 35.713, 36.317, 38.844, 87.031],           
   From Empty Update & Apply Scheme,          [29.735, 31.057, 31.683, 32.681, 104.674],           
          SchemaTest  Repetitions: 100 rand(1000, 1000, 16, 1.0) Seed: 42
                 Sum Task -- Warmup,                [3.648, 3.832, 3.873, 3.953, 8.658],           
          Compress Normal 10 blocks,          [46.191, 47.782, 48.734, 49.146, 207.689],           
                      Update Scheme,           [14.298, 15.248, 15.669, 16.347, 43.943],           
                       Apply Scheme,           [12.059, 13.158, 13.662, 14.496, 35.420],           
              Update & Apply Scheme,           [26.971, 30.733, 31.230, 32.385, 74.089],           
   From Empty Update & Apply Scheme,          [39.140, 41.103, 42.154, 42.686, 102.477],           
          SchemaTest  Repetitions: 100 rand(1000, 1000, 32, 1.0) Seed: 42
                 Sum Task -- Warmup,               [3.576, 3.741, 3.804, 4.132, 13.777],           
          Compress Normal 10 blocks,          [78.753, 80.527, 81.096, 81.122, 225.091],           
                      Update Scheme,           [16.839, 18.050, 18.472, 18.996, 24.060],           
                       Apply Scheme,           [13.436, 14.956, 15.647, 16.289, 32.928],           
              Update & Apply Scheme,           [29.979, 31.771, 33.240, 33.991, 98.724],           
   From Empty Update & Apply Scheme,          [53.602, 58.724, 59.901, 60.409, 130.987],           
          SchemaTest  Repetitions: 100 rand(1000, 1000, 64, 1.0) Seed: 42
                 Sum Task -- Warmup,               [3.600, 3.855, 3.997, 4.534, 12.875],           
          Compress Normal 10 blocks,          [99.184, 99.643, 99.910, 99.974, 271.286],           
                      Update Scheme,           [16.941, 18.505, 18.938, 19.277, 47.379],           
                       Apply Scheme,           [15.055, 16.136, 16.628, 17.252, 54.517],           
              Update & Apply Scheme,           [30.287, 33.366, 34.128, 35.208, 97.795],           
   From Empty Update & Apply Scheme,          [70.528, 72.287, 73.163, 73.559, 161.978],

Scaling the number of columns:

          SchemaTest  Repetitions: 100 rand(1000, 1, 32, 1.0) Seed: 42
                 Sum Task -- Warmup,                [0.067, 0.087, 0.124, 0.154, 9.823],           
          Compress Normal 10 blocks,              [1.508, 1.540, 1.649, 1.720, 116.049],           
                      Update Scheme,                [0.054, 0.121, 0.177, 0.233, 0.530],           
                       Apply Scheme,                [0.090, 0.154, 0.313, 0.408, 0.851],           
              Update & Apply Scheme,                [0.083, 0.134, 0.138, 0.148, 0.575],           
   From Empty Update & Apply Scheme,                [0.202, 0.225, 0.254, 0.263, 9.700],           
          SchemaTest  Repetitions: 100 rand(1000, 10, 32, 1.0) Seed: 42
                 Sum Task -- Warmup,                [0.182, 0.202, 0.214, 0.337, 0.707],           
          Compress Normal 10 blocks,             [6.051, 7.294, 12.022, 13.442, 35.295],           
                      Update Scheme,                [0.081, 0.380, 0.384, 0.397, 0.570],           
                       Apply Scheme,                [0.337, 0.351, 0.360, 0.385, 0.581],           
              Update & Apply Scheme,                [0.694, 0.716, 0.724, 0.738, 1.086],           
   From Empty Update & Apply Scheme,                [1.702, 1.727, 1.869, 1.963, 2.251],           
          SchemaTest  Repetitions: 100 rand(1000, 100, 32, 1.0) Seed: 42
                 Sum Task -- Warmup,                [1.093, 1.506, 1.527, 1.639, 1.829],           
          Compress Normal 10 blocks,            [8.636, 12.370, 44.013, 51.628, 68.924],           
                      Update Scheme,                [0.892, 3.599, 3.769, 4.093, 4.498],           
                       Apply Scheme,                [3.112, 3.367, 3.493, 3.829, 7.980],           
              Update & Apply Scheme,               [4.004, 6.603, 6.927, 7.143, 13.654],           
   From Empty Update & Apply Scheme,               [3.804, 3.999, 4.067, 4.762, 22.411],           
          SchemaTest  Repetitions: 100 rand(1000, 1000, 32, 1.0) Seed: 42
                 Sum Task -- Warmup,               [3.608, 3.867, 4.056, 5.260, 12.220],           
          Compress Normal 10 blocks,          [70.259, 72.494, 74.850, 76.022, 165.416],           
                      Update Scheme,           [13.331, 13.997, 14.522, 15.124, 40.080],           
                       Apply Scheme,             [9.139, 9.610, 10.026, 10.510, 35.005],           
              Update & Apply Scheme,           [22.860, 24.072, 24.813, 25.692, 87.585],           
   From Empty Update & Apply Scheme,           [45.950, 47.302, 47.617, 48.151, 90.668],

Baunsgaard · 2023-07-24T17:10:29Z

Improved column combining when the different statistics have equal cost (for instance, when the columns are constant).
Before:

          Compress Normal 10 blocks,      [315.406, 320.236, 322.712, 325.224, 576.060],

After:

          Compress Normal 10 blocks,      [106.500, 116.676, 122.512, 164.084, 433.367],

Baunsgaard · 2023-07-25T09:30:37Z

And a shortcut optimization:

Compress Normal 10 blocks,          [16.764, 18.058, 18.841, 19.047, 148.369],

Baunsgaard · 2023-07-26T12:33:45Z

Initial in-memory performance test of compression and processing speed.
The results show that we can update and apply a compression scheme at a high throughput.
The numbers give the input throughput and the output throughput.
For example, Sum is performed at 7.94 GB/s of input and produces output at 8.33 KB/s.

           WriteTest  Repetitions: 100 ConstMatrix 1000, 1000, 1.0)
      Warmup Sum task Single Thread, [   0.443,    0.462,    0.479,    0.567,   26.832],   7.94 GB/s In,   8.33 KB/s Out
          Warmup Sum task Parallel , [   0.422,    0.441,    0.511,    0.561,    3.409],  14.52 GB/s In,  15.22 KB/s Out
Compression In Memory Single Thread, [  55.041,   56.349,   56.940,   58.169,  232.053], 128.03 MB/s In,  22.66 MB/s Out
Compression In Memory Parallel     , [  15.457,   33.681,   47.115,   49.519,   54.410], 189.31 MB/s In,  33.51 MB/s Out
              Update & Apply Scheme, [  18.761,   20.947,   21.263,   21.495,   39.948], 358.98 MB/s In,  63.54 MB/s Out
                       Apply Scheme, [   8.682,    8.751,    8.815,    8.928,   11.203], 857.06 MB/s In, 151.71 MB/s Out
     Update & Apply Scheme Parallel, [   3.099,    3.567,    3.652,    3.759,    8.134],   2.02 GB/s In, 366.32 MB/s Out
              Apply Scheme Parallel, [   1.204,    1.513,    1.561,    1.600,    4.642],   4.72 GB/s In, 856.16 MB/s Out
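
The In and Out columns report bytes consumed and bytes produced per unit of time. A minimal sketch of that calculation follows. The function names are hypothetical, the input size assumes a dense 1000 x 1000 block of 8-byte doubles with any header overhead ignored, the 1024-based units are an assumption about the log format, and the 1 ms timing is an illustrative value rather than a figure from the table.

```python
def throughput_bytes_per_s(num_bytes: int, elapsed_ms: float) -> float:
    """Bytes processed per second for one operation that took elapsed_ms."""
    return num_bytes * 1000.0 / elapsed_ms


def human(rate: float) -> str:
    """Format a rate with 1024-based units (an assumption about the log format)."""
    for unit in ("B/s", "KB/s", "MB/s"):
        if rate < 1024:
            return f"{rate:.2f} {unit}"
        rate /= 1024
    return f"{rate:.2f} GB/s"


# Assumed sizes: dense 1000 x 1000 doubles in, a single double out.
in_bytes = 1000 * 1000 * 8
out_bytes = 8
elapsed_ms = 1.0  # illustrative timing, not taken from the table

print(human(throughput_bytes_per_s(in_bytes, elapsed_ms)), "In")
print(human(throughput_bytes_per_s(out_bytes, elapsed_ms)), "Out")
```

Under these assumptions the same elapsed time divides a large input and a tiny result, which is presumably why the Sum rows show GB/s on the input side but only KB/s on the output side.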

FYI @mboehm7, do you think this is the right direction?

Follow-up tests will confirm whether this also holds when writing to disk.

Baunsgaard · 2023-07-31T20:03:56Z

Image showing the performance at different numbers of unique values.

Baunsgaard · 2023-08-01T12:16:15Z

On So010, with a 1000 x 1000 matrix and 48 threads in use, matrix-vector multiplication processes 40 +- 8 GB/s of input, against a peak memory bandwidth of 200 GB/s:

           WriteTest  Repetitions: 1000 ConstMatrix ( Rows:1000, Cols:1000, Spar:1.0, Unique: 32)
                                Sum,    0.401+- 0.045 ms,  19974765027+-  1877895561 Byte/s,       399488+-       37557 Byte/s
                            MV mult,    0.182+- 0.036 ms,  43905190543+-  8386160115 Byte/s,     44738539+-     8545335 Byte/s
                Update&Apply Scheme,    3.529+- 0.186 ms,   2267055252+-   121334592 Byte/s,    401301962+-    21477999 Byte/s
          Update&Apply Scheme Fused,    1.627+- 0.056 ms,   4916655958+-   173706513 Byte/s,    870320068+-    30748595 Byte/s
                       Apply Scheme,    1.836+- 0.052 ms,   4358519905+-   125944181 Byte/s,    771521818+-    22293964 Byte/s
            Update&Apply from Empty,    1.467+- 0.086 ms,   5451830796+-   298396674 Byte/s,     27356767+-     1497326 Byte/s
                 Normal Compression,   19.648+- 2.180 ms,    407180261+-    47363150 Byte/s,     72076866+-     8383971 Byte/s
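
To relate rows of this table to each other or to the machine's peak memory bandwidth, the `mean+- std` fields can be parsed directly. The sketch below is an illustration only: `parse_row`, `utilization`, and `speedup` are hypothetical helpers, the column order (name, time, input rate, output rate) is assumed from the layout, and the peak is taken as 200e9 Byte/s, which assumes decimal gigabytes.

```python
from typing import NamedTuple


class Row(NamedTuple):
    name: str
    time_ms: float
    in_bytes_per_s: float
    out_bytes_per_s: float


def parse_row(line: str) -> Row:
    """Parse one 'name, mean+- std ms, mean+- std Byte/s, mean+- std Byte/s' row."""
    name, time, inp, out = [field.strip() for field in line.split(",")]

    def mean(field: str) -> float:
        # Keep only the mean, i.e. the number before the '+-' separator.
        return float(field.split("+-")[0])

    return Row(name, mean(time), mean(inp), mean(out))


def utilization(row: Row, peak_bytes_per_s: float = 200e9) -> float:
    """Fraction of the assumed peak memory bandwidth used on the input side."""
    return row.in_bytes_per_s / peak_bytes_per_s


def speedup(fast: Row, baseline: Row) -> float:
    """Ratio of input throughputs between two rows."""
    return fast.in_bytes_per_s / baseline.in_bytes_per_s


mv = parse_row("MV mult,    0.182+- 0.036 ms,  43905190543+-  8386160115 Byte/s,"
               "     44738539+-     8545335 Byte/s")
apply_row = parse_row("Apply Scheme,    1.836+- 0.052 ms,   4358519905+-   125944181 Byte/s,"
                      "    771521818+-    22293964 Byte/s")
normal = parse_row("Normal Compression,   19.648+- 2.180 ms,    407180261+-    47363150 Byte/s,"
                   "     72076866+-     8383971 Byte/s")

print(f"MV mult uses {utilization(mv):.1%} of the assumed peak")
print(f"Apply Scheme vs Normal Compression: {speedup(apply_row, normal):.1f}x input throughput")
```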

This commit extends the performance JAR to measure internal functions. Specifically, it adds functionality to measure the bandwidth utilization of individual operations.

Baunsgaard force-pushed the SchemaApplyPerf branch 4 times, most recently from 965aaa7 to abebe83 · August 8, 2023 13:05

[SYSTEMDS-3601] SchemaApplyPerfTests

1bddf49

This commit extends the performance JAR to measure internal functions. Specifically, it adds functionality to measure the bandwidth utilization of individual operations.

Baunsgaard force-pushed the SchemaApplyPerf branch from abebe83 to 1bddf49 · August 8, 2023 13:18

Baunsgaard closed this in d5a49a1 Aug 8, 2023
