[SYSTEMDS-???] SchemaApply Performance Tests #1869

Closed
wants to merge 1 commit into from


Baunsgaard
Contributor

This PR adds code to compare schema-based compression with our standard compression.
There are good indications that applying a schema is much faster than standard compression for the currently supported schemas.

In the following tables, Update Scheme takes the compression scheme from one of the previously compressed blocks and updates it to enable compression of the newly given blocks.
Apply Scheme takes the scheme and applies it to incoming uncompressed MatrixBlocks.
Update & Apply Scheme does both.
From Empty starts from an empty DDC single-column scheme (one group per column), materializes a new scheme for each given block, and then applies it to that MatrixBlock, returning a compressed block.
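The update and apply steps described above can be illustrated with a minimal sketch of a dictionary scheme for a single column. This is not the SystemDS implementation: `ColumnSchemeSketch` and its methods are hypothetical names, and the encoding (one integer code per distinct value) is only an assumption about what a DDC-style scheme does in its simplest form.

```java
import java.util.Arrays;
import java.util.HashMap;
import java.util.Map;

/** Hypothetical single-column dictionary scheme; an illustration, not the SystemDS API. */
final class ColumnSchemeSketch {
    // Maps each distinct value seen so far to a dense integer code.
    private final Map<Double, Integer> codes = new HashMap<>();

    /** Update: extend the dictionary so that every value in the column has a code. */
    void update(double[] column) {
        for (double v : column)
            codes.putIfAbsent(v, codes.size());
    }

    /** Apply: encode the column with the existing dictionary; null if a value is unknown. */
    int[] apply(double[] column) {
        int[] out = new int[column.length];
        for (int i = 0; i < column.length; i++) {
            Integer c = codes.get(column[i]);
            if (c == null)
                return null; // the scheme does not cover this block, an update is needed first
            out[i] = c;
        }
        return out;
    }

    /** Update and apply in sequence, as in the "Update & Apply Scheme" rows. */
    int[] updateAndApply(double[] column) {
        update(column);
        return apply(column);
    }

    int numDistinct() {
        return codes.size();
    }

    public static void main(String[] args) {
        ColumnSchemeSketch scheme = new ColumnSchemeSketch(); // "From Empty" starting point
        System.out.println(Arrays.toString(scheme.updateAndApply(new double[] {1.0, 2.0, 1.0})));
        // Apply alone only succeeds when every value is already in the dictionary.
        System.out.println(scheme.apply(new double[] {3.0}) == null);
    }
}
```

In this sketch, apply alone is a pure lookup, which is why it can be cheap when the scheme already covers the incoming block, while update has to inspect every value for new dictionary entries.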

Scaling the number of unique values:

 SchemaTest  Repetitions: 100 rand(1000, 1000, 1, 1.0) Seed: 42
                 Sum Task -- Warmup,               [3.615, 3.789, 3.868, 4.063, 10.875],           
          Compress Normal 10 blocks,      [315.406, 320.236, 322.712, 325.224, 576.060],           
                      Update Scheme,               [0.641, 0.685, 0.727, 0.839, 28.160],           
                       Apply Scheme,                [0.001, 0.002, 0.002, 0.016, 0.123],           
              Update & Apply Scheme,                [0.651, 0.687, 0.711, 0.790, 2.026],           
   From Empty Update & Apply Scheme,           [16.994, 18.789, 19.555, 20.463, 36.124],           
          SchemaTest  Repetitions: 100 rand(1000, 1000, 2, 1.0) Seed: 42
                 Sum Task -- Warmup,                [3.602, 3.740, 3.806, 3.896, 6.162],           
          Compress Normal 10 blocks,          [56.112, 57.636, 57.688, 58.840, 259.717],           
                      Update Scheme,           [12.895, 13.360, 13.807, 14.775, 42.019],           
                       Apply Scheme,           [18.653, 19.696, 20.242, 20.479, 61.259],           
              Update & Apply Scheme,           [30.568, 32.189, 33.011, 34.520, 67.393],           
   From Empty Update & Apply Scheme,           [30.954, 32.023, 32.419, 33.195, 66.499],           
          SchemaTest  Repetitions: 100 rand(1000, 1000, 4, 1.0) Seed: 42
                 Sum Task -- Warmup,               [3.603, 3.789, 3.855, 3.898, 10.970],           
          Compress Normal 10 blocks,          [45.412, 47.698, 49.687, 50.207, 301.089],           
                      Update Scheme,           [17.774, 18.715, 19.329, 20.105, 54.371],           
                       Apply Scheme,           [18.380, 19.649, 20.470, 22.015, 58.429],           
              Update & Apply Scheme,          [32.330, 34.054, 35.110, 36.817, 109.912],           
   From Empty Update & Apply Scheme,           [21.964, 23.331, 23.737, 24.239, 47.941],           
          SchemaTest  Repetitions: 100 rand(1000, 1000, 8, 1.0) Seed: 42
                 Sum Task -- Warmup,               [3.566, 3.833, 3.903, 4.052, 13.631],           
          Compress Normal 10 blocks,          [57.973, 59.841, 65.434, 68.550, 231.100],           
                      Update Scheme,           [17.687, 19.650, 20.650, 21.923, 42.305],           
                       Apply Scheme,           [18.694, 20.381, 21.318, 22.833, 36.688],           
              Update & Apply Scheme,           [33.496, 35.713, 36.317, 38.844, 87.031],           
   From Empty Update & Apply Scheme,          [29.735, 31.057, 31.683, 32.681, 104.674],           
          SchemaTest  Repetitions: 100 rand(1000, 1000, 16, 1.0) Seed: 42
                 Sum Task -- Warmup,                [3.648, 3.832, 3.873, 3.953, 8.658],           
          Compress Normal 10 blocks,          [46.191, 47.782, 48.734, 49.146, 207.689],           
                      Update Scheme,           [14.298, 15.248, 15.669, 16.347, 43.943],           
                       Apply Scheme,           [12.059, 13.158, 13.662, 14.496, 35.420],           
              Update & Apply Scheme,           [26.971, 30.733, 31.230, 32.385, 74.089],           
   From Empty Update & Apply Scheme,          [39.140, 41.103, 42.154, 42.686, 102.477],           
          SchemaTest  Repetitions: 100 rand(1000, 1000, 32, 1.0) Seed: 42
                 Sum Task -- Warmup,               [3.576, 3.741, 3.804, 4.132, 13.777],           
          Compress Normal 10 blocks,          [78.753, 80.527, 81.096, 81.122, 225.091],           
                      Update Scheme,           [16.839, 18.050, 18.472, 18.996, 24.060],           
                       Apply Scheme,           [13.436, 14.956, 15.647, 16.289, 32.928],           
              Update & Apply Scheme,           [29.979, 31.771, 33.240, 33.991, 98.724],           
   From Empty Update & Apply Scheme,          [53.602, 58.724, 59.901, 60.409, 130.987],           
          SchemaTest  Repetitions: 100 rand(1000, 1000, 64, 1.0) Seed: 42
                 Sum Task -- Warmup,               [3.600, 3.855, 3.997, 4.534, 12.875],           
          Compress Normal 10 blocks,          [99.184, 99.643, 99.910, 99.974, 271.286],           
                      Update Scheme,           [16.941, 18.505, 18.938, 19.277, 47.379],           
                       Apply Scheme,           [15.055, 16.136, 16.628, 17.252, 54.517],           
              Update & Apply Scheme,           [30.287, 33.366, 34.128, 35.208, 97.795],           
   From Empty Update & Apply Scheme,          [70.528, 72.287, 73.163, 73.559, 161.978], 

Scaling the number of columns:

          SchemaTest  Repetitions: 100 rand(1000, 1, 32, 1.0) Seed: 42
                 Sum Task -- Warmup,                [0.067, 0.087, 0.124, 0.154, 9.823],           
          Compress Normal 10 blocks,              [1.508, 1.540, 1.649, 1.720, 116.049],           
                      Update Scheme,                [0.054, 0.121, 0.177, 0.233, 0.530],           
                       Apply Scheme,                [0.090, 0.154, 0.313, 0.408, 0.851],           
              Update & Apply Scheme,                [0.083, 0.134, 0.138, 0.148, 0.575],           
   From Empty Update & Apply Scheme,                [0.202, 0.225, 0.254, 0.263, 9.700],           
          SchemaTest  Repetitions: 100 rand(1000, 10, 32, 1.0) Seed: 42
                 Sum Task -- Warmup,                [0.182, 0.202, 0.214, 0.337, 0.707],           
          Compress Normal 10 blocks,             [6.051, 7.294, 12.022, 13.442, 35.295],           
                      Update Scheme,                [0.081, 0.380, 0.384, 0.397, 0.570],           
                       Apply Scheme,                [0.337, 0.351, 0.360, 0.385, 0.581],           
              Update & Apply Scheme,                [0.694, 0.716, 0.724, 0.738, 1.086],           
   From Empty Update & Apply Scheme,                [1.702, 1.727, 1.869, 1.963, 2.251],           
          SchemaTest  Repetitions: 100 rand(1000, 100, 32, 1.0) Seed: 42
                 Sum Task -- Warmup,                [1.093, 1.506, 1.527, 1.639, 1.829],           
          Compress Normal 10 blocks,            [8.636, 12.370, 44.013, 51.628, 68.924],           
                      Update Scheme,                [0.892, 3.599, 3.769, 4.093, 4.498],           
                       Apply Scheme,                [3.112, 3.367, 3.493, 3.829, 7.980],           
              Update & Apply Scheme,               [4.004, 6.603, 6.927, 7.143, 13.654],           
   From Empty Update & Apply Scheme,               [3.804, 3.999, 4.067, 4.762, 22.411],           
          SchemaTest  Repetitions: 100 rand(1000, 1000, 32, 1.0) Seed: 42
                 Sum Task -- Warmup,               [3.608, 3.867, 4.056, 5.260, 12.220],           
          Compress Normal 10 blocks,          [70.259, 72.494, 74.850, 76.022, 165.416],           
                      Update Scheme,           [13.331, 13.997, 14.522, 15.124, 40.080],           
                       Apply Scheme,             [9.139, 9.610, 10.026, 10.510, 35.005],           
              Update & Apply Scheme,           [22.860, 24.072, 24.813, 25.692, 87.585],           
   From Empty Update & Apply Scheme,           [45.950, 47.302, 47.617, 48.151, 90.668], 
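The PR does not say what the five values in each row are. They are always ascending, which is consistent with order statistics of the per-repetition times. The following is a minimal sketch of a harness that would produce such a row, assuming (this is an assumption, not something the PR confirms) that the values are the minimum, 25th, 50th, and 75th percentiles, and maximum in milliseconds. `TimingSketch` and its methods are hypothetical names.

```java
import java.util.Arrays;

/** Hypothetical timing harness; the percentile choice is an assumption. */
final class TimingSketch {
    /** Runs the task reps times; returns wall-clock times in ms, sorted ascending. */
    static double[] time(Runnable task, int reps) {
        double[] ms = new double[reps];
        for (int i = 0; i < reps; i++) {
            long t0 = System.nanoTime();
            task.run();
            ms[i] = (System.nanoTime() - t0) / 1e6;
        }
        Arrays.sort(ms);
        return ms;
    }

    /** Nearest-rank percentile of an ascending array, p in [0, 100]. */
    static double percentile(double[] sorted, double p) {
        int idx = (int) Math.ceil(p / 100.0 * sorted.length) - 1;
        return sorted[Math.max(0, Math.min(sorted.length - 1, idx))];
    }

    /** Five-number summary: min, 25%, 50%, 75%, max. */
    static double[] summary(double[] sorted) {
        return new double[] {
            percentile(sorted, 0), percentile(sorted, 25), percentile(sorted, 50),
            percentile(sorted, 75), percentile(sorted, 100)};
    }

    public static void main(String[] args) {
        double[] data = new double[1_000_000];
        Arrays.fill(data, 1.0);
        // Stand-in task: a dense sum, similar in spirit to the warmup row.
        double[] times = time(() -> {
            double s = 0;
            for (double v : data) s += v;
            if (s < 0) System.out.println(s); // keep the loop from being optimized away
        }, 100);
        System.out.println(Arrays.toString(summary(times)));
    }
}
```

Reporting order statistics rather than a mean keeps a single slow repetition (such as the large maxima in the tables) from dominating the summary.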

@Baunsgaard
Contributor Author

Improved column combining when the different statistics have equal cost (for instance, when the columns are constant).
Before:

          Compress Normal 10 blocks,      [315.406, 320.236, 322.712, 325.224, 576.060],

After:

          Compress Normal 10 blocks,      [106.500, 116.676, 122.512, 164.084, 433.367], 

@Baunsgaard
Contributor Author

And a shortcut optimization:

Compress Normal 10 blocks,          [16.764, 18.058, 18.841, 19.047, 148.369],  

@Baunsgaard
Contributor Author

Initial in-memory performance test of compression and processing speed.
The results show that we can update and apply a compression scheme at high throughput.
The last two columns give the input and output throughput.
For example, the single-threaded Sum processes input at 7.94 GB/s and produces output at 8.33 KB/s.

           WriteTest  Repetitions: 100 ConstMatrix 1000, 1000, 1.0)
      Warmup Sum task Single Thread, [   0.443,    0.462,    0.479,    0.567,   26.832],   7.94 GB/s In,   8.33 KB/s Out
          Warmup Sum task Parallel , [   0.422,    0.441,    0.511,    0.561,    3.409],  14.52 GB/s In,  15.22 KB/s Out
Compression In Memory Single Thread, [  55.041,   56.349,   56.940,   58.169,  232.053], 128.03 MB/s In,  22.66 MB/s Out
Compression In Memory Parallel     , [  15.457,   33.681,   47.115,   49.519,   54.410], 189.31 MB/s In,  33.51 MB/s Out
              Update & Apply Scheme, [  18.761,   20.947,   21.263,   21.495,   39.948], 358.98 MB/s In,  63.54 MB/s Out
                       Apply Scheme, [   8.682,    8.751,    8.815,    8.928,   11.203], 857.06 MB/s In, 151.71 MB/s Out
     Update & Apply Scheme Parallel, [   3.099,    3.567,    3.652,    3.759,    8.134],   2.02 GB/s In, 366.32 MB/s Out
              Apply Scheme Parallel, [   1.204,    1.513,    1.561,    1.600,    4.642],   4.72 GB/s In, 856.16 MB/s Out

FYI @mboehm7, do you think this is the right direction?

Follow-up tests will confirm whether this also holds when writing to disk.

@Baunsgaard
Contributor Author

Image showing the performance at different numbers of unique values.

image

@Baunsgaard
Contributor Author

Baunsgaard commented Aug 1, 2023

On So010, with a 1000 x 1000 matrix and 48 threads in use:

We process 40+-8 GB/s of input with matrix-vector multiplication, against a peak memory bandwidth of 200 GB/s.

           WriteTest  Repetitions: 1000 ConstMatrix ( Rows:1000, Cols:1000, Spar:1.0, Unique: 32)
                                Sum,    0.401+- 0.045 ms,  19974765027+-  1877895561 Byte/s,       399488+-       37557 Byte/s
                            MV mult,    0.182+- 0.036 ms,  43905190543+-  8386160115 Byte/s,     44738539+-     8545335 Byte/s
                Update&Apply Scheme,    3.529+- 0.186 ms,   2267055252+-   121334592 Byte/s,    401301962+-    21477999 Byte/s
          Update&Apply Scheme Fused,    1.627+- 0.056 ms,   4916655958+-   173706513 Byte/s,    870320068+-    30748595 Byte/s
                       Apply Scheme,    1.836+- 0.052 ms,   4358519905+-   125944181 Byte/s,    771521818+-    22293964 Byte/s
            Update&Apply from Empty,    1.467+- 0.086 ms,   5451830796+-   298396674 Byte/s,     27356767+-     1497326 Byte/s
                 Normal Compression,   19.648+- 2.180 ms,    407180261+-    47363150 Byte/s,     72076866+-     8383971 Byte/s
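The throughput columns can be related to the timings with a small calculation. The sketch below assumes the input is a dense 1000 x 1000 matrix of 8-byte doubles (8,000,000 bytes); that size is an assumption, as are the names `ThroughputSketch`, `denseBytes`, and `bytesPerSecond`. With the reported mean of 0.401 ms for Sum this gives roughly 19.95e9 Byte/s, close to the reported 19974765027 Byte/s.

```java
/** Hypothetical helper relating bytes processed and elapsed time to throughput. */
final class ThroughputSketch {
    /** Size in bytes of a dense rows x cols matrix of doubles (assumed input layout). */
    static long denseBytes(int rows, int cols) {
        return (long) rows * cols * Double.BYTES;
    }

    /** Throughput in bytes per second for a given byte count and time in milliseconds. */
    static double bytesPerSecond(long bytes, double millis) {
        return bytes / (millis / 1000.0);
    }

    public static void main(String[] args) {
        long in = denseBytes(1000, 1000);
        // Mean times taken from the Sum and MV mult rows of the table above.
        System.out.printf("Sum:     %.3e Byte/s%n", bytesPerSecond(in, 0.401));
        System.out.printf("MV mult: %.3e Byte/s%n", bytesPerSecond(in, 0.182));
    }
}
```

The same function applied to the size of the produced result gives the output throughput, which is why operations with tiny outputs such as Sum show very low output rates.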

@Baunsgaard force-pushed the SchemaApplyPerf branch 4 times, most recently from 965aaa7 to abebe83 on August 8, 2023 13:05
This commit extends the performance Jar for measuring internal functions.
In specific this commit adds functionality to measure bandwidth utilization
of individual operations.
Labels
None yet
Projects
Status: Done