Currently the MSM scheduler is minimizing the global number of additions and doublings.
However, to benefit from maximum parallelism, it might be worthwhile to minimize the per-thread number of additions and doublings, even if that means slightly more of them globally.
Motivating example:
256 inputs require c = 7 (window size in bits).
After endomorphism acceleration we have coefficients of 128 bits, hence 128/7 ≈ 18.29, i.e. 18 full windows plus a partial one, so 19 mini-MSMs.
On a 16-thread machine, you would wait for 2 rounds of mini-MSMs, with only 3 mini-MSMs left for the second round and 13 out of 16 threads idle.
This can be mitigated with latency hiding, but you can only do so much if the imbalance is that large.
Here, moving to c = 8 (128/8 = 16 mini-MSMs, exactly one per thread) or c = 4 (128/4 = 32 mini-MSMs, two full rounds) would better utilize the cores.
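The arithmetic in this example can be checked with a small sketch. It is not the scheduler's actual code: the function name `window_schedule` and its parameters are hypothetical, and it assumes one mini-MSM per window (a trailing partial window counts as a full one), homogeneous threads, and one mini-MSM per thread per round.

```python
import math

def window_schedule(scalar_bits: int, c: int, threads: int) -> tuple[int, int, int]:
    """Return (mini_msms, rounds, idle_threads_in_last_round).

    Hypothetical helper: assumes one mini-MSM per c-bit window,
    a partial top window counted as a full mini-MSM, and each
    thread processing one mini-MSM per round.
    """
    mini_msms = math.ceil(scalar_bits / c)           # number of windows
    rounds = math.ceil(mini_msms / threads)          # rounds needed to drain them
    in_last_round = mini_msms - (rounds - 1) * threads
    idle = threads - in_last_round                   # threads with no work at the end
    return mini_msms, rounds, idle

if __name__ == "__main__":
    # 128-bit coefficients (after endomorphism acceleration), 16 threads
    for c in (7, 8, 4):
        mini_msms, rounds, idle = window_schedule(128, c, 16)
        print(f"c={c}: {mini_msms} mini-MSMs, {rounds} round(s), "
              f"{idle} idle thread(s) in the last round")
```

Under these assumptions, c = 7 leaves 13 threads idle in the last round, while c = 8 and c = 4 leave none. The sketch ignores the per-window cost, which also changes with c.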
Note that if cores are not homogeneous, for example with some 3x faster than others, an exact division of work no longer balances the load.