Towards state-of-the-art multi-scalar-muls #163

mratsim · 2023-06-15T16:15:25Z

This is a mirror to the plan I laid out in the Discord #collaborate channel.

Goal:

Be within 15% of the fastest CPU MSMs.

Out-of-scope:

Rayon performance/multithreading performance, I have yet to look at rayon multithreading implementation. For efficient reduction, we would in particular need futures support in Rayon to enable latency-hiding techniques. The final solution will expose parallelism at 3 levels to benefit from CPUs with very high core count.
verification, it's not the main bottleneck, with very involved steps reworking the towering, frobenius for final exponentiation, Miller loop, you can get a 12x perf improvement on verification between the old zcash/bn implementation and an optimized one (see Create a Groth16 verifier based on the Constantine library metacraft-labs/DendrETH#115 (comment), nim-bncurve is a 1-1 port of zcash/bn). I can detail those if wanted.
solve some inherent low-level inefficiencies (account for those potential 15% missed perf):
- field operations return a field element instead of in-place modification.
  This might be very efficient if the compiler can do RVO (return value optimization) and copy elision, or can be problematic (unsure the compiler can do this with assembly)
- There is significant efficiency gains if we can avoid heap allocation, in particular in batch affine inversion, to avoid stressing the allocator.
  unsure if alloca(+noinline) is possible in Rust

Reference PRs and benchmark of the changes:

Single-threaded, vs Gnark and BLST: Multi-Scalar-Multiplication / Linear combination mratsim/constantine#220
Multi-threaded, 8 cores vs Arkworks, Barretenberg, Bellman, Bellman CE, BLST, Gnark: Parallel Multi-Scalar-Multiplication mratsim/constantine#226 (comment)
Multi-threaded, 18 cores vs Barretenberg, Bellman CE, blstrs, Gnark, note that the linked benchmark perf was improved by 50% by changing the multithreaded barrier algorithm so multithreaded performance is highly dependent on the threadpool/rayon implementation MSM tuning for high core count mratsim/constantine#227 (comment)
Multithreaded, pasta curves vs https://github.com/supranational/pasta-msm:
Pasta / Halo2 MSM bench mratsim/constantine#243 (comment)

The steps

(mandatory +35% perf) Accelerate field multiplication.
- Done in Field arithmetic benchmark + reworked assembly (+40% perf) #49
- Do we need the same in https://github.com/privacy-scaling-explorations/pairing?
(optional +10% perf) implement extended Jacobian coordinates. Their mixed-addition formula is 10 field mul instead of 11 field mul (see also this march paper from Gnark, Table 3, https://tches.iacr.org/index.php/TCHES/article/download/10972/10279 and BLST codebase). used as a fallback:
- When the MSM is too small for affine addition
- When too many points need to be accumulated in the same bucket and our temp buffer overflows.
(recommended) implement fast inversion. Note that the algorithms initially recommended in Implement faster inversion #28 are asymptotically 4x slower than state-of-the-art (Bernstein-Yang and Pornin's bingcd) and in practice 5x slower. This allows aggregating just ~32 points instead of ~160 points minimum to amortize the cost of inversions when doing batch affine.
(mandatory, 1/2 buckets memory requirement) implement signed digit-recoding. wNAF that everyone uses is problematic, it requires precomputation and allocation of all the recoded scalar ahead of times. This is extra memory usage, extra latency, hard to port to GPU. wNAF has 2 draws: being signed recoding so using 2x less space than unsigned, and optimally maximizing the chance of adding zero, i.e. "no cost". But in MSM that second part basically never happens because the number of bits we look at at once grows up to c=16 (or k=16 in Halo2 usual denomiation). Instead Booth encoding can be used for to provide 1, with no precomputation, drastically reducing memory usage.
(mandatory, 70% speedup) Architecture MSM to use affine coordinate (6M asymptotic cost with Montgomery's inversion trick).
I saw that @Brechtpd in MSM optimization halo2#40 and @kilic in Msm optimization #29
are using Barretenberg approach with a radix sort.
I think the better high-level approach is:
- the scheduler approach from
  FPGA Acceleration of Multi-Scalar Multiplication: CycloneMSM
  Kaveh Aasaraai, Don Beaver, Emanuele Cesena, Rahul Maganti, Nicolas Stalder and Javier Varela, 2022
  https://eprint.iacr.org/2022/1396.pdf
  Takeaways: Scheduler idea + collision probability analysis
- combined with parallelism over buckets from Barretenberg's TurboPlonk (without radix sort) or
  cuZK: Accelerating Zero-Knowledge Proof with A Faster Parallel Multi-Scalar Multiplication Algorithm on GPUs
  Tao Lu, Chengkun Wei, Ruijing Yu, Chaochao Chen, Wenjing Fang, Lei Wang, Zeke Wang and Wenzhi Chen, 2022
  https://eprint.iacr.org/2022/1321.pdf
  Takeaway: Parallelism over buckets by considering MSM as a sparse Matrix-Vector operation.
- and Constantine specific memory and memory-bandwidth optimizations:
  - on-the-fly signed digit recoding with zero allocation
  - no allocation/sorting and very limited temp memory to map points to buckets: 8 bytes per point, max ~600-700 points, per thread.
The main issues of Barrentenberg approach is lots of precomputation, memory usage scales linearly (2x or 3x) with input size, a complex program flow, hard to port to GPU, a lot of cache misses which will necessarily cause bandwidth problems. See detailed analysis at https://gist.github.com/mratsim/27c78c71fd423f731615a91d237162c3#file-multi-scalar-mul-md

see writeup on an alternative: https://github.com/mratsim/constantine/blob/master/constantine/math/elliptic/ec_multi_scalar_mul_scheduler.nim
Main takeaway is memory usage scales with the number of threads, like 4kB per extra thread, instead of number of input points.
(optional, up to 40% speedup below 1024 points, 0% above) Endomorphism acceleration. In my benchmarks, the overhead of endomorphism recoding is not worth it after 2^10 = 1024 points, compared to directly processing those. see my MSMs strategies: https://github.com/mratsim/constantine/blob/151f284/constantine/math/elliptic/ec_multi_scalar_mul_parallel.nim#L532-L565

kilic · 2023-10-06T08:17:09Z

on-the-fly signed digit recoding with zero allocation

How do you achieve recording on the fly while we need to iterate bucket indexes (ie slices of scalar) in reverse order? @mratsim

mratsim · 2023-10-06T10:22:04Z

on-the-fly signed digit recoding with zero allocation

How do you achieve recording on the fly while we need to iterate bucket indexes (ie slices of scalar) in reverse order?

Using Booth encoding, you only need to see the window and the previous bit. https://github.com/mratsim/constantine/blob/c97036d1df09b25afaddc040aed468a80df0c8d7/constantine/math/arithmetic/bigints.nim#L792-L829

func signedWindowEncoding(digit: SecretWord, bitsize: static int): tuple[val: SecretWord, neg: SecretBool] {.inline.} =
  ## Get the signed window encoding for `digit`
  ##
  ## This uses the fact that 999 = 100 - 1
  ## It replaces string of binary 1 with 1...-1
  ## i.e. 0111 becomes 1 0 0 -1
  ##
  ## This looks at [bitᵢ₊ₙ..bitᵢ | bitᵢ₋₁]
  ## and encodes   [bitᵢ₊ₙ..bitᵢ]
  ##
  ## Notes:
  ##   - This is not a minimum weight encoding unlike NAF
  ##   - Due to constant-time requirement in scalar multiplication
  ##     or bucketing large window in multi-scalar-multiplication
  ##     minimum weight encoding might not lead to saving operations
  ##   - Unlike NAF and wNAF encoding, there is no carry to propagate
  ##     hence this is suitable for parallelization without encoding precomputation
  ##     and for GPUs
  ##   - Implementation uses Booth encoding
  result.neg = SecretBool(digit shr bitsize)

  let negMask = -SecretWord(result.neg)
  const valMask = SecretWord((1 shl bitsize) - 1)

  let encode = (digit + One) shr 1            # Lookup bitᵢ₋₁, flip series of 1's
  result.val = (encode + negMask) xor negMask # absolute value
  result.val = result.val and valMask

func getSignedFullWindowAt*(a: BigInt, bitIndex: int, windowSize: static int): tuple[val: SecretWord, neg: SecretBool] {.inline.} =
  ## Access a signed window of `a` of size bitsize
  ## Returns a signed encoding.
  ##
  ## The result is `windowSize` bits at a time.
  ##
  ## bitIndex != 0 and bitIndex mod windowSize == 0
  debug: doAssert (bitIndex != 0) and (bitIndex mod windowSize) == 0
  let digit = a.getWindowAt(bitIndex-1, windowSize+1) # get the bit on the right of the window for Booth encoding
  return digit.signedWindowEncoding(windowSize)

Usually NAF has the following advantages:

The main appeal of windowed NAF is reducing the average Hamming Weight of random scalar from 1/2 (random) to 1/3 (NAF) or even less depending on window size. Meaning there are more zeros so we skip additions.
Negation is cheap for elliptic curves, so with a signed digit representation you don't need to store the additive inverse. Hence you can divide by 2 the precomputed table requirement. Hence a window of size 5 would need 2⁴ = 16 elements instead of 32 if unsigned.

However the main appeal of window NAF is not applicable to MSM because as your window grow, it’s exponentially more likely that at least a bit is set and you will have an addition anyway.

With on-the-fly recoding, you avoid allocating millions of recoded scalar. It's also GPU friendly.

einar-taiko mentioned this issue Jun 26, 2023

extended Jacobian coordinates taikoxyz/zkevm-circuits#121

Open

einar-taiko mentioned this issue Aug 24, 2023

Implement signed digit-recoding taikoxyz/zkevm-circuits#141

Open

unbalancedparentheses mentioned this issue Aug 27, 2023

How to bring MSMs Halo2-KZG within 10% of Constantine lambdaclass/lambdaworks#531

Closed

This was referenced Aug 29, 2023

Fast modular inverse - 9.4x acceleration #83

Merged

RFC: Move MSM and FFT in this repo and offer a standard interface #84

Closed

einar-taiko mentioned this issue Sep 8, 2023

Insert MSM and FFT code and their benchmarks. #86

Merged

PatStiles mentioned this issue Nov 3, 2023

[FEAT]: Utilize glv decomposition to speed up MSM operations ingonyama-zk/icicle#260

Closed

kilic mentioned this issue Nov 28, 2023

Booth encoding #106

Merged

MauroToscano mentioned this issue Dec 22, 2023

MSM Optimizations lambdaclass/lambdaworks#730

Open

2 tasks

kilic mentioned this issue Jan 23, 2024

MSM optimisations: CycloneMSM #130

Merged

duguorong009 mentioned this issue Mar 9, 2024

Analyze and select optimizations to port from C++ port of Halo2 by kroma-network/tachyon privacy-scaling-explorations/halo2#293

Open

davidnevadoc transferred this issue from privacy-scaling-explorations/halo2 Jun 11, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Towards state-of-the-art multi-scalar-muls #163

Towards state-of-the-art multi-scalar-muls #163

mratsim commented Jun 15, 2023 •

edited by davidnevadoc

Loading

kilic commented Oct 6, 2023

mratsim commented Oct 6, 2023 •

edited

Loading

Towards state-of-the-art multi-scalar-muls #163

Towards state-of-the-art multi-scalar-muls #163

Comments

mratsim commented Jun 15, 2023 • edited by davidnevadoc Loading

Reference PRs and benchmark of the changes:

The steps

kilic commented Oct 6, 2023

mratsim commented Oct 6, 2023 • edited Loading

mratsim commented Jun 15, 2023 •

edited by davidnevadoc

Loading

mratsim commented Oct 6, 2023 •

edited

Loading