Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Towards state-of-the-art multi-scalar-muls #163

Open
3 of 6 tasks
mratsim opened this issue Jun 15, 2023 · 2 comments
Open
3 of 6 tasks

Towards state-of-the-art multi-scalar-muls #163

mratsim opened this issue Jun 15, 2023 · 2 comments

Comments

@mratsim
Copy link
Contributor

mratsim commented Jun 15, 2023

This is a mirror to the plan I laid out in the Discord #collaborate channel.

Goal:

  • Be within 15% of the fastest CPU MSMs.

Out-of-scope:

  • Rayon performance/multithreading performance, I have yet to look at rayon multithreading implementation. For efficient reduction, we would in particular need futures support in Rayon to enable latency-hiding techniques. The final solution will expose parallelism at 3 levels to benefit from CPUs with very high core count.
  • verification, it's not the main bottleneck, with very involved steps reworking the towering, frobenius for final exponentiation, Miller loop, you can get a 12x perf improvement on verification between the old zcash/bn implementation and an optimized one (see Create a Groth16 verifier based on the Constantine library metacraft-labs/DendrETH#115 (comment), nim-bncurve is a 1-1 port of zcash/bn). I can detail those if wanted.
  • solve some inherent low-level inefficiencies (account for those potential 15% missed perf):
    • field operations return a field element instead of in-place modification.
      This might be very efficient if the compiler can do RVO (return value optimization) and copy elision, or can be problematic (unsure the compiler can do this with assembly)
    • There is significant efficiency gains if we can avoid heap allocation, in particular in batch affine inversion, to avoid stressing the allocator.
      unsure if alloca(+noinline) is possible in Rust

Reference PRs and benchmark of the changes:

The steps

  1. (mandatory +35% perf) Accelerate field multiplication.

  2. (optional +10% perf) implement extended Jacobian coordinates. Their mixed-addition formula is 10 field mul instead of 11 field mul (see also this march paper from Gnark, Table 3, https://tches.iacr.org/index.php/TCHES/article/download/10972/10279 and BLST codebase). used as a fallback:

    • When the MSM is too small for affine addition
    • When too many points need to be accumulated in the same bucket and our temp buffer overflows.
  3. (recommended) implement fast inversion. Note that the algorithms initially recommended in Implement faster inversion #28 are asymptotically 4x slower than state-of-the-art (Bernstein-Yang and Pornin's bingcd) and in practice 5x slower. This allows aggregating just ~32 points instead of ~160 points minimum to amortize the cost of inversions when doing batch affine.

  4. (mandatory, 1/2 buckets memory requirement) implement signed digit-recoding. wNAF that everyone uses is problematic, it requires precomputation and allocation of all the recoded scalar ahead of times. This is extra memory usage, extra latency, hard to port to GPU. wNAF has 2 draws: being signed recoding so using 2x less space than unsigned, and optimally maximizing the chance of adding zero, i.e. "no cost". But in MSM that second part basically never happens because the number of bits we look at at once grows up to c=16 (or k=16 in Halo2 usual denomiation). Instead Booth encoding can be used for to provide 1, with no precomputation, drastically reducing memory usage.

  5. (mandatory, 70% speedup) Architecture MSM to use affine coordinate (6M asymptotic cost with Montgomery's inversion trick).
    I saw that @Brechtpd in MSM optimization halo2#40 and @kilic in Msm optimization #29
    are using Barretenberg approach with a radix sort.
    I think the better high-level approach is:

    • the scheduler approach from
      FPGA Acceleration of Multi-Scalar Multiplication: CycloneMSM
      Kaveh Aasaraai, Don Beaver, Emanuele Cesena, Rahul Maganti, Nicolas Stalder and Javier Varela, 2022
      https://eprint.iacr.org/2022/1396.pdf
      Takeaways: Scheduler idea + collision probability analysis
    • combined with parallelism over buckets from Barretenberg's TurboPlonk (without radix sort) or
      cuZK: Accelerating Zero-Knowledge Proof with A Faster Parallel Multi-Scalar Multiplication Algorithm on GPUs
      Tao Lu, Chengkun Wei, Ruijing Yu, Chaochao Chen, Wenjing Fang, Lei Wang, Zeke Wang and Wenzhi Chen, 2022
      https://eprint.iacr.org/2022/1321.pdf
      Takeaway: Parallelism over buckets by considering MSM as a sparse Matrix-Vector operation.
    • and Constantine specific memory and memory-bandwidth optimizations:
      • on-the-fly signed digit recoding with zero allocation
      • no allocation/sorting and very limited temp memory to map points to buckets: 8 bytes per point, max ~600-700 points, per thread.

    The main issues of Barrentenberg approach is lots of precomputation, memory usage scales linearly (2x or 3x) with input size, a complex program flow, hard to port to GPU, a lot of cache misses which will necessarily cause bandwidth problems. See detailed analysis at https://gist.github.com/mratsim/27c78c71fd423f731615a91d237162c3#file-multi-scalar-mul-md

    see writeup on an alternative: https://github.com/mratsim/constantine/blob/master/constantine/math/elliptic/ec_multi_scalar_mul_scheduler.nim
    Main takeaway is memory usage scales with the number of threads, like 4kB per extra thread, instead of number of input points.

  6. (optional, up to 40% speedup below 1024 points, 0% above) Endomorphism acceleration. In my benchmarks, the overhead of endomorphism recoding is not worth it after 2^10 = 1024 points, compared to directly processing those. see my MSMs strategies: https://github.com/mratsim/constantine/blob/151f284/constantine/math/elliptic/ec_multi_scalar_mul_parallel.nim#L532-L565

@kilic
Copy link
Collaborator

kilic commented Oct 6, 2023

on-the-fly signed digit recoding with zero allocation

How do you achieve recording on the fly while we need to iterate bucket indexes (ie slices of scalar) in reverse order? @mratsim

@mratsim
Copy link
Contributor Author

mratsim commented Oct 6, 2023

on-the-fly signed digit recoding with zero allocation

How do you achieve recording on the fly while we need to iterate bucket indexes (ie slices of scalar) in reverse order?

Using Booth encoding, you only need to see the window and the previous bit. https://github.com/mratsim/constantine/blob/c97036d1df09b25afaddc040aed468a80df0c8d7/constantine/math/arithmetic/bigints.nim#L792-L829

func signedWindowEncoding(digit: SecretWord, bitsize: static int): tuple[val: SecretWord, neg: SecretBool] {.inline.} =
  ## Get the signed window encoding for `digit`
  ##
  ## This uses the fact that 999 = 100 - 1
  ## It replaces string of binary 1 with 1...-1
  ## i.e. 0111 becomes 1 0 0 -1
  ##
  ## This looks at [bitᵢ₊ₙ..bitᵢ | bitᵢ₋₁]
  ## and encodes   [bitᵢ₊ₙ..bitᵢ]
  ##
  ## Notes:
  ##   - This is not a minimum weight encoding unlike NAF
  ##   - Due to constant-time requirement in scalar multiplication
  ##     or bucketing large window in multi-scalar-multiplication
  ##     minimum weight encoding might not lead to saving operations
  ##   - Unlike NAF and wNAF encoding, there is no carry to propagate
  ##     hence this is suitable for parallelization without encoding precomputation
  ##     and for GPUs
  ##   - Implementation uses Booth encoding
  result.neg = SecretBool(digit shr bitsize)

  let negMask = -SecretWord(result.neg)
  const valMask = SecretWord((1 shl bitsize) - 1)

  let encode = (digit + One) shr 1            # Lookup bitᵢ₋₁, flip series of 1's
  result.val = (encode + negMask) xor negMask # absolute value
  result.val = result.val and valMask

func getSignedFullWindowAt*(a: BigInt, bitIndex: int, windowSize: static int): tuple[val: SecretWord, neg: SecretBool] {.inline.} =
  ## Access a signed window of `a` of size bitsize
  ## Returns a signed encoding.
  ##
  ## The result is `windowSize` bits at a time.
  ##
  ## bitIndex != 0 and bitIndex mod windowSize == 0
  debug: doAssert (bitIndex != 0) and (bitIndex mod windowSize) == 0
  let digit = a.getWindowAt(bitIndex-1, windowSize+1) # get the bit on the right of the window for Booth encoding
  return digit.signedWindowEncoding(windowSize)

Usually NAF has the following advantages:

  1. The main appeal of windowed NAF is reducing the average Hamming Weight of random scalar from 1/2 (random) to 1/3 (NAF) or even less depending on window size. Meaning there are more zeros so we skip additions.
  2. Negation is cheap for elliptic curves, so with a signed digit representation you don't need to store the additive inverse. Hence you can divide by 2 the precomputed table requirement. Hence a window of size 5 would need 2⁴ = 16 elements instead of 32 if unsigned.

However the main appeal of window NAF is not applicable to MSM because as your window grow, it’s exponentially more likely that at least a bit is set and you will have an addition anyway.

With on-the-fly recoding, you avoid allocating millions of recoded scalar. It's also GPU friendly.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

2 participants