-
Notifications
You must be signed in to change notification settings - Fork 139
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Towards state-of-the-art multi-scalar-muls #163
Comments
How do you achieve recording on the fly while we need to iterate bucket indexes (ie slices of scalar) in reverse order? @mratsim |
Using Booth encoding, you only need to see the window and the previous bit. https://github.com/mratsim/constantine/blob/c97036d1df09b25afaddc040aed468a80df0c8d7/constantine/math/arithmetic/bigints.nim#L792-L829 func signedWindowEncoding(digit: SecretWord, bitsize: static int): tuple[val: SecretWord, neg: SecretBool] {.inline.} =
## Get the signed window encoding for `digit`
##
## This uses the fact that 999 = 100 - 1
## It replaces string of binary 1 with 1...-1
## i.e. 0111 becomes 1 0 0 -1
##
## This looks at [bitᵢ₊ₙ..bitᵢ | bitᵢ₋₁]
## and encodes [bitᵢ₊ₙ..bitᵢ]
##
## Notes:
## - This is not a minimum weight encoding unlike NAF
## - Due to constant-time requirement in scalar multiplication
## or bucketing large window in multi-scalar-multiplication
## minimum weight encoding might not lead to saving operations
## - Unlike NAF and wNAF encoding, there is no carry to propagate
## hence this is suitable for parallelization without encoding precomputation
## and for GPUs
## - Implementation uses Booth encoding
result.neg = SecretBool(digit shr bitsize)
let negMask = -SecretWord(result.neg)
const valMask = SecretWord((1 shl bitsize) - 1)
let encode = (digit + One) shr 1 # Lookup bitᵢ₋₁, flip series of 1's
result.val = (encode + negMask) xor negMask # absolute value
result.val = result.val and valMask
func getSignedFullWindowAt*(a: BigInt, bitIndex: int, windowSize: static int): tuple[val: SecretWord, neg: SecretBool] {.inline.} =
## Access a signed window of `a` of size bitsize
## Returns a signed encoding.
##
## The result is `windowSize` bits at a time.
##
## bitIndex != 0 and bitIndex mod windowSize == 0
debug: doAssert (bitIndex != 0) and (bitIndex mod windowSize) == 0
let digit = a.getWindowAt(bitIndex-1, windowSize+1) # get the bit on the right of the window for Booth encoding
return digit.signedWindowEncoding(windowSize)
Usually NAF has the following advantages:
However the main appeal of window NAF is not applicable to MSM because as your window grow, it’s exponentially more likely that at least a bit is set and you will have an addition anyway. With on-the-fly recoding, you avoid allocating millions of recoded scalar. It's also GPU friendly. |
This is a mirror to the plan I laid out in the Discord #collaborate channel.
Goal:
Out-of-scope:
This might be very efficient if the compiler can do RVO (return value optimization) and copy elision, or can be problematic (unsure the compiler can do this with assembly)
unsure if alloca(+noinline) is possible in Rust
Reference PRs and benchmark of the changes:
Pasta / Halo2 MSM bench mratsim/constantine#243 (comment)
The steps
(mandatory +35% perf) Accelerate field multiplication.
(optional +10% perf) implement extended Jacobian coordinates. Their mixed-addition formula is 10 field mul instead of 11 field mul (see also this march paper from Gnark, Table 3, https://tches.iacr.org/index.php/TCHES/article/download/10972/10279 and BLST codebase). used as a fallback:
(recommended) implement fast inversion. Note that the algorithms initially recommended in Implement faster inversion #28 are asymptotically 4x slower than state-of-the-art (Bernstein-Yang and Pornin's bingcd) and in practice 5x slower. This allows aggregating just ~32 points instead of ~160 points minimum to amortize the cost of inversions when doing batch affine.
(mandatory, 1/2 buckets memory requirement) implement signed digit-recoding. wNAF that everyone uses is problematic, it requires precomputation and allocation of all the recoded scalar ahead of times. This is extra memory usage, extra latency, hard to port to GPU. wNAF has 2 draws: being signed recoding so using 2x less space than unsigned, and optimally maximizing the chance of adding zero, i.e. "no cost". But in MSM that second part basically never happens because the number of bits we look at at once grows up to c=16 (or k=16 in Halo2 usual denomiation). Instead Booth encoding can be used for to provide 1, with no precomputation, drastically reducing memory usage.
(mandatory, 70% speedup) Architecture MSM to use affine coordinate (6M asymptotic cost with Montgomery's inversion trick).
I saw that @Brechtpd in MSM optimization halo2#40 and @kilic in Msm optimization #29
are using Barretenberg approach with a radix sort.
I think the better high-level approach is:
FPGA Acceleration of Multi-Scalar Multiplication: CycloneMSM
Kaveh Aasaraai, Don Beaver, Emanuele Cesena, Rahul Maganti, Nicolas Stalder and Javier Varela, 2022
https://eprint.iacr.org/2022/1396.pdf
Takeaways: Scheduler idea + collision probability analysis
cuZK: Accelerating Zero-Knowledge Proof with A Faster Parallel Multi-Scalar Multiplication Algorithm on GPUs
Tao Lu, Chengkun Wei, Ruijing Yu, Chaochao Chen, Wenjing Fang, Lei Wang, Zeke Wang and Wenzhi Chen, 2022
https://eprint.iacr.org/2022/1321.pdf
Takeaway: Parallelism over buckets by considering MSM as a sparse Matrix-Vector operation.
The main issues of Barrentenberg approach is lots of precomputation, memory usage scales linearly (2x or 3x) with input size, a complex program flow, hard to port to GPU, a lot of cache misses which will necessarily cause bandwidth problems. See detailed analysis at https://gist.github.com/mratsim/27c78c71fd423f731615a91d237162c3#file-multi-scalar-mul-md
see writeup on an alternative: https://github.com/mratsim/constantine/blob/master/constantine/math/elliptic/ec_multi_scalar_mul_scheduler.nim
Main takeaway is memory usage scales with the number of threads, like 4kB per extra thread, instead of number of input points.
(optional, up to 40% speedup below 1024 points, 0% above) Endomorphism acceleration. In my benchmarks, the overhead of endomorphism recoding is not worth it after 2^10 = 1024 points, compared to directly processing those. see my MSMs strategies: https://github.com/mratsim/constantine/blob/151f284/constantine/math/elliptic/ec_multi_scalar_mul_parallel.nim#L532-L565
The text was updated successfully, but these errors were encountered: