[RFC] Blackboxing MSM and FFT - Hardware Accel API #216
Comments
Hi, this is great! It's a very good definition and description of the hardware API for MSM and FFT. https://github.com/superscalar-io/halo2_device_sample/tree/master We have tested the "device manager" module, and the results are correct and as expected. This project serves as an example with GPU devices and computational units for MSM and NTT. It can also support other devices and computational units. Regarding the ABI, in my personal view, Halo2 establishes the raw data format, and hardware manufacturers may tailor it to align with their hardware characteristics. Additionally, regardless of the coordinate system used for hardware acceleration, the result is ultimately converted to Projective, which is in line with halo2. This approach is okay. However, there may be hidden risks related to coordinate system calculations and adaptations, such as the to_affine() operation, so extra consideration is needed in this area. Everyone is welcome to join the discussion; looking forward to your replies! |
Great work. Regarding the computation graph - could you elaborate on what model you had in mind here? I have a sense that the computation will need more info about the problem (e.g. #ops) but also about the available hardware to do any useful scheduling/ordering/distribution. Some of this info may have to be provided through the API. Is it correct that we need to expose an async Rust wrapper, wrapping async C calls? I would suggest
See |
A tentative trait has been proposed here privacy-scaling-explorations/halo2curves#107. This is focused on stateless MSM to start the discussion and also as the biggest bottleneck. Further primitives would be:
An implementation tutorial is available here: |
Implemented in #277 |
Goals
Following privacy-scaling-explorations/halo2curves#86
MSM and FFT have been moved to halo2curves following rationales in privacy-scaling-explorations/halo2curves#84
This allows the proof systems to evolve separately from the algebra backend, which could be:
This RFC maps the next step, blackboxing MSM and FFT in this repo so they can be swapped out by any provider of a C API.
This has the following benefits:
`std::mem::transmute` is necessary and transmute it for Halo2.
Flexible low-level API design
For reference, this kind of glue API is usually called a HAL, for "Hardware Abstraction Layer" or "Hardware Acceleration Layer".
The proposed API is inspired by Computer Vision and Neural Network acceleration libraries, which seem like the domains most similar in terms of constraints while being the most mature:
Example accelerator APIs from Image Processing and Deep Learning
for example Matrix or Tensor addition `C += αA`: https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnnAddTensor
for example Tensor Sum C = ∑ αᵢAᵢ
https://oneapi-src.github.io/oneDNN/group_dnnl_api_sum.html#details-group-dnnl-api-sum
https://oneapi-src.github.io/oneDNN/page_sum_example_cpp.html#doxid-sum-example-cpp
see RFC https://github.com/vinograd47/opencv-hal-proposal
In particular, Machine Learning and Image Processing also require FFT, which can be a strong inspiration for our API.
The salient parts of those APIs are the following:
They accept a "context" or "handle" pointer object for the engine.
That context can be populated with all the metadata needed for the backend:
They may accept a "context" or "descriptor" object for the operation.
That context may be populated with operation specific flags,
or the output.
actual number of bits used for special cases like MSM on binary values (feat: `multiexp_serial` skips doubling when all bits are zero #202 (comment))
They may accept a "descriptor" object for memory layout of individual images/tensors.
In image processing or machine learning, you can use zero-copy views over different subsets of a tensor/image using strides,
see wrapping real and complex FFTs API from Numpy: https://github.com/SciNim/impulse/blob/26e25e7/impulse/fft/pocketfft.ni (wrapping https://gitlab.mpcdf.mpg.de/mtr/pocketfft/-/tree/cpp)
In our case, the memory layout API is unneeded.
All those APIs also return a library-specific status code; in our case we can assume that the code cannot fail.
In particular, out-of-memory will crash, which is already the case today.
Proposed API for MSM
The API is name-spaced with h2k for Halo2-KZG.
We use BN254 as an example
The C API for multi-scalar multiplication must have the following signature
ABI
For SNARKs/pairing-friendly curves, all implementations use the same low-level representation of field elements and elliptic curve elements, which makes the following ABI a de facto standard.
We describe the ABI on 64-bit machines with BN254 as example:
Field elements are in Montgomery representation `a' = aR (mod M)` (with `M` being either `p`, the curve field prime, or `r`, the curve order)
G1 projective elements use homogeneous projective coordinates:
While Jacobian coordinates are usually used in software libraries, Halo2 switched to
homogeneous projective coordinates in "Switch to homogeneous coordinates + Add complete formulae" (halo2curves#19)
Coincidentally it is possible that homogeneous coordinates are more hardware friendly, especially for pipelining in FPGA, because of complete formulas that require no branches (but we can do "complete" Jacobian formulas with conditional moves so ...)
Ingonyama uses projective coordinates: https://github.com/ingonyama-zk/icicle/blob/97f0079/icicle/primitives/projective.cuh
See also: https://eprint.iacr.org/2022/999.pdf
Sppark uses Jacobian coordinates: https://github.com/supranational/sppark/blob/fffd734/ec/jacobian_t.hpp
conversion is cheap (2 muls + 1 square): https://github.com/mratsim/constantine/blob/f925853/constantine/math/ec_shortweierstrass.nim#L30-L36
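The cost quoted above (2 multiplications + 1 squaring) can be checked with a minimal sketch. It assumes the usual conventions: Jacobian `(X, Y, Z)` stands for the affine point `(X/Z², Y/Z³)` and homogeneous projective `(X, Y, Z)` stands for `(X/Z, Y/Z)`. The toy 61-bit prime `P`, the struct names and the helper functions are all illustrative assumptions, not the halo2curves or Constantine API.

```rust
// Toy prime field (NOT the BN254 base field), chosen so u128 arithmetic suffices.
const P: u64 = (1u64 << 61) - 1;

fn mul(a: u64, b: u64) -> u64 {
    (((a as u128) * (b as u128)) % (P as u128)) as u64
}

// Modular exponentiation by squaring, used only for inversion in the checks.
fn pow(mut base: u64, mut exp: u64) -> u64 {
    let mut acc = 1u64;
    while exp > 0 {
        if exp & 1 == 1 {
            acc = mul(acc, base);
        }
        base = mul(base, base);
        exp >>= 1;
    }
    acc
}

// Fermat inversion: a^(P-2) mod P, valid for a != 0.
fn inv(a: u64) -> u64 {
    pow(a, P - 2)
}

// Hypothetical coordinate containers for illustration.
pub struct Jacobian { pub x: u64, pub y: u64, pub z: u64 }
pub struct Projective { pub x: u64, pub y: u64, pub z: u64 }

// Jacobian -> homogeneous projective: X' = X*Z, Y' = Y, Z' = Z^3.
pub fn jacobian_to_projective(p: &Jacobian) -> Projective {
    let z2 = mul(p.z, p.z); // 1 squaring
    Projective {
        x: mul(p.x, p.z),   // 1 multiplication
        y: p.y,
        z: mul(z2, p.z),    // 1 multiplication
    }
}

// Affine coordinates from each representation (z must be non-zero).
pub fn jacobian_to_affine(p: &Jacobian) -> (u64, u64) {
    let zi = inv(p.z);
    let zi2 = mul(zi, zi);
    (mul(p.x, zi2), mul(p.y, mul(zi2, zi)))
}

pub fn projective_to_affine(p: &Projective) -> (u64, u64) {
    let zi = inv(p.z);
    (mul(p.x, zi), mul(p.y, zi))
}
```

Both representations map to the same affine pair after conversion, which is the property an accelerator working in Jacobian coordinates would rely on before handing results back.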
Compatibility
AMD, Intel and Nvidia GPUs
As far as I am aware, all cryptographic libraries (CPU/GPU) use the same Montgomery magic constant in their representation, and their limb-endianness is little-endian.
Furthermore we note that all GPU ISAs use little-endian words (Intel, AMD, Nvidia, Apple).
When words and limbs are both little-endian, whether we use 32-bit or 64-bit words will not change the binary representation of the whole data structure.
This means that there is no need for conversion between a little-endian machine (x86, ARM)
and a GPU even if words on one are 64-bit and for the other 32-bit.
We can just cast/transmute between CPU and GPU pointers. This saves time and memory.
One caveat: this does not hold for curves for which the number of 32-bit limbs is not double the number of 64-bit limbs.
As far as I'm aware this is only the case for P224 (7 32-bit words, 4 64-bit words), and it's not an interesting curve.
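The zero-copy claim can be illustrated with a small sketch. It models little-endian memory explicitly with `to_le_bytes` so it behaves the same on any host; the function names are hypothetical and only serve the illustration.

```rust
// Byte image of 4 little-endian 64-bit limbs, least significant limb first.
pub fn bytes_from_u64_limbs(limbs: &[u64; 4]) -> [u8; 32] {
    let mut out = [0u8; 32];
    for (i, limb) in limbs.iter().enumerate() {
        out[8 * i..8 * i + 8].copy_from_slice(&limb.to_le_bytes());
    }
    out
}

// Byte image of 8 little-endian 32-bit limbs, least significant limb first.
pub fn bytes_from_u32_limbs(limbs: &[u32; 8]) -> [u8; 32] {
    let mut out = [0u8; 32];
    for (i, limb) in limbs.iter().enumerate() {
        out[4 * i..4 * i + 4].copy_from_slice(&limb.to_le_bytes());
    }
    out
}

// Split each 64-bit limb into its low and high 32-bit halves.
pub fn split_into_u32_limbs(limbs: &[u64; 4]) -> [u32; 8] {
    let mut out = [0u32; 8];
    for (i, &limb) in limbs.iter().enumerate() {
        out[2 * i] = limb as u32;
        out[2 * i + 1] = (limb >> 32) as u32;
    }
    out
}

// Number of words of `word_bits` bits needed to hold a `bits`-bit value.
pub fn words_needed(bits: usize, word_bits: usize) -> usize {
    (bits + word_bits - 1) / word_bits
}
```

The two byte images are identical for a 254-bit field element, so a pointer cast is enough. For P224, `words_needed(224, 32)` is 7 while `words_needed(224, 64)` is 4, so the 32-bit and 64-bit layouts differ in size and the cast no longer applies.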
FPGA
FPGAs usually work in the canonical domain and use Barrett reduction instead of Montgomery, so a conversion will be needed; hence endianness/zero-copy doesn't matter.
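To make the conversion step concrete, here is a minimal sketch of moving between the canonical and Montgomery domains. It assumes a toy single-limb modulus `M = 2^61 - 1` with `R = 2^64` (not BN254, where `R = 2^256` over four limbs); all function names are hypothetical.

```rust
// Toy odd modulus that fits in one 64-bit limb; R = 2^64.
const M: u64 = (1u64 << 61) - 1;

// -M^{-1} mod 2^64 via Newton iteration (each step doubles the correct bits).
fn neg_inv_mod_r(m: u64) -> u64 {
    let mut inv: u64 = 1; // correct modulo 2 because m is odd
    for _ in 0..6 {
        inv = inv.wrapping_mul(2u64.wrapping_sub(m.wrapping_mul(inv)));
    }
    inv.wrapping_neg()
}

// Montgomery reduction: returns t * R^{-1} mod M for t < M * R.
fn redc(t: u128) -> u64 {
    let m = (t as u64).wrapping_mul(neg_inv_mod_r(M));
    let u = ((t + (m as u128) * (M as u128)) >> 64) as u64;
    if u >= M { u - M } else { u }
}

// Canonical -> Montgomery: a' = a * R mod M.
pub fn to_mont(a: u64) -> u64 {
    (((a as u128) << 64) % (M as u128)) as u64
}

// Montgomery -> canonical: a = a' * R^{-1} mod M.
pub fn from_mont(a_mont: u64) -> u64 {
    redc(a_mont as u128)
}

// Product of two Montgomery-form values, result stays in Montgomery form.
pub fn mont_mul(a_mont: u64, b_mont: u64) -> u64 {
    redc((a_mont as u128) * (b_mont as u128))
}
```

A backend working in the canonical domain would apply the equivalent of `from_mont` on inputs and `to_mont` on outputs, which is why the buffers cannot simply be cast in that case.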
Apple GPUs, MIPS, RISC-V, WASM
While Apple GPUs are also little-endian, there is one caveat: they do not support addition-with-carry, subtraction-with-borrow or 32x32 -> 64 extended precision multiplication, at least not officially (unlike AMD, Intel and Nvidia GPUs).
Looking at the reverse-engineering effort from Asahi Linux (https://rosenzweig.io/), I have not seen those instructions either.
Hence for speed it's likely that an approach of using `9*29-bit` limbs (instead of `8*32-bit`) would yield a faster implementation, and so Apple GPUs would need a conversion step.
This approach is also likely necessary for accelerating MIPS, RISC-V and WASM targets.
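The conversion step mentioned above could look like the following sketch, which repacks a 256-bit value between 4 64-bit limbs and 9 limbs of 29 bits (9 × 29 = 261 bits, leaving headroom for carries in each 32-bit word). The function names are hypothetical and the sketch covers only the repacking, not the arithmetic on the 29-bit limbs.

```rust
const LIMB_BITS: usize = 29;
const LIMB_MASK: u64 = (1 << LIMB_BITS) - 1;

// Repack 4 little-endian 64-bit limbs into 9 little-endian 29-bit limbs.
pub fn to_29bit_limbs(words: &[u64; 4]) -> [u32; 9] {
    let mut out = [0u32; 9];
    for (i, limb) in out.iter_mut().enumerate() {
        let offset = LIMB_BITS * i;
        let (word, shift) = (offset / 64, offset % 64);
        let mut bits = words[word] >> shift;
        // The limb straddles two 64-bit words: pull in the low bits of the next one.
        if shift + LIMB_BITS > 64 && word + 1 < 4 {
            bits |= words[word + 1] << (64 - shift);
        }
        *limb = (bits & LIMB_MASK) as u32;
    }
    out
}

// Inverse repacking; assumes every input limb is below 2^29.
pub fn from_29bit_limbs(limbs: &[u32; 9]) -> [u64; 4] {
    let mut out = [0u64; 4];
    for (i, &limb) in limbs.iter().enumerate() {
        let offset = LIMB_BITS * i;
        let (word, shift) = (offset / 64, offset % 64);
        out[word] |= (limb as u64) << shift;
        // Spill the high bits of a straddling limb into the next 64-bit word.
        if shift + LIMB_BITS > 64 && word + 1 < 4 {
            out[word + 1] |= (limb as u64) >> (64 - shift);
        }
    }
    out
}
```

The round trip is lossless for any 256-bit input, so the repacking can sit at the boundary of a backend without changing the ABI seen by Halo2.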
Proposed async API
In some protocols we might want to process multiple MSMs/FFTs in parallel.
For example, batch KZG verification described here: https://github.com/ethereum/consensus-specs/blob/v1.4.0-beta.2/specs/deneb/polynomial-commitments.md#verify_blob_kzg_proof_batch
does the following (with `e` a pairing function, and `[a]₁` the scalar `a` multiplied by the generator of 𝔾1)
We can launch the 3 MSMs in an async manner and await their readiness before doing the pairing.
We copy the CUDA API, for example `cudaMemcpyAsync`, and just suffix the function with async and return an opaque future handle.
The engine should provide a function `wait` that allows blocking until the result is ready.
This should be flexible enough to wrap CUDA Streams and OpenCL events, see also my high-level description of:
In Halo2, `wait` MUST be called once and only once.
This allows memory reclamation of the handle and async task to be done before exiting `wait`.
It also guarantees no double-free.
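On the Rust side, the "once and only once" rule can be expressed in the type system. The sketch below is a mock under stated assumptions: a thread stands in for the accelerator, integers modulo a toy prime stand in for curve points, and `MsmFuture`, `msm_async` and `launch_batch` are hypothetical names, not the proposed C API.

```rust
use std::thread::{self, JoinHandle};

// Toy modulus; modular multiplication stands in for scalar multiplication.
const MODULUS: u64 = (1u64 << 61) - 1;

// Opaque future handle. It is neither Clone nor Copy.
pub struct MsmFuture {
    task: JoinHandle<u64>,
}

impl MsmFuture {
    // Consumes the handle, so calling `wait` twice is a compile error
    // and the task resources are released before `wait` returns.
    pub fn wait(self) -> u64 {
        self.task.join().expect("MSM worker panicked")
    }
}

// Launch a mock MSM (sum of scalar * point mod MODULUS) without blocking.
pub fn msm_async(scalars: Vec<u64>, points: Vec<u64>) -> MsmFuture {
    assert_eq!(scalars.len(), points.len());
    let task = thread::spawn(move || {
        scalars.iter().zip(points.iter()).fold(0u64, |acc, (&s, &p)| {
            let term = (((s as u128) * (p as u128)) % (MODULUS as u128)) as u64;
            (acc + term) % MODULUS
        })
    });
    MsmFuture { task }
}

// The handles escape the scope that launched them and are awaited by the caller.
pub fn launch_batch(batches: Vec<(Vec<u64>, Vec<u64>)>) -> Vec<MsmFuture> {
    batches.into_iter().map(|(s, p)| msm_async(s, p)).collect()
}
```

A caller would launch several MSMs with `launch_batch`, do unrelated work, then call `wait` on each handle exactly once before the pairing step. A real wrapper over a C engine would additionally need to decide what happens if a handle is dropped without being waited on.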
For flexibility in scheduling computation graphs, the future handle may escape its scope and be returned by the function that used `msmAsync`. The engine MUST support this use-case.
Init/shutdown
The engine should also provide an init and shutdown function and may add configuration options there like number of threads for CPU backends or target GPUs for GPU backends.
FFT
TBD
Naming
I propose we refer to the API as "Accel" in discussion, as we might want not only hardware accelerators but also other software backends.
Future considerations
For better acceleration, teams are considering the whole prover on GPUs.
It would be interesting to know their feedback on the bottlenecks and if the async API covers that.