Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

[RFC] Blackboxing MSM and FFT - Hardware Accel API #216

Closed
mratsim opened this issue Sep 28, 2023 · 4 comments
Closed

[RFC] Blackboxing MSM and FFT - Hardware Accel API #216

mratsim opened this issue Sep 28, 2023 · 4 comments

Comments

@mratsim
Copy link

mratsim commented Sep 28, 2023

Goals

Following privacy-scaling-explorations/halo2curves#86
MSM and FFT have been moved to halo2curves following rationales in privacy-scaling-explorations/halo2curves#84

This allows the proof systems to evolve separately from the algebra backend, that could be:

This RFC maps the next step, blackboxing MSM and FFT in this repo so they can be swapped out by any provider of a C API.

This has the following benefits:

  • C is the lingua-franca for low-level, and the C ABI standardized, all languages can consume it. This gives maximum compatibility.
  • Calling C from Rust has zero overhead if the low-level API can be designed so that only std::mem::transmute is necessary
  • It is possible to reuse a library from another Rust ZK dialect (for example Arkworks) if they are ABI compatible
    and transmute it for Halo2.

Flexible low-level API design

For reference, these kind of glue APIs is usually called HAL for "Hardware Abstraction Layer" or "Hardware Acceleration Layer".

The proposed API is inspired by Computer Vision and Neural Network acceleration libraries which seems like the domains most similar in terms of constraints
while being the most mature:

  • Lots of open-source libraries, so composability is key.
  • Math-heavy (Linear-algebra) so interfaces can be defined by math (FFTs ...).
  • GPU or dedicated accelerators friendly.
  • Operations may build a computation graph
  • Many level of async/parallelism opportunities so enabling those levels is important (graph-level and operation level for example)

Example accelerator APIs from Image Processing and Deep Learning

in particular Machine Learning and Image Processing also requires FFT, and that can be a strong inspiration for our API.

The salient part of those API is the following:

  • They accept a "context" or "handle" pointer object for the engine.
    That context can be populated with all the metadata needed for the backend:

    • hardware features detection
    • threadpool metadata
    • Cuda metadata (streams, device(s), ...)
    • memory pools, ...
  • They may accept a "context" or "descriptor" object for the operation.
    That context may be populated with operation specific flags,

  • They may accept a "descriptor" object for memory layout of individual images/tensors.
    In image processing or machine learning, you can use zero-copy views over different subsets of a tensor/image using strides,
    see wrapping real and complex FFTs API from Numpy: https://github.com/SciNim/impulse/blob/26e25e7/impulse/fft/pocketfft.ni (wrapping https://gitlab.mpcdf.mpg.de/mtr/pocketfft/-/tree/cpp)

    In our case, the memory layout API is unneeded.

All those APIs also return library specific status code, in our case we can assume that the code cannot fail.
In particular out-of-memory will crash, which is already the case today.

Proposed API for MSM

The API is name-spaced with h2k for Halo2-KZG.
We use BN254 as an example

The C API for multi-scalar multiplication must have the following signature

void h2k_mycurve_multi_scalar_mul(
        engine_t* engine_context,
        h2k_bn254_g1_prj* out,
        h2k_bn254_fr* scalars,
        h2k_bn254_g1_aff* points,
        int len)

ABI

For SNARKS/pairing-friendly curves, all implementations use the same low-level representation of field elements and elliptic curve elements, which makes the following ABI a defacto standard.

We describe the ABI on 64-bit machines with BN254 as example:

#define pBits 254       // The number of bits in the curve field modulus
#define rBits 254       // The number of bits in the curve order
#define WordBitWidth 64 // The number of bits in Halo2-KZG words

#define words_required(bits) ((bits+WordBitWidth-1)/WordBitWidth)

typedef struct { uint64_t limbs[words_required(rBits)]; } bn254_fr;
typedef struct { uint64_t limbs[words_required(pBits)]; } bn254_fp;

typedef struct { bn254_fp x, y; } bn254_g1_aff;
typedef struct { bn254_fp x, y; z; } bn254_g1_prj;

Compatibility

AMD, Intel and Nvidia GPUs

As far as I am aware, all cryptographic libraries (CPU/GPU) are using the same Montgomery magic constant in their representation and their limb-endianess is little-endian.

Furthermore we note that ALL GPUs ISA use little-endian words (Intel, AMD, Nvidia, Apple).

When words and limbs are both little-endian, whether we use 32-bit or 64-bit words will not change the binary representation of the whole data structure.

This means that there is no need for conversion between a little-endian machine (x86, ARM)
and a GPU even if words on one are 64-bit and for the other 32-bit.
We can just cast/transmute between CPU and GPU pointers. This saves time and memory.

We caveat this for curves for which the number of limbs on 32-bit is not the double of 64-bit.
This is the case for P224 (7 32-bit words, 4 64-bit words) only as far as I'm aware, and it's not an interesting curve.

FPGA

FPGA usually work in the canonical domain and use Barret Reduction instead of Montgomery so a conversion will be needed, hence endianness/zero-copy doesn't matter.

Apple GPUs, MIPS, RISC-V, WASM

While Apple GPU are also little-endian, there is one caveat, they do not support addition-with-carry or substraction-with-borrow and 32x32 -> 64 extended precision multiplication, at least not officially (unlike AMD, Intel and Nvidia GPUs).
Looking at the reverse-engineering effort from Asahi Linux (https://rosenzweig.io/), I have not seen those instructions either.
Hence for speed it's likely that an approach of using 9*29-bit limbs (instead of 8*32-bit) would yield a faster implementation and so Apple GPUs would need a conversion step.

This approach is also likely necessary for accelerating MIPS, RISC-V and WASM targets

Proposed async API

In some protocols we might want to process multiple MSMs/FFTs in parallel.
For example, batch KZG verification described here: https://github.com/ethereum/consensus-specs/blob/v1.4.0-beta.2/specs/deneb/polynomial-commitments.md#verify_blob_kzg_proof_batch

does the following (with e a pairing function, and [a]₁ the scalar a multiplied by the generator of 𝔾1)

e(∑ [rᵢ][proofᵢ]₁, [τ]₂) . e(∑[rᵢ]([commitmentᵢ]₁ - [eval_at_challengeᵢ]₁) + ∑[rᵢ][zᵢ][proofᵢ]₁, [-1]₂) = 1

We can launch the 3 MSMs in an async manner and await their readiness before doing the pairing.

We copy the Cuda API, for example cudaMemCpyAsync and just suffix the function with async and return an opaque future handle.

engine_future_t h2k_mycurve_multi_scalar_mul_async(
        engine_t* engine_context,
        h2k_bn254_g1_prj* out,
        h2k_bn254_fr* scalars,
        h2k_bn254_g1_aff* points,
        int len)

The engine should provide a function wait that allows blocking until the result is ready.

void engine_wait(engine_t* engine_context, engine_future_t handle)

This should be flexible enough to wrap Cuda Streams and OpenCL events, see also my high-level description of:

In Halo2, wait MUST be called once and only once.
This allows memory reclamation of the handle and async task to be done before exiting wait.
It allows guarantees no double-free.

For flexibility in scheduling computation graphs, the future handle may escape its scope and be returned by the function that used msmAsync. The engine MUST support this use-case.

Init/shutdown

The engine should also provide an init and shutdown function and may add configuration options there like number of threads for CPU backends or target GPUs for GPU backends.

FFT

TBD

Naming

I propose we refer to the API as "Accel" in discussion as we might not want only hardware accelerator but also other software backends.

Future considerations

For better acceleration, teams are considering the whole prover on GPUs.
It would be interesting to know their feedback on the bottlenecks and if the async API covers that.

@JasonStarlight
Copy link

JasonStarlight commented Oct 1, 2023

Hi, this is great! It's a very good definition and description of the hardware API for MSM and FFT.
We also have some other ideas to share with you. This is our reference code on GitHub for the Halo2 hardware acceleration API.|
We access various hardware components, including some API and struct definitions, by establishing the "device manager" module. Additionally, we describe how halo2 interacts with the device manager module.

https://github.com/superscalar-io/halo2_device_sample/tree/master

We have tested the "device manager" module, and the results are correct and as expected. This project serves as an example with GPU devices, computational units for MSM and NTT. It can also support other devices and computational units.

Regarding ABI, in my personal view, Halo2 establishes the raw data format, and hardware manufacturers may tailor it to align with their hardware characteristics. Additionally, regardless of the coordinate system used for hardware acceleration, it is ultimately converted to Projective, which is in line with halo2. This approach is okay. However, it's important to be aware that there may be hidden risks related to coordinate system calculations and adaptations, such as the to_affine() operation. Therefore, extra consideration is needed in this area.

Welcome everyone to join the discussion, looking forward to your replies!
Thanks!

@einar-taiko
Copy link

Great work.

Regarding the computation graph - could you elaborate on what you model you had in mind here? I have a sense that the computation will need more info about the problem (e.g. #ops) but also on the available hardware to do any useful scheduling/ordering/distribution. Some of this info may have to be provided through the API.

Is it correct that we need to expose an async Rust wrapper, wrapping async C calls?

I would suggest size_t or ssize_t instead of int. E.g.

void h2k_mycurve_multi_scalar_mul(
        engine_t* engine_context,
        h2k_bn254_g1_prj* out,
        h2k_bn254_fr* scalars,
        h2k_bn254_g1_aff* points,
        size_t len)

See man 3 size_t.

@mratsim
Copy link
Author

mratsim commented Dec 5, 2023

A tentative trait has been proposed here privacy-scaling-explorations/halo2curves#107.

This is focused on stateless MSM to start the discussion and also as the biggest bottleneck. Further primitives would be:

  • cached MSM, where the bases are cached due minimize RAM <-> GPU/FPGA memcpy costs
  • FFTs
  • coset FFTs

An implementation tutorial is available here:

@adria0
Copy link
Member

adria0 commented Jun 10, 2024

Implemented in #277

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

No branches or pull requests

4 participants