
ZK Abstraction Layer - PoC #107

Closed

Conversation

@mratsim mratsim commented Nov 29, 2023

PoC branch of the ZK Abstraction Layer.

Next steps:

  • Propagate the changes up to Halo2 and see how intrusive they are.
  • Implement a non-Rust ZAL backend to show how to glue accelerators with Halo2.

Out-of-scope

We keep the PoC simple and effective so that it can serve as a basis for discussion:

  • in terms of API
  • in terms of release management, as this requires breaking all users' APIs by introducing a ZAL engine parameter everywhere. I.e. we might want to:
    • tag a release with all the improvements made and pending quick wins, so that an "old interface" release, possibly an LTS, is available (say a 6-month transition period).
    • tag a release specifically for this change for Halo2 and halo2curves.

Hence the following are out-of-scope:

  • MSM with cached bases
  • FFTs
  • coset FFTs
  • async API

mratsim commented Dec 4, 2023

I can confirm that the current ZAL API proposal works with a non-Rust ZAL backend, in this case Constantine.

I've created a document for accelerator providers that outlines how to integrate the halo2curves ZAL with a C backend: https://github.com/mratsim/constantine/blob/master/docs/zk_accel_layer.md

mratsim commented Dec 5, 2023

I've added a caching API for discussion with opaque handles to the coefficients and base of the MSM.

halo2curves/src/zal.rs

Lines 42 to 81 in 014cf86

// The ZK Accel Layer API
// ---------------------------------------------------
pub trait ZalEngine {}

pub trait MsmAccel<C: CurveAffine>: ZalEngine {
    fn msm(&self, coeffs: &[C::Scalar], base: &[C]) -> C::Curve;

    // Caching API
    // -------------------------------------------------
    // From here we propose an extended API
    // that allows reusing coeffs and/or the base points.
    //
    // This is inspired by the descriptor usage of the
    // cuDNN API (Nvidia GPU): https://docs.nvidia.com/deeplearning/cudnn/api/index.html#cudnn-ops-infer-so-opaque
    // and the oneDNN API (CPU, OpenCL): https://github.com/oneapi-src/oneDNN/blob/master/doc/programming_model/basic_concepts.md
    //
    // Descriptors are opaque pointers that hold the input in a format suitable for the accelerator engine.
    // They may be:
    // - Input moved on the accelerator device (only once for repeated calls)
    // - Endianness conversion
    // - Conversion from Montgomery to canonical form
    // - Input changed from projective to Jacobian coordinates or even to a twisted Edwards curve
    // - other forms of expensive preprocessing

    type CoeffsDescriptor<'c>;
    type BaseDescriptor<'b>;

    fn get_coeffs_descriptor<'c>(&self, coeffs: &'c [C::Scalar]) -> Self::CoeffsDescriptor<'c>;
    fn get_base_descriptor<'b>(&self, base: &'b [C]) -> Self::BaseDescriptor<'b>;

    // Execute MSM according to descriptors
    // Unsure of naming: msm_with_cached_inputs, msm_apply, msm_cached, msm_with_descriptors, ...
    fn msm_with_cached_scalars(&self, coeffs: &Self::CoeffsDescriptor<'_>, base: &[C]) -> C::Curve;
    fn msm_with_cached_base(&self, coeffs: &[C::Scalar], base: &Self::BaseDescriptor<'_>) -> C::Curve;
    fn msm_with_cached_inputs(&self, coeffs: &Self::CoeffsDescriptor<'_>, base: &Self::BaseDescriptor<'_>) -> C::Curve;
}

I don't know if caching scalars is useful; it can be removed if not.

I expect the following use cases for computations using the same inputs:

  • caching on the accelerator device (i.e. avoiding copies)
  • caching endianness conversion
  • caching conversion from Montgomery to canonical form
  • caching conversion from projective to Jacobian coordinates, or even to a twisted Edwards curve
  • caching other forms of expensive preprocessing
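These reuse patterns can be sketched with a minimal, self-contained toy engine. This is not the proposed halo2curves API: u64 stands in for both scalars and curve points, only the base-caching path is shown, and all names are illustrative.

```rust
// Toy sketch of the descriptor-caching pattern (not the real API).
struct CpuEngine;

// Opaque handle holding the base in engine-ready form. A GPU engine
// would hold a device pointer here instead of a borrowed slice.
struct BaseDescriptor<'b> {
    raw: &'b [u64],
}

impl CpuEngine {
    fn get_base_descriptor<'b>(&self, base: &'b [u64]) -> BaseDescriptor<'b> {
        // Expensive preprocessing (device upload, endianness/Montgomery
        // conversion, ...) would happen once, here.
        BaseDescriptor { raw: base }
    }

    fn msm_with_cached_base(&self, coeffs: &[u64], base: &BaseDescriptor<'_>) -> u64 {
        // Toy "MSM": an inner product instead of scalar multiplications.
        coeffs.iter().zip(base.raw.iter()).map(|(c, b)| c * b).sum()
    }
}

fn main() {
    let engine = CpuEngine;
    let base = [2u64, 3, 5];
    // Pay the preprocessing once, reuse across many calls:
    let base_d = engine.get_base_descriptor(&base);
    assert_eq!(engine.msm_with_cached_base(&[1, 1, 1], &base_d), 10);
    assert_eq!(engine.msm_with_cached_base(&[1, 0, 1], &base_d), 7);
}
```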

mratsim commented Dec 5, 2023

One thing missing here is error handling, for example when the GPU runs out of memory or GPU device 10 is unavailable.

At the Halo2 and halo2curves level those errors seem unrecoverable, so I think the trait impl should just panic.
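A sketch of what panicking in the trait impl could look like. `gpu_msm` is a hypothetical stand-in simulating a device call that can fail; nothing here is part of the actual proposal.

```rust
// Hypothetical accelerator call (stand-in): simulate running out of
// device memory above some size.
fn gpu_msm(size: usize) -> Result<u64, String> {
    if size > (1 << 20) {
        Err("GPU out of memory".to_string())
    } else {
        Ok(42)
    }
}

// The trait impl surfaces backend failures as panics instead of
// threading a Result through every prover API:
fn msm_or_panic(size: usize) -> u64 {
    gpu_msm(size).unwrap_or_else(|e| panic!("ZAL backend failure: {e}"))
}

fn main() {
    assert_eq!(msm_or_panic(1024), 42);
}
```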

mratsim commented Dec 11, 2023

Looking at how Ingonyama integrated with EZKL fork of halo2, it seems like HW accel is only worth starting from a certain size threshold:

https://github.com/zkonduit/halo2/pull/3/files

pub fn should_use_cpu_msm(size: usize) -> bool {
    size <= (1 << u8::from_str_radix(&env::var("ICICLE_SMALL_K").unwrap_or("8".to_string()), 10).unwrap())
}

Instead of increasing the noise in the prover code, the msm function can do this internally to pick whether to use the builtin Halo2 MSM or HW accel.
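A hedged sketch of folding the cutoff into msm itself, mirroring the ICICLE_SMALL_K check above; `cpu_msm` and `gpu_msm` are toy stand-ins, not real backends.

```rust
use std::env;

// Reads the cutoff exponent, defaulting to 8 (so 2^8 = 256 points).
fn small_k_threshold() -> u32 {
    env::var("ICICLE_SMALL_K")
        .ok()
        .and_then(|v| v.parse().ok())
        .unwrap_or(8)
}

// Stand-ins for the builtin Halo2 MSM and the accelerated one.
fn cpu_msm(_coeffs: &[u64]) -> &'static str { "cpu" }
fn gpu_msm(_coeffs: &[u64]) -> &'static str { "gpu" }

// The prover just calls `msm`; the engine picks the backend internally.
fn msm(coeffs: &[u64]) -> &'static str {
    if coeffs.len() <= (1usize << small_k_threshold()) {
        cpu_msm(coeffs)
    } else {
        gpu_msm(coeffs)
    }
}

fn main() {
    assert_eq!(msm(&[0u64; 16]), "cpu");      // 16 <= 256: small, stay on CPU
    assert_eq!(msm(&vec![0u64; 300]), "gpu"); // 300 > 256: offload
}
```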

cc @jeremyfelder

@jeremyfelder
it seems like HW accel is only worth starting from a certain size threshold:

@mratsim I wouldn't say this is true across the board. For now, our MSM algorithm isn't quite as performant on smaller circuits. We do have a batching method that performs multiple MSMs in one GPU call, which is performant for smaller circuits; this env variable toggles where that cutoff is.

Instead of increasing the noise in the prover code, the msm function can do this internally to pick whether to use the builtin Halo2 MSM or HW accel.

It's possible for the msm function implemented for a specific accelerator to do this internally and fall back to CPU if needed; we essentially did this within the EZKL integration as well.

The only caveat I'll mention is that using our batch MSM usually requires integration further up the call stack. I think it is possible to bring it into the msm function as well, though: write any commitments as if they will be a batch, and let the msm function decide whether to use batch or single calls depending on some user-defined configuration like SMALL_CIRCUIT or NUM_{ADVICE|LOOKUPS|etc}.

See here for an example of moving to batched operations

mratsim commented Jan 12, 2024

Turns out the caching API is somewhat problematic because of dyn trait, see https://rust-lang.github.io/async-fundamentals-initiative/evaluation/challenges/dyn_traits.html


In the current full PoC taikoxyz/halo2#14, I use dyn Trait so that the code is not monomorphized, and when using Halo2 as a library the ZalEngine can be changed without recompilation.

Alternatives:

  • Don't use &dyn MsmAccel<C>; use &impl MsmAccel<C> instead. Tradeoff: code size (the same code is compiled once per backend), and backends cannot be used as "plugins" when using Halo2 as a .dll/.so.
  • Remove the caching API. Tradeoff: it's a new feature, so we don't lose anything. It may be possible to reintroduce it via boxing the descriptors.
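The two call styles can be illustrated with a stripped-down trait (no associated types, toy u64 "MSM"), showing that both compile once the caching API is out of the picture:

```rust
// Stripped-down trait with no associated types, so both call styles
// work; adding the caching API's associated types would break prove_dyn.
trait MsmAccel {
    fn msm(&self, coeffs: &[u64]) -> u64;
}

struct H2cEngine;
impl MsmAccel for H2cEngine {
    fn msm(&self, coeffs: &[u64]) -> u64 {
        coeffs.iter().sum()
    }
}

// `&impl`: monomorphized, one compiled instance per engine type.
fn prove_generic(engine: &impl MsmAccel, coeffs: &[u64]) -> u64 {
    engine.msm(coeffs)
}

// `&dyn`: a single compiled instance, engine swappable at runtime
// (e.g. loaded from a .so), but requires an object-safe trait.
fn prove_dyn(engine: &dyn MsmAccel, coeffs: &[u64]) -> u64 {
    engine.msm(coeffs)
}

fn main() {
    let engine = H2cEngine;
    assert_eq!(prove_generic(&engine, &[1, 2, 3]), 6);
    assert_eq!(prove_dyn(&engine, &[1, 2, 3]), 6);
}
```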

CPerezz commented Jan 17, 2024

First round of thoughts from @ed255 and me: https://hackmd.io/F6W1U6X8SLCqTciJt0nfiQ?edit

Happy to discuss more in a meeting or whatever works best

src/zal.rs Outdated
//! - an initialization function:
//!   - either "fn new() -> ZalEngine" for simple libraries
//!   - or a builder pattern for complex initializations
//! - a shutdown function.
Member

What about handling shutdown via Drop?

Contributor Author

Yes, as discussed it's what makes the most sense.
In general this is a recommendation to document how the engine should be destroyed.
If the end-user doesn't need to do anything, even better.

src/zal.rs Outdated
Comment on lines 68 to 69
type CoeffsDescriptor<'c>;
type BaseDescriptor<'b>;
Member

Could these descriptors be a non-associated type? For example:

struct CoeffsDescriptor(u64);
struct BaseDescriptor(u64);

And then the engine holds a mapping between the ids and a possible internal representation of the descriptors? This way we simplify this trait. Also, I'm not sure if associated types are compatible with trait objects, which may matter for how MsmAccel is used.
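A self-contained sketch of this id-handle alternative, with toy u64 types and illustrative names; note that the engine now has to own state behind the ids:

```rust
use std::cell::RefCell;
use std::collections::HashMap;

// Plain newtype handle instead of an associated type.
struct BaseDescriptor(u64);

// The engine owns the internal (possibly device-side) representation
// behind each id.
struct Engine {
    next_id: RefCell<u64>,
    bases: RefCell<HashMap<u64, Vec<u64>>>,
}

impl Engine {
    fn new() -> Self {
        Engine { next_id: RefCell::new(0), bases: RefCell::new(HashMap::new()) }
    }

    fn get_base_descriptor(&self, base: &[u64]) -> BaseDescriptor {
        let mut id = self.next_id.borrow_mut();
        *id += 1;
        // Copy (and, on a real device, upload/convert) the bases.
        self.bases.borrow_mut().insert(*id, base.to_vec());
        BaseDescriptor(*id)
    }

    fn msm_with_cached_base(&self, coeffs: &[u64], base: &BaseDescriptor) -> u64 {
        let bases = self.bases.borrow();
        let b = bases.get(&base.0).expect("descriptor from another engine");
        coeffs.iter().zip(b.iter()).map(|(c, p)| c * p).sum()
    }
}

fn main() {
    let engine = Engine::new();
    let d = engine.get_base_descriptor(&[2, 3, 5]);
    assert_eq!(engine.msm_with_cached_base(&[1, 1, 1], &d), 10);
}
```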

Contributor Author

It's a possibility, but that requires engines to hold state. While Rust has hashmaps in its standard library, in C or C++ it becomes annoying.

I think it's easier for now to use associated types and replace the &dyn with &impl.


fn get_coeffs_descriptor<'c>(&self, coeffs: &'c [C::Scalar]) -> Self::CoeffsDescriptor<'c> {
    // Do expensive device/library specific preprocessing here
    Self::CoeffsDescriptor { raw: coeffs }
}
Member

I think it would be hard to implement this following my previous comment. I was assuming the engine would copy the coeffs and base (and possibly transform them), but in this case it's just a reference.

Maybe with a Box<dyn Descriptor<'a>>, where the Descriptor trait is empty, and then the engine just casts it to the internal type?
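One way to sketch the boxed-descriptor idea is with std::any::Any, letting the engine downcast the opaque box back to its concrete internal type. Toy u64 types and illustrative names; the real lifetimes would be trickier.

```rust
use std::any::Any;

// Engine-internal descriptor type, hidden behind Box<dyn Any>.
struct CpuBase(Vec<u64>);

struct CpuEngine;

impl CpuEngine {
    fn get_base_descriptor(&self, base: &[u64]) -> Box<dyn Any> {
        // Copy (or upload) the bases into the engine's own representation.
        Box::new(CpuBase(base.to_vec()))
    }

    fn msm_with_cached_base(&self, coeffs: &[u64], base: &dyn Any) -> u64 {
        // Cast the opaque descriptor back to this engine's concrete type.
        let b = base
            .downcast_ref::<CpuBase>()
            .expect("descriptor created by a different engine");
        coeffs.iter().zip(b.0.iter()).map(|(c, p)| c * p).sum()
    }
}

fn main() {
    let engine = CpuEngine;
    let d = engine.get_base_descriptor(&[2, 3, 5]);
    assert_eq!(engine.msm_with_cached_base(&[1, 1, 1], d.as_ref()), 10);
}
```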

Member

I think there's a really interesting article related to that! https://smallcultfollowing.com/babysteps/blog/2022/03/29/dyn-can-we-make-dyn-sized/

In any case, we might be able to not require the box if Descriptor is Copy, right? The box is the only ugly part of this.

@mratsim commented Jan 22, 2024

In this case it is just a reference, but that's an artifact of the H2cEngine being CPU-based, so no copies are required.

For GPU and FPGA engines there will be a copy to device memory, except for Apple Metal thanks to unified memory.
That said, once copied we will likely still hold an (owned) reference, as we would be in C FFI land.

Contributor Author

In any case, we might be able to not require the box if Descriptor is Copy right? Which is the only ugly part of this.

For me this makes sense; I don't see a use-case for non-Copy descriptors.

The most complex descriptors should be:

  • pointer(s) to data
  • size of data
  • data-in layout (for example canonical or bit-reversed permuted)
  • data-out layout
  • enum or integer ID for operation (MSM, FFT, coset FFT)
  • device ID(s)

and if those are needed, maybe those data structures should live in a hashmap in the engine, and the engine just returns a handle ID as the descriptor.

pub trait ZalEngine {}

pub trait MsmAccel<C: CurveAffine>: ZalEngine {
    fn msm(&self, coeffs: &[C::Scalar], base: &[C]) -> C::Curve;
Member

The user of this trait can either use this method, or the cached version. Perhaps this method could have a default implementation that uses the cached version underneath?
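On a stripped-down toy trait (u64 stand-ins, base-caching only, illustrative names), that default could look like this: implementers would only write the cached path and get the stateless msm for free.

```rust
// Toy trait: `msm` gets a default implementation routed through the
// caching entry points, so a backend only writes the cached path.
trait MsmAccel {
    type BaseDescriptor;

    fn get_base_descriptor(&self, base: &[u64]) -> Self::BaseDescriptor;
    fn msm_with_cached_base(&self, coeffs: &[u64], base: &Self::BaseDescriptor) -> u64;

    // Default: stateless msm in terms of the cached version.
    fn msm(&self, coeffs: &[u64], base: &[u64]) -> u64 {
        let d = self.get_base_descriptor(base);
        self.msm_with_cached_base(coeffs, &d)
    }
}

struct CpuEngine;
impl MsmAccel for CpuEngine {
    type BaseDescriptor = Vec<u64>;

    fn get_base_descriptor(&self, base: &[u64]) -> Vec<u64> {
        base.to_vec()
    }

    fn msm_with_cached_base(&self, coeffs: &[u64], base: &Vec<u64>) -> u64 {
        // Toy "MSM": an inner product.
        coeffs.iter().zip(base.iter()).map(|(c, b)| c * b).sum()
    }
}

fn main() {
    assert_eq!(CpuEngine.msm(&[1, 2], &[10, 20]), 50);
}
```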

Contributor Author

I think that can be a future optimization/discussion point once there is more provider feedback.

Implementers always start with the stateless msm and later, when they hit bottlenecks, implement the cached version. So they'll likely focus on msm and have the cached versions fall back on the stateless msm, like I did in:

halo2curves/src/zal.rs

Lines 117 to 127 in 2e6b89a

fn msm_with_cached_scalars(&self, coeffs: &Self::CoeffsDescriptor<'_>, base: &[C]) -> C::Curve {
    best_multiexp(coeffs.raw, base)
}

fn msm_with_cached_base(&self, coeffs: &[C::Scalar], base: &Self::BaseDescriptor<'_>) -> C::Curve {
    best_multiexp(coeffs, base.raw)
}

fn msm_with_cached_inputs(&self, coeffs: &Self::CoeffsDescriptor<'_>, base: &Self::BaseDescriptor<'_>) -> C::Curve {
    best_multiexp(coeffs.raw, base.raw)
}

mratsim commented Feb 19, 2024

Following discussion on Thursday (2024-02-15), superseded by privacy-scaling-explorations/halo2#277
