Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Faster multiplication by generator #869

Open
ValarDragon opened this issue Jan 27, 2023 · 2 comments
Open

Faster multiplication by generator #869

ValarDragon opened this issue Jan 27, 2023 · 2 comments
Labels
optimization Performance related changes

Comments

@ValarDragon
Copy link
Collaborator

Was skimming the code, and couldn't find any sped up routine for multiplying by the primitive root for the goldilocks field.

I believe this is an operation that should be getting done on O(N) points for an N-element proof, due to taking a LDE of the input onto a coset. (Or having the codeword be a coset, though my recollection is thats actually slower)

The goldilocks field's smallest generator is 7, so should have a faster method of computing it via additions. If theres a faster mul_by_2 routine on this field, then it could be (roughly):

def mul_by_primitive_root(x):
  x2 = x.mul2()
  x3 = x2
  x3 += x
  x6 = x3.mul2()
  x7 = x6
  x7 += x
  return x7

Unknown how many times in the code you actually multiply by the primitive root, or how much of a speedup factor this would be in the small field, but thought I'd mention it in case it helps. (In Arkworks for BLS12-381, I at some point thought this would be more than 2x as fast as the general multiplication. But thats a much larger field, and because multiplying by two was twice as fast as normal addition)

@Nashtare
Copy link
Collaborator

Nashtare commented Jan 27, 2023

I haven't benchmarked it in the context of plonky2, so my comment is to be taken with a pinch of salt, but with this field size, the ratio between addition and multiplication is much smaller than with large fields (like on BLS12-381 for instance where it is around 4x/5x if I recall correctly?). In our curve implementation, the ratio is around 1.4, so I doubt such approach would yield any advantage, but maybe my assumptions are wrong with how arithmetic is performed here, especially with assembly instructions..

EDIT: To complete, in our approach, multiplying by the generator (or any u32 value) is done through a specific mul_by_u32() method which mimics the reduction of regular field multiplication, but ignores the highest limb (which is 0 here), effectively being faster than regular multiplication.

@dlubarov
Copy link
Contributor

dlubarov commented Jan 27, 2023

Yeah seems like a good idea to special case 7. As Robin said it might be best to still use multiplication, but the reduction should be somewhat faster.

I think a 96-bit reduction would have the same cost as a 67-bit reduction (cc @nbgl, correct me if wrong), so we could implement mul_by_u32 and it might have other uses.

We used to do something like that for our (now deleted) CrandallField, and there's still a from_noncanonical_u96 method in the code, though it's not implemented yet for GoldilocksField.

I don't think multiplication by a generator is a significant cost for us so we might not immediately notice a speedup, but it could add up if we also used mul_by_u32 when evaluating certain constraints etc.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
optimization Performance related changes
Projects
Status: Backlog
3 participants