Faster multiplication by generator #869

ValarDragon · 2023-01-27T08:08:43Z

I was skimming the code and couldn't find any sped-up routine for multiplying by the primitive root of the Goldilocks field.

I believe this is an operation that should be getting done on O(N) points for an N-element proof, due to taking an LDE of the input onto a coset. (Or having the codeword be a coset, though my recollection is that's actually slower.)

The Goldilocks field's smallest generator is 7, so there should be a faster method of computing it via additions. If there's a faster mul_by_2 routine on this field, then it could be (roughly):

def mul_by_primitive_root(x):
  x2 = x.mul2()
  x3 = x2
  x3 += x
  x6 = x3.mul2()
  x7 = x6
  x7 += x
  return x7

I don't know how many times the code actually multiplies by the primitive root, or how much of a speedup this would be in the small field, but thought I'd mention it in case it helps. (In Arkworks for BLS12-381, I at some point thought this would be more than 2x as fast as the general multiplication. But that's a much larger field, and the gain came from multiplying by two being twice as fast as normal addition.)
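For readers who want to try the idea, here is a minimal runnable sketch of the addition chain above, using plain Python integers instead of a real field type. The helper names (`add`, `double`, `mul_by_7_additions`, `mul_by_7_generic`) are illustrative assumptions and are not part of plonky2; `double` merely stands in for the hypothetical `mul2` routine. It checks correctness only and says nothing about speed.

```python
# Goldilocks prime: p = 2^64 - 2^32 + 1
P = 2**64 - 2**32 + 1

def add(a: int, b: int) -> int:
    """Field addition of two canonical elements (both in [0, P))."""
    s = a + b
    return s - P if s >= P else s

def double(a: int) -> int:
    """Stand-in for the hypothetical mul2 routine: doubling is a + a."""
    return add(a, a)

def mul_by_7_additions(x: int) -> int:
    """Compute 7 * x mod P with two doublings and two additions."""
    x2 = double(x)      # 2x
    x3 = add(x2, x)     # 3x
    x6 = double(x3)     # 6x
    return add(x6, x)   # 7x

def mul_by_7_generic(x: int) -> int:
    """Reference: general multiplication followed by a full reduction."""
    return (7 * x) % P
```

Both functions should agree on every canonical input, so the reference version can serve as a test oracle when benchmarking a real implementation.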


Nashtare · 2023-01-27T13:41:10Z

I haven't benchmarked it in the context of plonky2, so take my comment with a pinch of salt, but with this field size, the ratio between addition and multiplication is much smaller than with large fields (on BLS12-381, for instance, it is around 4x to 5x, if I recall correctly). In our curve implementation, the ratio is around 1.4, so I doubt such an approach would yield any advantage, but maybe my assumptions are wrong about how arithmetic is performed here, especially with assembly instructions.

EDIT: To add to this, in our approach, multiplying by the generator (or any u32 value) is done through a dedicated mul_by_u32() method, which mimics the reduction of regular field multiplication but ignores the highest limb (which is 0 here), making it faster than regular multiplication.
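As an illustration of the idea, here is a rough sketch of a `mul_by_u32`-style routine for the Goldilocks prime, written with Python integers. It is not the implementation referred to above and not plonky2 code: `reduce96` and `EPSILON` are names assumed here for illustration. It relies only on the identity 2^64 ≡ 2^32 - 1 (mod p), which follows from p = 2^64 - 2^32 + 1.

```python
P = 2**64 - 2**32 + 1      # Goldilocks prime
EPSILON = 2**32 - 1        # 2^64 mod P

def reduce96(n: int) -> int:
    """Reduce n < 2^96 modulo P using 2^64 = EPSILON (mod P)."""
    assert 0 <= n < 2**96
    lo = n & (2**64 - 1)   # low 64 bits
    hi = n >> 64           # high part, fits in 32 bits
    r = lo + hi * EPSILON  # stays below 2 * P, so one subtraction suffices
    return r - P if r >= P else r

def mul_by_u32(x: int, k: int) -> int:
    """Multiply a canonical element x in [0, P) by a 32-bit integer k."""
    assert 0 <= x < P and 0 <= k < 2**32
    return reduce96(x * k)  # the product is below 2^96
```

Because the high part of the product never exceeds 32 bits, the reduction needs a single multiply by `EPSILON` and one conditional subtraction, whereas a general 128-bit product needs an extra step for its top 32 bits.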

dlubarov · 2023-01-27T20:18:50Z

Yeah, seems like a good idea to special-case 7. As Robin said, it might be best to still use multiplication, but the reduction should be somewhat faster.

I think a 96-bit reduction would have the same cost as a 67-bit reduction (cc @nbgl, correct me if wrong), so we could implement mul_by_u32 and it might have other uses.
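To spell out the arithmetic behind the 67-bit versus 96-bit comparison, the bounds can be written as follows. This is only a sketch of the sizes involved, with $p = 2^{64} - 2^{32} + 1$ and the split $n = \mathrm{lo} + 2^{64}\,\mathrm{hi}$ chosen here for illustration; it makes no claim about measured cost.

$$
\begin{aligned}
2^{64} &\equiv 2^{32} - 1 \pmod p \\
n = \mathrm{lo} + 2^{64}\,\mathrm{hi} &\equiv \mathrm{lo} + (2^{32} - 1)\,\mathrm{hi} \pmod p \\
x < p:\quad 7x < 2^{67} &\implies \mathrm{hi} < 2^{3} \\
x < p,\ k < 2^{32}:\quad kx < 2^{96} &\implies \mathrm{hi} < 2^{32}
\end{aligned}
$$

In both cases $(2^{32} - 1)\,\mathrm{hi} < 2^{64}$, so the same sequence of steps (one multiply by $2^{32} - 1$, one addition, one conditional subtraction) handles either width.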

We used to do something like that for our (now deleted) CrandallField, and there's still a from_noncanonical_u96 method in the code, though it's not implemented yet for GoldilocksField.

I don't think multiplication by a generator is a significant cost for us so we might not immediately notice a speedup, but it could add up if we also used mul_by_u32 when evaluating certain constraints etc.

Nashtare mentioned this issue Feb 21, 2023

Add FFT-based specification for Poseidon MDS layer on x86 targets #886

Merged

Nashtare mentioned this issue Jul 31, 2023

Speed-up multiplication by small integers, and improve lookup part of compute_quotient_poly() #1153

Closed

Nashtare added the optimization Performance related changes label Oct 23, 2023

Nashtare added this to the Performance Tuning milestone Jul 12, 2024
