Optimization potential in _template_inline.txt? #1

Open
mreineck opened this issue Dec 29, 2024 · 4 comments

Comments

@mreineck

Looking at the code in https://github.com/SepandKashani/fourier_toolkit/blob/master/src/fourier_toolkit/spread/_template_inline.txt, I think there might be a way to greatly reduce the number of Horner kernel evaluations.

If we assume, for example, a 3D transform with a kernel that has 8x8x8 support, and if I'm reading the code correctly, this line

k *= _phi(a[d] * (z[d][idx[d]] - x[m, d]))

is executed 8*8*8*3 = 1536 times for every non-uniform point.
However, across all these 1536 calls, the _phi function only ever sees 24 distinct arguments (8 for each point of the support times 3 for the number of dimensions), so it might be beneficial to compute these 24 values once outside the offset loop and just multiply the appropriate ones together inside it.
I'm not sure how smart Numba is, but I doubt it can perform this kind of optimization on its own.
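
Concretely, here's a minimal sketch of what I have in mind, in plain Python/NumPy (the names phi, a, z, x_m, idx0 and the fixed 8-wide support just mirror the snippet above and are assumptions, not the actual fourier_toolkit code):

```python
import numpy as np

def spread_point_3d(phi, a, z, x_m, idx0, w, out, support=8):
    """Spread the weight `w` of one non-uniform point onto the grid `out`."""
    ndim = 3

    # Evaluate the kernel once per (dimension, offset) pair:
    # 3 * support values instead of 3 * support**3 calls inside the offset loop.
    k1d = np.empty((ndim, support))
    for d in range(ndim):
        for s in range(support):
            k1d[d, s] = phi(a[d] * (z[d][idx0[d] + s] - x_m[d]))

    # Inside the offset loop, the separable kernel is just a product
    # of the precomputed per-dimension values.
    for offset in np.ndindex(support, support, support):
        k = w
        for d in range(ndim):
            k *= k1d[d, offset[d]]
        out[idx0[0] + offset[0], idx0[1] + offset[1], idx0[2] + offset[2]] += k
```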

@SepandKashani
Owner

Hi @mreineck ,

You are right: I factored out the kernel evals to reduce their total count.
In 3D with 100k points (nthreads=1), the runtime of UniformSpread.apply decreases from 2.7s to 1.2s.
The gain is 10-15% with more threads though, as we discussed in the past, but it's still worth taking advantage of.

The latest commit on master fixes this; thanks!
(And I'll answer your email tomorrow ;) )

@mreineck
Author

mreineck commented Jan 2, 2025

Thanks a lot for giving this a try! I'm happy that it helps, although I had expected more of a speedup ... in principle, it should be possible to process roughly 5 million points per CPU core per second in 3D with a support of 8**3; Numba doesn't appear to be optimizing very well in this situation :-(
In any case, I now understand much better why precomputing and re-using the kernel values is so advantageous with this implementation!

[Edit: the 5 million figure is what I see with Finufft and ducc, and it scales (though not perfectly) with increasing number of threads. I expect that the np.ndindex construct is a big factor in the remaining slowdown. Explicit loops may actually help here, but that would of course require separate code paths for 1/2/3D...]
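
For illustration, a rough sketch of such a specialized 3D path next to the generic np.ndindex version (k1d is assumed to hold the precomputed per-dimension kernel values from the earlier sketch; neither function is actual fourier_toolkit, Finufft, or ducc code):

```python
import numpy as np

def accumulate_generic(k1d, w, out, idx0, support):
    # Generic N-D offset loop via np.ndindex: each iteration builds an index
    # tuple dynamically, which Numba may not compile to tight machine code.
    for offset in np.ndindex(support, support, support):
        k = w
        for d in range(3):
            k *= k1d[d, offset[d]]
        out[idx0[0] + offset[0], idx0[1] + offset[1], idx0[2] + offset[2]] += k

def accumulate_3d(k1d, w, out, idx0, support):
    # Dimension-specialized 3D path: plain nested loops, with the partial
    # kernel products hoisted out of the inner loops.
    for i in range(support):
        wi = w * k1d[0, i]
        for j in range(support):
            wij = wi * k1d[1, j]
            for l in range(support):
                out[idx0[0] + i, idx0[1] + j, idx0[2] + l] += wij * k1d[2, l]
```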

@SepandKashani
Owner

> [Edit: the 5 million figure is what I see with Finufft and ducc, and it scales (though not perfectly) with increasing number of threads. I expect that the np.ndindex construct is a big factor in the remaining slowdown. Explicit loops may actually help here, but that would of course require separate code paths for 1/2/3D...]

I agree that more code specialization should help here. My goal with this UniformSpread implementation was to have something replicating the spreading algorithm described in the FINUFFT paper, while being moderately fast and generic.

By doing so I could show that spread/interp factorization in a T3 (and variants of T1/T2 transforms) makes a huge difference when non-uniform knots are sparsely distributed.

Obviously I won't say no to better spread/interp performance, but since supporting sparsity requires only minor changes to a NUFFT codebase, it is better to add support for it in a library such as FINUFFT rather than to deploy yet another library. I will discuss this with Alex soon.

@mreineck
Author

mreineck commented Jan 3, 2025

Please don't get me wrong! :-) While I'm of course always fascinated by accelerating implementations, what I actually wanted to understand with this experiment was: assuming the spreader/interpolator is as fast as it is in finufft, is there actually any gain in precomputing the kernels and applying them to several transforms afterwards, or can we be (almost) as good if we compute the kernels on the fly every time and treat every transform in isolation? It seems there is hope for the latter scenario, which would make me rather happy, since the libraries then won't need some rather complicated changes.
