Optimization potential in _template_inline.txt? #1

Open
mreineck opened this issue Dec 29, 2024 · 4 comments

Comments

@mreineck

Looking at the code in https://github.com/SepandKashani/fourier_toolkit/blob/master/src/fourier_toolkit/spread/_template_inline.txt, I think there might be a way to greatly reduce the number of Horner kernel evaluations.

If we assume, for example, a 3D transform with a kernel that has 8x8x8 support, and if I'm reading the code correctly, this line

k *= _phi(a[d] * (z[d][idx[d]] - x[m, d]))

is executed 8*8*8*3 = 1536 times for every non-uniform point.
However, across all these 1536 calls, the _phi function only ever sees 24 distinct arguments (8 for each point of the support times 3 for the number of dimensions), so it might be beneficial to compute these 24 values once outside the offset loop and just multiply the appropriate ones together inside it.
I'm not sure how smart Numba is, but I doubt it can perform this kind of optimization on its own.
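
Concretely, here's a minimal sketch of what I have in mind, in plain Python/NumPy (the names phi, a, z, x_m, idx0 and the fixed 8-wide support just mirror the snippet above and are assumptions, not the actual fourier_toolkit code):

```python
import numpy as np

def spread_point_3d(phi, a, z, x_m, idx0, w, out, support=8):
    """Spread the weight `w` of one non-uniform point onto the grid `out`."""
    ndim = 3

    # Evaluate the kernel once per (dimension, offset) pair:
    # 3 * support values instead of 3 * support**3 calls inside the offset loop.
    k1d = np.empty((ndim, support))
    for d in range(ndim):
        for s in range(support):
            k1d[d, s] = phi(a[d] * (z[d][idx0[d] + s] - x_m[d]))

    # Inside the offset loop, the separable kernel is just a product
    # of the precomputed per-dimension values.
    for offset in np.ndindex(support, support, support):
        k = w
        for d in range(ndim):
            k *= k1d[d, offset[d]]
        out[idx0[0] + offset[0], idx0[1] + offset[1], idx0[2] + offset[2]] += k
```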

@SepandKashani
Owner

Hi @mreineck ,

You are right: I factored out the kernel evals to reduce their total count.
In 3D with 100k points (nthreads=1), the runtime of UniformSpread.apply decreases from 2.7s to 1.2s.
The gain is 10-15% with more threads though, as we discussed in the past, but it's still worth taking advantage of.

The latest commit on master fixes this; thanks!
(And I'll answer your email tomorrow ;) )

@mreineck
Author

mreineck commented Jan 2, 2025

Thanks a lot for giving this a try! I'm happy that it helps, although I had expected more of a speedup ... in principle, it should be possible to process roughly 5 million points per CPU core per second in 3D with a support of 8**3; Numba doesn't appear to be optimizing very well in this situation :-(
In any case, I now understand much better why precomputing and re-using the kernel values is so advantageous with this implementation!

[Edit: the 5 million figure is what I see with Finufft and ducc, and it scales (though not perfectly) with increasing number of threads. I expect that the np.ndindex construct is a big factor in the remaining slowdown. Explicit loops may actually help here, but that would of course require separate code paths for 1/2/3D...]
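
For illustration, a rough sketch of such a specialized 3D path next to the generic np.ndindex version (k1d is assumed to hold the precomputed per-dimension kernel values from the earlier sketch; neither function is actual fourier_toolkit, Finufft, or ducc code):

```python
import numpy as np

def accumulate_generic(k1d, w, out, idx0, support):
    # Generic N-D offset loop via np.ndindex: each iteration builds an index
    # tuple dynamically, which Numba may not compile to tight machine code.
    for offset in np.ndindex(support, support, support):
        k = w
        for d in range(3):
            k *= k1d[d, offset[d]]
        out[idx0[0] + offset[0], idx0[1] + offset[1], idx0[2] + offset[2]] += k

def accumulate_3d(k1d, w, out, idx0, support):
    # Dimension-specialized 3D path: plain nested loops, with the partial
    # kernel products hoisted out of the inner loops.
    for i in range(support):
        wi = w * k1d[0, i]
        for j in range(support):
            wij = wi * k1d[1, j]
            for l in range(support):
                out[idx0[0] + i, idx0[1] + j, idx0[2] + l] += wij * k1d[2, l]
```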

@SepandKashani
Owner

> [Edit: the 5 million figure is what I see with Finufft and ducc, and it scales (though not perfectly) with increasing number of threads. I expect that the np.ndindex construct is a big factor in the remaining slowdown. Explicit loops may actually help here, but that would of course require separate code paths for 1/2/3D...]

I agree that more code specialization should help here. My goal with this UniformSpread implementation was to have something replicating the spreading algorithm described in the FINUFFT paper, while being moderately fast and generic.

By doing so I could show that spread/interp factorization in a T3 (and variants of T1/T2 transforms) makes a huge difference when non-uniform knots are sparsely distributed.

Obviously I won't say no to better spread/interp performance, but since supporting sparsity requires only minor changes to a NUFFT codebase, it is better to add support for it in a library such as FINUFFT rather than to deploy yet another library. I will discuss this with Alex soon.

@mreineck
Author

mreineck commented Jan 3, 2025

Please don't get me wrong! :-) While I'm of course always fascinated by accelerating implementations, what I actually wanted to understand with this experiment was: assuming the spreader/interpolator is as fast as it is in finufft, is there actually any gain in precomputing the kernels and applying them to several transforms afterwards, or can we be (almost) as good if we compute the kernels on the fly every time and treat every transform in isolation? It seems there is hope for the latter scenario, which would make me rather happy, since the libraries then won't need some rather complicated changes.
