Optimization potential in _template_inline.txt? #1
Comments
Hi @mreineck, you are right: I factored out the kernel evaluations to reduce their total count. The latest commit on master fixes this; thanks!
Thanks a lot for giving this a try! I'm happy that it helps, although I had expected it to produce more of a speedup ... in principle, it should be possible to process roughly 5 million points per CPU core and second in 3D with a support of 8**3; Numba doesn't appear to be optimizing very well in this situation :-( [Edit: the 5 million figure is what I see with FINUFFT and ducc, and it scales (though not perfectly) with an increasing number of threads. I expect that the …]
I agree that more code specialization should help here. My goal with this […] By doing so I could show that spread/interp factorization in a T3 transform (and variants of T1/T2 transforms) makes a huge difference when the non-uniform knots are sparsely distributed. Obviously I won't say no to better spread/interp performance, but since supporting sparsity requires only minor changes to a NUFFT codebase, it is better to add support for it to a library such as FINUFFT than to deploy yet another library. I will discuss this with Alex soon.
Please don't get me wrong! :-) While I'm of course always fascinated by accelerating implementations, what I actually wanted to understand with this experiment was: assuming that the spreader/interpolator is as fast as it is in FINUFFT, is there actually any gain in precomputing the kernels and applying them to several transforms afterwards, or can we be (almost) as good if we compute the kernels on the fly every time and look at every transform in isolation? It seems that there is hope for the latter scenario, which would make me rather happy, since then the libraries won't need some rather complicated changes.
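To make the two scenarios concrete, here is a minimal, hypothetical 1D sketch of the tradeoff being discussed: evaluating the kernel on the fly inside the spreading loop for each transform, versus tabulating the per-point kernel weights once and reusing them across several transforms. The Gaussian kernel, the function names, and the data layout are illustrative assumptions only and do not correspond to the actual fourier_toolkit or FINUFFT code.

```python
import numpy as np

def kernel(dx, beta=2.3):
    # Hypothetical stand-in kernel; real codes evaluate an ES/Horner-form kernel.
    return np.exp(-beta * dx * dx)

def spread_on_the_fly(x, c, n_grid, w=8):
    # Scenario 1: kernel weights are recomputed inside the loop, per transform.
    g = np.zeros(n_grid, dtype=complex)
    for xj, cj in zip(x, c):
        i0 = int(np.floor(xj)) - w // 2
        for k in range(w):
            g[(i0 + k) % n_grid] += cj * kernel(i0 + k - xj)
    return g

def precompute_weights(x, n_grid, w=8):
    # Scenario 2a: tabulate indices and kernel weights once per point set.
    idx = np.empty((len(x), w), dtype=np.int64)
    wts = np.empty((len(x), w))
    for j, xj in enumerate(x):
        i0 = int(np.floor(xj)) - w // 2
        for k in range(w):
            idx[j, k] = (i0 + k) % n_grid
            wts[j, k] = kernel(i0 + k - xj)
    return idx, wts

def spread_precomputed(c, idx, wts, n_grid):
    # Scenario 2b: apply the tabulated weights; no kernel evaluations here,
    # so their cost is amortized over all transforms sharing the same points.
    g = np.zeros(n_grid, dtype=complex)
    for j in range(len(c)):
        g[idx[j]] += c[j] * wts[j]
    return g

rng = np.random.default_rng(0)
x = rng.uniform(4.0, 60.0, size=100)   # non-uniform locations (grid units)
c = rng.standard_normal(100) + 1j * rng.standard_normal(100)
idx, wts = precompute_weights(x, 64)
assert np.allclose(spread_on_the_fly(x, c, 64), spread_precomputed(c, idx, wts, 64))
```

Both paths produce the same grid; the question raised above is whether the precomputed path buys enough speed over a fast on-the-fly evaluator to justify the extra bookkeeping and memory.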
Looking at the code in https://github.com/SepandKashani/fourier_toolkit/blob/master/src/fourier_toolkit/spread/_template_inline.txt, I think there might be a way to greatly reduce the number of Horner kernel evaluations.

If we assume, for example, a 3D transform with a kernel that has 8x8x8 support, and if I'm reading the code correctly, the kernel evaluation on line 175 of fourier_toolkit/src/fourier_toolkit/spread/_template_inline.txt (at commit c38b199) is executed 8*8*8*3 = 1536 times for every non-uniform point. However, of all these 1536 calls, the `_phi` function is only ever called with 24 distinct arguments (8 for each point of the support times 3 for the number of dimensions), so I'm wondering if it may not be beneficial to compute these 24 values outside of the `offset` loop and just multiply them together appropriately inside this loop. I'm not sure how smart `numba` is, but I doubt that it will be able to perform this kind of optimization on its own.
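To illustrate the suggestion, here is a rough sketch of the two loop structures for a single non-uniform point in 3D. The `_phi` stand-in, the array names, and the indexing are placeholders and do not reflect the actual code in `_template_inline.txt`; the point is only the reduction from 3*8**3 = 1536 kernel evaluations per point to 3*8 = 24.

```python
import numpy as np

W = 8  # kernel support per dimension

def _phi(dx):
    # Placeholder kernel; the real code evaluates a Horner-form polynomial.
    return np.exp(-2.3 * dx * dx)

def spread_point_naive(grid, i0, frac, val):
    # Current pattern: 3 kernel evaluations per offset, W**3 offsets
    # -> 3 * 8**3 = 1536 _phi calls per non-uniform point.
    for a in range(W):
        for b in range(W):
            for c in range(W):
                w = _phi(a - frac[0]) * _phi(b - frac[1]) * _phi(c - frac[2])
                grid[i0[0] + a, i0[1] + b, i0[2] + c] += val * w

def spread_point_separable(grid, i0, frac, val):
    # Suggested pattern: evaluate _phi only 3 * 8 = 24 times per point,
    # then multiply the tabulated 1D values inside the offset loop.
    kx = np.array([_phi(a - frac[0]) for a in range(W)])
    ky = np.array([_phi(b - frac[1]) for b in range(W)])
    kz = np.array([_phi(c - frac[2]) for c in range(W)])
    for a in range(W):
        for b in range(W):
            kab = kx[a] * ky[b]
            for c in range(W):
                grid[i0[0] + a, i0[1] + b, i0[2] + c] += val * kab * kz[c]
```

Both variants write the same values into the grid, since the kernel is separable across dimensions and only the products change per offset; this is exactly the kind of restructuring that a JIT compiler such as numba is unlikely to discover on its own.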