Meta-Algorithm: Fold/Parallel Scan different implementation #6925
Replies: 2 comments
-
In the interest of responding quickly rather than comprehensively, have you looked at the There is an rfactor tutorial: https://github.com/halide/Halide/blob/main/tutorial/lesson_18_parallel_associative_reductions.cpp . Alas, using the built-in helpers such as |
Beta Was this translation helpful? Give feedback.
-
I have a CPU vector sum scan implementation that uses Sklansky for the within-vector summation here, in case it helps: https://github.com/halide/Halide/blob/abadams/gaussian_blur_app/apps/gaussian_blur/box_blur_generator.cpp#L241 It could probably be adapted to the other approaches. I place each vector of inputs at the vertices of a hypercube, and then each pass is an update step on that Func, vectorizing over all of the RDom dimensions (each of which is size 2). |
Beta Was this translation helpful? Give feedback.
-
Hello,
I'd to compare different implementation for a prefix-scan which is equivalent of a Fold or FoldList in Mathematica:
https://reference.wolfram.com/language/ref/Fold.html
https://reference.wolfram.com/language/ref/FoldList.html
For the sake of the example I'll focus on FoldAdd '+' (but the question is the same of min, max, ...).
Let say I want:
After parallisation, always produce an atomic_add which is far from optimal.
Based on definition of an "algorithm" of Halide it's as design, so I'd to implement a meta-algorithm with:
cf. https://developer.nvidia.com/gpugems/gpugems3/part-vi-gpu-computing/chapter-39-parallel-prefix-sum-scan-cuda
Now with the current definition of "Algorithm" by Halide, that implies one implementation per "semantic-function".
Semantically I fand to accumulate the whole number in a 1D-Array.
Currently I'm discussion the feasability of that. On my understanding that implies a creation of N-Function which can be scheduled differently with "Ping-pong" buffer or reusing the same buffer. The main issue is 'N' is a consequence of internal-size.
For instance we want to do:
It's an open discussion which could open an higher level of description for generic algorithm.
Best.
Beta Was this translation helpful? Give feedback.
All reactions