[Hexagon DSP] How to fill static buffers created inside Halide #5166
-
Are you sure it's because it's allocated differently and not due to some other reason (e.g. the LUT size is known at compile time, so vlut instructions are getting used in one case but not the other)? I thought memory was memory once it was in the L2 on the DSP. Func::memoize lets you stash a compute_root Func over several runs of a pipeline, so you could try that. Start with the 1.3 ms version that takes the LUT as an input, then schedule the input like so:
Not sure if the memoization cache is supported on Hexagon, but if it is, this should work.
-
Thanks @abadams. I ran into a problem while compiling the generators: "computations which depend on buffer parameters cannot be scheduled compute_cached. Use memoize_tag to provide cache key information for buffer."
-
Halide isn't sure under what circumstances the buffer will change. You need to provide a scalar Expr (e.g. the constant 0, or an integer parameter) for the runtime to use as a cache key so it knows when to recompute the LUT. In this case zero should suffice, as I gather the LUT never changes. Unfortunately the way you use memoize_tag is to wrap the offending expression with it, but that expression is hidden inside the anonymous Func created by Func::in. You'll need to wrap your LUT in a Func like so:
Then use lut_func in the rest of the code instead of actual_lut. More on memoize_tag here: https://halide-lang.org/docs/namespace_halide.html#acc732961d942e7a91291310ee5f972b3 The odd interaction with Func::in isn't mentioned because it has never come up before.
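To make the role of the cache key concrete, here is a minimal sketch in plain C++ of a one-entry cache that recomputes a LUT only when its key changes. It is a conceptual model only, not Halide's runtime or API: `KeyedLutCache`, `get`, and `recompute_count` are hypothetical names invented for this illustration.

```cpp
#include <array>
#include <cassert>
#include <cstdint>
#include <functional>
#include <optional>

// Plain C++ model of a one-entry memoization cache. This is NOT Halide code;
// all names here are hypothetical and exist only for illustration.
class KeyedLutCache {
public:
    using Lut = std::array<uint8_t, 64>;  // 8x8 LUT, as in the thread

    // Returns the LUT cached for `key`. `compute` runs only on a cache miss,
    // i.e. on the first call or when `key` differs from the stored key.
    const Lut &get(int key, const std::function<Lut()> &compute) {
        assert(compute && "compute callback must be callable");
        if (!key_ || *key_ != key) {
            lut_ = compute();
            key_ = key;
            ++recompute_count_;
        }
        return lut_;
    }

    // Number of times the LUT has actually been recomputed.
    int recompute_count() const { return recompute_count_; }

private:
    std::optional<int> key_;  // empty until the first call
    Lut lut_{};
    int recompute_count_ = 0;
};
```

With a constant key such as 0 on every call, `compute` runs once and every later call returns the stored LUT. Changing the key forces a recompute, which is the behaviour the cache key is meant to control.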
-
I did try with:
but it seems to have no impact on the latency. Using the following now throws a new error:
Another thing to note here is that if I use the Halide buffer input (lut) directly, without staging it into a Func first, the latency is even higher. I think in that case the data is fetched from DRAM every time. Staging it into a Func gives the latency gain (maybe in this case the data is stored in the L2 cache). I am not sure why the best latency comes from the static Halide buffer (with constant values) created inside the Halide generator.
-
Bummer, sounds like the memoization cache is unimplemented on Hexagon. But wait, are you saying that just staging eliminates the latency penalty? So the problem is solved?
-
Hi,
I want to use an 8x8 LUT inside the generator, so I passed it in as an ION buffer from the DSP wrapper. It turns out I get 1.3 ms latency for the calculations using the LUT.
Next, I created a static Halide buffer inside the Halide generator and used a LUT with static values. Something like:
I used lut_halide() instead of the previous input buffer coming from the wrapper. Now the latency is reduced to 0.6 ms, with bit-matching output.
The algorithm runs on many frames.
Now, what I want to do is allocate a 64-element buffer inside the Halide generator and fill it with the input (LUT buffer) coming from the DSP wrapper on the first frame only. (The LUT values are known to be constant across all frames.)
Benefit: the latency might be 1.3 ms for the first frame, but it should drop to 0.6 ms for the remaining frames, since I will be reusing the same stack memory.
However, I am not able to achieve this. Please help!
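The behaviour being asked for can be sketched in plain C++ as follows. This only illustrates the intent (copy the LUT on the first frame, reuse the copy afterwards). It is not Halide code, and it is an assumption that nothing in a Halide generator expresses this directly; `FrameProcessor`, `process`, and `fill_count` are hypothetical names.

```cpp
#include <array>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

constexpr std::size_t kLutSize = 64;  // 8x8 LUT, as described above

// Hypothetical per-frame processor (plain C++, not a Halide generator).
// The LUT from the wrapper is copied into local storage on the first frame
// only; later frames read the local copy and ignore lut_in.
struct FrameProcessor {
    std::array<uint8_t, kLutSize> lut{};
    bool lut_ready = false;
    int fill_count = 0;  // how many times the local LUT was filled

    // Looks up `index` in the locally stored LUT, filling it if needed.
    uint8_t process(const uint8_t *lut_in, uint8_t index) {
        assert(lut_in != nullptr);
        if (!lut_ready) {
            std::memcpy(lut.data(), lut_in, kLutSize);
            lut_ready = true;
            ++fill_count;
        }
        return lut[index % kLutSize];
    }
};
```

Calling `process` once per frame on the same `FrameProcessor` pays the copy cost on the first frame only. The sketch relies on the LUT values staying constant across frames, as stated above; later changes to the wrapper's buffer are not picked up.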