[Hexagon DSP] How to fill static buffers created inside Halide #5166

ImageHandler · 2020-08-08T22:11:46Z

Hi,
I want to use an 8x8 LUT inside the generator, so I passed it in as an ION buffer from the DSP wrapper. With that setup, the calculations that use the LUT take 1.3 ms.
I then created a static Halide buffer inside the generator, backed by a LUT with fixed values. Something like:

static short LUT[8*8] = {1,2,3....,64 values}; //(the values are similar to those in the LUT buffer, but only they are now constant and assigned manually)
static Halide::Buffer<short> lut_halide(LUT);

I used lut_halide() instead of the previous input buffer. The latency dropped to 0.6 ms, and the output is bit-exact.

The algorithm runs on many frames.
What I want to do now is allocate a 64-element buffer inside the Halide generator and fill it from the input LUT buffer coming from the DSP wrapper on the first frame only (the LUT values are known to be constant across frames).

Benefit: the first frame might still take 1.3 ms, but the remaining frames should drop to 0.6 ms because they reuse the same stack memory.

However, I am not able to achieve this. Please help!

abadams (Maintainer) · 2020-08-08T22:33:08Z

Are you sure it's because it's allocated differently and not due to some other reason (e.g. the LUT size is known at compile-time and so vlut instructions are getting used in one case but not the other)? I thought memory was memory once it was in the L2 on the dsp.

Func::memoize lets you stash a compute_root Func over several runs of a pipeline, so you could try that. Start with the 1.3ms version that takes the lut as an input, then schedule the input like so:

lut_input.in().compute_root().memoize();

Not sure if the memoization cache is supported on hexagon, but if it is this should work.

ImageHandler (Author) · 2020-08-09T11:43:40Z

Thanks @abadams. I ran into a problem while compiling the generator: "computations which depend on buffer parameters cannot be scheduled compute_cached. Use memoize_tag to provide cache key information for buffer."
Also, there are not enough examples to understand Halide's keywords and APIs. Is there anywhere I could get help to make fuller use of Halide?

abadams (Maintainer) · 2020-08-10T04:21:18Z

Halide isn't sure under what circumstances the buffer will change. You need to provide a scalar Expr (e.g. the constant 0, or an integer parameter) for the runtime to use as a cache key so it knows when to recompute the lut. In this case zero should suffice, as I gather the LUT never changes.

Unfortunately the way you use memoize_tag is to wrap the offending expression with it, but that expression is hidden inside the anonymous Func created by Func::in. You'll need to wrap your lut in a Func like so:

Func lut_func;
lut_func(x) = memoize_tag(actual_lut(x), Expr(0));

Then use lut_func in the rest of the code instead of actual_lut.

More on memoize_tag here: https://halide-lang.org/docs/namespace_halide.html#acc732961d942e7a91291310ee5f972b3

The odd interaction with Func::in isn't mentioned because it has never come up before.

ImageHandler (Author) · 2020-08-10T10:43:40Z

I tried:
lut_func(x) = memoize_tag(actual_lut(x), Expr(0));

but it seems to have no impact on the latency. Using the following now throws a new error:
lut_input.in().compute_root().memoize();
"External function halide_memoization_cache_lookup is marked as taking user_context, but it's not in the runtime module. Check if the runtime_api.cpp need to be rebuilt"

Another thing to note: if I use the Halide buffer input (the LUT) directly without staging it into a Func first, the latency is even higher. I think in that case the data is fetched from DRAM every time. Staging it into a Func gives the latency gain (maybe because the data then sits in the L2 cache). I am not sure why the static Halide buffer (with constant values) created inside the generator gives the best latency.

abadams (Maintainer) · 2020-08-10T16:42:12Z

Bummer, sounds like the memoization cache is unimplemented on Hexagon.

But wait, are you saying that just staging eliminates the latency penalty? So the problem is solved?

8 replies

ImageHandler Aug 11, 2020
Author

Yeah, the lut size is 8*8.
What do you mean by constant overhead by ion allocations?
I think the ION allocations are done from the cpu side and it should not impact the latency while being on dsp. FYI this is just the latency for the dsp run (without rpc overhead)

pranavb-ca Aug 19, 2020
Collaborator

Sorry to jump in on this late. I'd like to focus on checking whether you are in fact getting zero-copy buffers. How are you allocating the 64-element LUT?

ImageHandler Aug 19, 2020
Author

As mentioned before, one way is to copy the elements from the CPU into an ION buffer and pass that to the Halide generator. The second is to create stack memory inside the generator and provide the values manually.

pranavb-ca Aug 20, 2020
Collaborator

Are you allocating the ion buffer yourself? Or are you using the hexagon device interface?

ImageHandler Aug 21, 2020
Author

Allocating the ION buffer manually. I'm using standalone mode for Halide HVX execution.
