Prototype `create_cache` for 3D `prolong2mortars!` #9

huiyuxie · 2023-07-28T22:49:40Z

First, I should claim that this issue can be applied not only to prolong2mortars! for 3D, but also calc_mortar_flux! for 2D and 3D, and calc_volume_integral! - volume_integral::VolumeIntegralShockCapturingHG for 1D, 2D, and 3D. But I ran into this issue first when trying to prototype 3D prolong2mortars!, thus I use this case as the example in the following content.

3D prolong2mortars! would use something from cache by index like Threads.threadid, see https://github.com/trixi-framework/Trixi.jl/blob/e1e680ca8574acd10daa2e5bc5e1f49e1ce008f9/src/solvers/dgsem_tree/dg_3d.jl#L845-L846. The variable fstar_tmp1 we get is further used in https://github.com/trixi-framework/Trixi.jl/blob/e1e680ca8574acd10daa2e5bc5e1f49e1ce008f9/src/solvers/dgsem_tree/dg_3d.jl#L1013-L1024. And these functions can be traced back here https://github.com/huiyuxie/trixi_cuda/blob/ff81a6eedd31f8ebf5a15f0a6e91d833562a008a/trixi/src/solvers/dgsem/interpolation.jl#L227-L254 The fstar_tmp1 served like a temporary container and will be rewritten with new data later.

fstar_tmp1 is initialized with undefined values in function create_cache, see https://github.com/huiyuxie/trixi_cuda/blob/ff81a6eedd31f8ebf5a15f0a6e91d833562a008a/trixi/src/solvers/dgsem_tree/dg_3d.jl#L101-L118 If I understand correctly, fstar_tmp1 and other values initialized in create_cache are just temporary data holders and will be rewritten later in computation (as otherwise, they will not be initialized with undefined values). So the useful data that is going to be passed via these variables (fstar_tmp1 etc.) should be uEltype, nvariables(equations), nnodes(mortar_l2), and nnodes(mortar_l2) in this specific example.

Here comes the problem. These values seem can also be acquired in prolong2mortars! without calling create_cache (i.e. something like fstar_tmp1 can be directly created in prolong2mortars! without create_cache, so why need create more functions to create such temporary variables? For me, temporary variables can just be created in the process (i.e., no extra functions) unless they will be used multiple times.

Further, I am unsure about whether the create_cache function needs a GPU prototype for kernels like prolong2mortars!, etc. Both ways should work successfully but the performance will be different. To be clear, I state both two ways again here (given the specific example I mentioned above):

Prototype create_cache to create temporary variables (values) in parallel. SemidiscretizationHyperbolic will invoke create_cache and then temporary variables will finally retrieved in prolong2mortars! via cache.fstar_tmp1_threaded.
Remove relevant create_cache and create these temporary variables directly in kernel call like cuda_prolong2mortars!, so there is no need to retrieve temporary variables via cache.fstar_tmp1_threaded.

In order to achieve better performance, I prefer way 2 but there may cause some issues (like if you create some types with specific field but the field will not be initialized). For better align with the CPU code, way 1 should be better. So which way should I choose?

If there is no replies/suggestions, I would go with the way I like best.

The text was updated successfully, but these errors were encountered:

ranocha · 2023-08-07T11:58:54Z

If I understand correctly, fstar_tmp1 and other values initialized in create_cache() are just temporary data holders and will be rewritten later in computation (as otherwise, they will not be initialized with undefined values).

That's correct 👍

Here comes the problem. These values seem can also be acquired in prolong2mortars!() without calling create_cache() (i.e. something like fstar_tmp1 can be directly created in prolong2mortars!() without create_cache(), so why need create more functions to create such temporary variables? For me, temporary variables can just be created in the process (i.e., no extra functions) unless they will be used multiple times.

Allocations of temporary arrays like these puts more pressure on the GC and impacts performance. That's why we have decided to pre-allocate them in create_cache - which is called when the semidiscretization semi is constructed.

You are free to choose a way to implement this on GPUs. If I were you, I would start with something simple to make it work - maybe just allocate everything inside prolong2mortars! etc. as you suggested. Then, when everything works and tests are in place, you could run benchmarks and try to avoid allocations by adapting create_cache if necessary. I'm not sure whether allocations will work inside of the GPU kernels - since that's typically not supported and one of the obstacles you encountered earlier if I remember correctly.

huiyuxie · 2023-08-10T04:45:46Z

Thanks!

I'm not sure whether allocations will work inside of the GPU kernels - since that's typically not supported and one of the obstacles you encountered earlier if I remember correctly.

That's not something has to worry about as the temporary variable can be initialized right before we launch the kernels and we access each element from it inside GPU kernels (avoid memory allocations).

huiyuxie · 2023-08-18T08:23:05Z

I asked how to deal with Colon() (issue related to calling view() in kernel) here https://discourse.julialang.org/t/how-to-make-colon-stable-in-cuda-kernel/102934.

huiyuxie · 2024-08-28T06:19:36Z

The colon issue can be easily resolved with using tricks with matrix indexing.

Then, when everything works and tests are in place, you could run benchmarks and try to avoid allocations by adapting create_cache if necessary.

This will be added to the general kernel optimization issue here #22.

huiyuxie self-assigned this May 13, 2024

huiyuxie added the bug Something isn't working label May 13, 2024

huiyuxie removed their assignment May 13, 2024

huiyuxie changed the title ~~Prototype create_cache() for 3D prolong2mortars!()~~ Prototype create_cache for 3D prolong2mortars! Aug 22, 2024

huiyuxie closed this as completed in aac5e37 Aug 28, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Prototype `create_cache` for 3D `prolong2mortars!` #9

Prototype `create_cache` for 3D `prolong2mortars!` #9

huiyuxie commented Jul 28, 2023 •

edited

Loading

ranocha commented Aug 7, 2023 •

edited

Loading

huiyuxie commented Aug 10, 2023 •

edited

Loading

huiyuxie commented Aug 18, 2023

huiyuxie commented Aug 28, 2024

Prototype create_cache for 3D prolong2mortars! #9

Prototype create_cache for 3D prolong2mortars! #9

Comments

huiyuxie commented Jul 28, 2023 • edited Loading

ranocha commented Aug 7, 2023 • edited Loading

huiyuxie commented Aug 10, 2023 • edited Loading

huiyuxie commented Aug 18, 2023

huiyuxie commented Aug 28, 2024

Prototype `create_cache` for 3D `prolong2mortars!` #9

Prototype `create_cache` for 3D `prolong2mortars!` #9

huiyuxie commented Jul 28, 2023 •

edited

Loading

ranocha commented Aug 7, 2023 •

edited

Loading

huiyuxie commented Aug 10, 2023 •

edited

Loading