Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Prototype create_cache for 3D prolong2mortars! #9

Closed
huiyuxie opened this issue Jul 28, 2023 · 4 comments
Closed

Prototype create_cache for 3D prolong2mortars! #9

huiyuxie opened this issue Jul 28, 2023 · 4 comments
Labels
bug Something isn't working

Comments

@huiyuxie
Copy link
Member

huiyuxie commented Jul 28, 2023

First, I should claim that this issue can be applied not only to prolong2mortars! for 3D, but also calc_mortar_flux! for 2D and 3D, and calc_volume_integral! - volume_integral::VolumeIntegralShockCapturingHG for 1D, 2D, and 3D. But I ran into this issue first when trying to prototype 3D prolong2mortars!, thus I use this case as the example in the following content.

3D prolong2mortars! would use something from cache by index like Threads.threadid, see https://github.com/trixi-framework/Trixi.jl/blob/e1e680ca8574acd10daa2e5bc5e1f49e1ce008f9/src/solvers/dgsem_tree/dg_3d.jl#L845-L846. The variable fstar_tmp1 we get is further used in https://github.com/trixi-framework/Trixi.jl/blob/e1e680ca8574acd10daa2e5bc5e1f49e1ce008f9/src/solvers/dgsem_tree/dg_3d.jl#L1013-L1024. And these functions can be traced back here https://github.com/huiyuxie/trixi_cuda/blob/ff81a6eedd31f8ebf5a15f0a6e91d833562a008a/trixi/src/solvers/dgsem/interpolation.jl#L227-L254 The fstar_tmp1 served like a temporary container and will be rewritten with new data later.

fstar_tmp1 is initialized with undefined values in function create_cache, see https://github.com/huiyuxie/trixi_cuda/blob/ff81a6eedd31f8ebf5a15f0a6e91d833562a008a/trixi/src/solvers/dgsem_tree/dg_3d.jl#L101-L118 If I understand correctly, fstar_tmp1 and other values initialized in create_cache are just temporary data holders and will be rewritten later in computation (as otherwise, they will not be initialized with undefined values). So the useful data that is going to be passed via these variables (fstar_tmp1 etc.) should be uEltype, nvariables(equations), nnodes(mortar_l2), and nnodes(mortar_l2) in this specific example.

Here comes the problem. These values seem can also be acquired in prolong2mortars! without calling create_cache (i.e. something like fstar_tmp1 can be directly created in prolong2mortars! without create_cache, so why need create more functions to create such temporary variables? For me, temporary variables can just be created in the process (i.e., no extra functions) unless they will be used multiple times.

Further, I am unsure about whether the create_cache function needs a GPU prototype for kernels like prolong2mortars!, etc. Both ways should work successfully but the performance will be different. To be clear, I state both two ways again here (given the specific example I mentioned above):

  1. Prototype create_cache to create temporary variables (values) in parallel. SemidiscretizationHyperbolic will invoke create_cache and then temporary variables will finally retrieved in prolong2mortars! via cache.fstar_tmp1_threaded.
  2. Remove relevant create_cache and create these temporary variables directly in kernel call like cuda_prolong2mortars!, so there is no need to retrieve temporary variables via cache.fstar_tmp1_threaded.

In order to achieve better performance, I prefer way 2 but there may cause some issues (like if you create some types with specific field but the field will not be initialized). For better align with the CPU code, way 1 should be better. So which way should I choose?

If there is no replies/suggestions, I would go with the way I like best.

@ranocha
Copy link
Collaborator

ranocha commented Aug 7, 2023

If I understand correctly, fstar_tmp1 and other values initialized in create_cache() are just temporary data holders and will be rewritten later in computation (as otherwise, they will not be initialized with undefined values).

That's correct 👍

Here comes the problem. These values seem can also be acquired in prolong2mortars!() without calling create_cache() (i.e. something like fstar_tmp1 can be directly created in prolong2mortars!() without create_cache(), so why need create more functions to create such temporary variables? For me, temporary variables can just be created in the process (i.e., no extra functions) unless they will be used multiple times.

Allocations of temporary arrays like these puts more pressure on the GC and impacts performance. That's why we have decided to pre-allocate them in create_cache - which is called when the semidiscretization semi is constructed.

You are free to choose a way to implement this on GPUs. If I were you, I would start with something simple to make it work - maybe just allocate everything inside prolong2mortars! etc. as you suggested. Then, when everything works and tests are in place, you could run benchmarks and try to avoid allocations by adapting create_cache if necessary. I'm not sure whether allocations will work inside of the GPU kernels - since that's typically not supported and one of the obstacles you encountered earlier if I remember correctly.

@huiyuxie
Copy link
Member Author

huiyuxie commented Aug 10, 2023

Thanks!

I'm not sure whether allocations will work inside of the GPU kernels - since that's typically not supported and one of the obstacles you encountered earlier if I remember correctly.

That's not something has to worry about as the temporary variable can be initialized right before we launch the kernels and we access each element from it inside GPU kernels (avoid memory allocations).

@huiyuxie
Copy link
Member Author

I asked how to deal with Colon() (issue related to calling view() in kernel) here https://discourse.julialang.org/t/how-to-make-colon-stable-in-cuda-kernel/102934.

@huiyuxie huiyuxie self-assigned this May 13, 2024
@huiyuxie huiyuxie added the bug Something isn't working label May 13, 2024
@huiyuxie huiyuxie removed their assignment May 13, 2024
@huiyuxie huiyuxie changed the title Prototype create_cache() for 3D prolong2mortars!() Prototype create_cache for 3D prolong2mortars! Aug 22, 2024
@huiyuxie
Copy link
Member Author

The colon issue can be easily resolved with using tricks with matrix indexing.

Then, when everything works and tests are in place, you could run benchmarks and try to avoid allocations by adapting create_cache if necessary.

This will be added to the general kernel optimization issue here #22.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
bug Something isn't working
Projects
None yet
Development

No branches or pull requests

2 participants