Prototype `create_cache` for 3D `prolong2mortars!` #9
Comments
That's correct 👍
Allocations of temporary arrays like these put more pressure on the GC and impact performance. That's why we have decided to pre-allocate them in [...]. You are free to choose a way to implement this on GPUs. If I were you, I would start with something simple to make it work - maybe just allocate everything inside
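To illustrate the point about GC pressure, here is a minimal CPU-only sketch that compares allocating a temporary on every call with reusing a pre-allocated buffer. The function names `scaled_sum_alloc` and `scaled_sum_prealloc!` are hypothetical and not part of Trixi.jl; only `Base` functionality is used.

```julia
# Hypothetical helper: allocates a fresh temporary array on every call.
function scaled_sum_alloc(u::Vector{Float64})
    tmp = similar(u)      # new heap allocation each time
    @. tmp = 2 * u
    return sum(tmp)
end

# Same computation, but the temporary is allocated once by the caller.
function scaled_sum_prealloc!(tmp::Vector{Float64}, u::Vector{Float64})
    @. tmp = 2 * u        # in-place broadcast, no new array
    return sum(tmp)
end

u = rand(1_000)
tmp = similar(u)

# Warm up both functions so compilation is not measured.
scaled_sum_alloc(u)
scaled_sum_prealloc!(tmp, u)

bytes_alloc = @allocated scaled_sum_alloc(u)
bytes_prealloc = @allocated scaled_sum_prealloc!(tmp, u)
println("allocating version:    ", bytes_alloc, " bytes")
println("pre-allocated version: ", bytes_prealloc, " bytes")
```

The allocating version has to request at least the 8000 bytes needed for the 1000 `Float64` entries on each call, while the pre-allocated version reuses the same buffer. In a function that is called once per element and per time step, this difference adds up.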
Thanks!
That's not something one has to worry about, as the temporary variable can be initialized right before we launch the kernels, and we access each element of it inside the GPU kernels (avoiding memory allocations).
I asked how to deal with |
The colon issue can be easily resolved using tricks with matrix indexing.
This will be added to the general kernel optimization issue here #22. |
First, I should note that this issue applies not only to `prolong2mortars!` for 3D, but also to `calc_mortar_flux!` for 2D and 3D, and to `calc_volume_integral!` (with `volume_integral::VolumeIntegralShockCapturingHG`) for 1D, 2D, and 3D. But I ran into this issue first when trying to prototype 3D `prolong2mortars!`, so I use this case as the example in the following.

3D
`prolong2mortars!` uses something from `cache`, indexed by something like `Threads.threadid`, see https://github.com/trixi-framework/Trixi.jl/blob/e1e680ca8574acd10daa2e5bc5e1f49e1ce008f9/src/solvers/dgsem_tree/dg_3d.jl#L845-L846. The variable `fstar_tmp1` we get is further used in https://github.com/trixi-framework/Trixi.jl/blob/e1e680ca8574acd10daa2e5bc5e1f49e1ce008f9/src/solvers/dgsem_tree/dg_3d.jl#L1013-L1024. And these functions can be traced back here: https://github.com/huiyuxie/trixi_cuda/blob/ff81a6eedd31f8ebf5a15f0a6e91d833562a008a/trixi/src/solvers/dgsem/interpolation.jl#L227-L254. `fstar_tmp1` serves as a temporary container and will be overwritten with new data later.

`fstar_tmp1` is initialized with undefined values in the function `create_cache`, see https://github.com/huiyuxie/trixi_cuda/blob/ff81a6eedd31f8ebf5a15f0a6e91d833562a008a/trixi/src/solvers/dgsem_tree/dg_3d.jl#L101-L118. If I understand correctly, `fstar_tmp1` and the other values initialized in `create_cache` are just temporary data holders and will be overwritten later in the computation (otherwise, they would not be initialized with undefined values). So the useful data passed via these variables (`fstar_tmp1` etc.) should be `uEltype`, `nvariables(equations)`, `nnodes(mortar_l2)`, and `nnodes(mortar_l2)` in this specific example.

Here comes the problem. These values, it seems, can also be obtained in
`prolong2mortars!` without calling `create_cache` (i.e., something like `fstar_tmp1` can be created directly in `prolong2mortars!` without `create_cache`), so why create extra functions for such temporary variables? To me, temporary variables can just be created on the fly (i.e., with no extra functions) unless they will be used multiple times.

Further, I am unsure about whether the
`create_cache` function needs a GPU prototype for kernels like `prolong2mortars!`, etc. Both ways should work, but the performance will differ. To be clear, I state both ways again here (given the specific example I mentioned above):

1. Prototype `create_cache` to create temporary variables (values) in parallel. `SemidiscretizationHyperbolic` will invoke `create_cache`, and the temporary variables will finally be retrieved in `prolong2mortars!` via `cache.fstar_tmp1_threaded`.
2. Skip `create_cache` and create these temporary variables directly in a kernel call like `cuda_prolong2mortars!`, so there is no need to retrieve temporary variables via `cache.fstar_tmp1_threaded`.

To achieve better performance, I prefer way 2, but it may cause some issues (like if you create some types with a specific field but the field will not be initialized). To better align with the CPU code, way 1 should be better. So which way should I choose?
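To make the two ways concrete, here is a minimal CPU-only sketch in plain Julia. All function names (`create_cache_sketch`, `work_with_cache`, `work_without_cache`) are hypothetical stand-ins and not the actual Trixi.jl or trixi_cuda code; the array shape follows the `nvariables(equations)` × `nnodes(mortar_l2)` × `nnodes(mortar_l2)` layout described above, with made-up sizes.

```julia
# Way 1: pre-allocate one temporary per thread and store it in a cache.
# `create_cache_sketch` is a hypothetical stand-in for `create_cache`.
function create_cache_sketch(::Type{T}, nvars::Int, nnodes::Int) where {T}
    fstar_tmp1_threaded = [Array{T, 3}(undef, nvars, nnodes, nnodes)
                           for _ in 1:Threads.nthreads()]
    return (; fstar_tmp1_threaded)
end

function work_with_cache(cache)
    # The CPU code picks the buffer by thread id; index 1 is used here
    # to keep the sketch independent of the threading setup.
    fstar_tmp1 = cache.fstar_tmp1_threaded[1]
    fill!(fstar_tmp1, one(eltype(fstar_tmp1)))  # overwrite undefined values
    return sum(fstar_tmp1)
end

# Way 2: create the temporary where it is used, from the same inputs.
function work_without_cache(::Type{T}, nvars::Int, nnodes::Int) where {T}
    fstar_tmp1 = Array{T, 3}(undef, nvars, nnodes, nnodes)
    fill!(fstar_tmp1, one(T))
    return sum(fstar_tmp1)
end

cache = create_cache_sketch(Float64, 5, 4)  # assumed sizes: 5 variables, 4 nodes
result1 = work_with_cache(cache)
result2 = work_without_cache(Float64, 5, 4)
println(result1 == result2)
```

Both ways compute the same result; they differ only in when the temporary is allocated. Way 1 pays the allocation once and reuses the buffer on every call, while way 2 allocates on each call but needs no extra cache field.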
If there are no replies/suggestions, I will go with the way I like best.