GPU acceleration for meep #2121
smartalecH started this conversation in Ideas
Many have expressed interest in enabling GPU acceleration within meep. For a long time, there was a debate regarding the practicality of such an endeavor, especially since it would require so much work (and hardware has been changing rather rapidly). However, recent results have shown that non-CPU accelerators are a great option for many FDTD codes (e.g. see gprMax and tidy3d).
Before anyone jumps into this project (I know a few people who are ready to do so), I want to outline some important considerations. Hopefully, we can use this thread to pool our thoughts.
Things to be aware of
Many users drive their simulations by calling step() from within a Python loop; on an accelerator, every such call would force a synchronization between the device and the host, so the Python-facing stepping interface needs attention, not just the inner update loops.

My recommended approach
As I mentioned in #1719, the easiest way to enable meep on accelerators is to use the existing OpenMP directive framework we implemented a while back. This wouldn't require any modifications to the macros, chunk routines, or code generation, but the timestepping and convergence conditions would need to be modified to mitigate communication between the CPU and the accelerator. Unfortunately, the current OpenMP implementation is slower than the equivalent MPI implementation (see here). The performance bottlenecks should first be identified and resolved before moving to the next step involving accelerator offloading.
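To make the communication point concrete, here is a minimal sketch of the pattern using OpenMP target directives. This is a toy 1-D update, not meep's actual chunk/field machinery; the names (ex, hy, c, n, steps) are hypothetical placeholders. The key idea is the target data region, which keeps the fields resident on the accelerator across all timesteps instead of shipping them back and forth on every step:

```cpp
// Hypothetical 1-D FDTD-style update; placeholder names, not meep's code.
// The `target data` region maps the fields onto the accelerator once;
// every timestep then runs on the device, and data only moves back to
// the host when the region ends.
void run(double *ex, double *hy, double c, int n, int steps) {
#pragma omp target data map(tofrom : ex[0:n]) map(to : hy[0:n])
  {
    for (int t = 0; t < steps; ++t) {
      // One timestep, executed entirely on the device; no host<->device
      // traffic occurs inside this loop.
#pragma omp target teams distribute parallel for
      for (int i = 1; i < n; ++i)
        ex[i] += c * (hy[i] - hy[i - 1]);
    }
  }
}
```

A convergence check would sit outside the inner loop and use an `omp target update` to copy back only the values it actually needs, rather than the whole field arrays; that restructuring is what the current timestepping and convergence code would require.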
A more elegant approach, however, would be to use an abstraction layer: an external library that interfaces with many HPC architectures and accelerators and is regularly updated and maintained. This way, a single timestepping "kernel" could be written once and used on all architectures, with the target backend selected by a compile-time flag.
While similar to the OpenMP directive approach above, this approach provides more control (and the potential for better performance).
Kokkos provides a great framework for these kinds of applications.
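As a sketch of what that could look like (again with hypothetical names and a toy 1-D update, not meep's actual data structures), the same kernel written once against Kokkos runs on whichever backend (OpenMP, CUDA, HIP, SYCL, ...) Kokkos was configured with at build time:

```cpp
#include <Kokkos_Core.hpp>

// Toy 1-D update written once; Kokkos dispatches it to the backend
// selected at configure/compile time, with no backend-specific code here.
void step_ex(Kokkos::View<double *> ex, Kokkos::View<const double *> hy,
             double c) {
  Kokkos::parallel_for(
      "step_ex", Kokkos::RangePolicy<>(1, ex.extent(0)),
      KOKKOS_LAMBDA(const int i) { ex(i) += c * (hy(i) - hy(i - 1)); });
}

int main(int argc, char *argv[]) {
  Kokkos::initialize(argc, argv);
  {
    // Views are allocated in the default memory space of the chosen
    // backend (e.g. GPU memory for CUDA), so the fields stay on the
    // device for the entire run without explicit transfers.
    const int n = 1 << 20;
    Kokkos::View<double *> ex("ex", n), hy("hy", n);
    for (int t = 0; t < 1000; ++t)
      step_ex(ex, hy, 0.5);
    Kokkos::fence();
  }
  Kokkos::finalize();
}
```

Whether Kokkos specifically (versus, say, RAJA or plain OpenMP offload) is the right dependency to take on is exactly the kind of tradeoff worth discussing in this thread.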
What we shouldn't do is write a bunch of customized CUDA kernels that won't ever be compatible with additional features...
-
This sounds very interesting; it seems that meep can be run on a GPU. Could you please provide an example? Thank you very much.