GPU acceleration for meep #2121
smartalecH started this conversation in Ideas
Many have expressed interest in enabling GPU acceleration within meep. For a long time, there was a debate regarding the practicality of such an endeavor, especially since it would require so much work (and hardware has been changing rather rapidly). However, recent results have shown that non-CPU accelerators are a great option for many FDTD codes (e.g. see gprMax and tidy3d).
Before anyone jumps into this project (I know a few people who are ready to do so), I want to outline some important considerations. Hopefully, we can use this thread to pool our thoughts.
Things to be aware of
Many users drive their simulations by calling step() from within a Python loop; on an accelerator, every such call would force a synchronization between the device and the host, so the Python-facing stepping interface needs attention, not just the inner update loops.

My recommended approach
As I mentioned in #1719, the easiest way to enable meep on accelerators is to use the existing OpenMP directive framework we implemented a while back. This wouldn't require any modifications to the macros, chunk routines, or code generation, but the timestepping and convergence conditions would need to be modified to mitigate communication between the CPU and the accelerator. Unfortunately, the current OpenMP implementation is slower than the equivalent MPI implementation (see here). The performance bottlenecks should first be identified and resolved before moving to the next step involving accelerator offloading.
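To make the communication point concrete, here is a minimal sketch of the pattern using OpenMP target directives. This is a toy 1-D update, not meep's actual chunk/field machinery; the names (ex, hy, c, n, steps) are hypothetical placeholders. The key idea is the target data region, which keeps the fields resident on the accelerator across all timesteps instead of shipping them back and forth on every step:

```cpp
// Hypothetical 1-D FDTD-style update; placeholder names, not meep's code.
// The `target data` region maps the fields onto the accelerator once;
// every timestep then runs on the device, and data only moves back to
// the host when the region ends.
void run(double *ex, double *hy, double c, int n, int steps) {
#pragma omp target data map(tofrom : ex[0:n]) map(to : hy[0:n])
  {
    for (int t = 0; t < steps; ++t) {
      // One timestep, executed entirely on the device; no host<->device
      // traffic occurs inside this loop.
#pragma omp target teams distribute parallel for
      for (int i = 1; i < n; ++i)
        ex[i] += c * (hy[i] - hy[i - 1]);
    }
  }
}
```

A convergence check would sit outside the inner loop and use an `omp target update` to copy back only the values it actually needs, rather than the whole field arrays; that restructuring is what the current timestepping and convergence code would require.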
A more elegant approach, however, would be to use an abstraction layer: an external library that interfaces with many HPC architectures and accelerators and is regularly updated and maintained. This way, a single timestepping "kernel" could be written once and used on all architectures, with the target backend selected by a compile-time flag.
While similar to the OpenMP directive approach above, this approach provides more control (and the potential for better performance).
Kokkos provides a great framework for these kinds of applications.
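As a sketch of what that could look like (again with hypothetical names and a toy 1-D update, not meep's actual data structures), the same kernel written once against Kokkos runs on whichever backend (OpenMP, CUDA, HIP, SYCL, ...) Kokkos was configured with at build time:

```cpp
#include <Kokkos_Core.hpp>

// Toy 1-D update written once; Kokkos dispatches it to the backend
// selected at configure/compile time, with no backend-specific code here.
void step_ex(Kokkos::View<double *> ex, Kokkos::View<const double *> hy,
             double c) {
  Kokkos::parallel_for(
      "step_ex", Kokkos::RangePolicy<>(1, ex.extent(0)),
      KOKKOS_LAMBDA(const int i) { ex(i) += c * (hy(i) - hy(i - 1)); });
}

int main(int argc, char *argv[]) {
  Kokkos::initialize(argc, argv);
  {
    // Views are allocated in the default memory space of the chosen
    // backend (e.g. GPU memory for CUDA), so the fields stay on the
    // device for the entire run without explicit transfers.
    const int n = 1 << 20;
    Kokkos::View<double *> ex("ex", n), hy("hy", n);
    for (int t = 0; t < 1000; ++t)
      step_ex(ex, hy, 0.5);
    Kokkos::fence();
  }
  Kokkos::finalize();
}
```

Whether Kokkos specifically (versus, say, RAJA or plain OpenMP offload) is the right dependency to take on is exactly the kind of tradeoff worth discussing in this thread.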
What we shouldn't do is write a bunch of customized CUDA kernels that won't ever be compatible with additional features...
-
This sounds very interesting; it seems that meep can be run on a GPU. Could you please provide an example? Thank you very much.