The Kokkos Lectures: Module 1 Q&A
Christian Trott edited this page Jul 24, 2020 · 1 revision
- The minimum CMake version is currently 3.10 as of Kokkos 3.1.1, but we will likely raise it soon.
- Most of Kokkos is headers. A few simple things get pre-compiled, but that takes only a few seconds.
- We have not tried, but if they can mount a Docker image it might work. The reason we have the AWS instances is that we can exert pretty fine-grained control over the configuration so that our tutorial users don’t run into any unnecessarily complex hardware issues.
- No current plan to add to conda-forge.
How “heavy” is Kokkos? How much extra am I going to have to deal with if I adopt Kokkos into my project?
- Kokkos is mostly header files, but you do need to build it and add it to your CMake or other build system. Compilation times will likely increase, but by how much really depends on your code. You can't get something for nothing, of course, so you have to weigh the developer time needed to write portable code without Kokkos against the extra compilation time of code written with Kokkos. Most of the extra cost is comparable to using standard algorithms and other template-based abstraction layers.
- PETSc can perform some operations on GPUs and in some modes use 'user' pointers to data on the GPU that may be coming from a different GPU backend (e.g., Kokkos). Our data structures allow you to get to the raw pointers, and thus generally you can achieve full interoperability with any non-Kokkos library which is implemented directly in one of our backend models (CUDA, HIP, OpenMP). For PETSc GPU integration, look here: https://www.mcs.anl.gov/petsc/features/gpus.html
- There are two aspects to this: 1) Yes, we rely on auto-vectorization of loops but help it along with things like #pragma omp simd or #pragma ivdep. 2) We have a SIMD package coming up (currently available standalone at https://github.com/kokkos/simd-math) which introduces portable vector types. These map directly to vector intrinsics and help, for example, with writing outer-loop vectorization code. We will cover this later in the lecture series.
- A single process can only use one GPU, but you can use multiple processes using different GPUs. Note this answer may change in the not too distant future.
If HIP is enabled on CUDA, then will it create an executable which is a HIP version running on CUDA?
- HIP can target NVIDIA GPUs via CUDA but we do not support that. If you target NVIDIA GPUs, use the CUDA backend. If you want to run on AMD, use HIP.
Does Kokkos support AMD GPU (backend with HIP) now? If not fully, what kind of capabilities does Kokkos have now?
- More or less everything except for tasking and three-level parallelism should work.
- You tell it, but it defaults to sum. We have a concept of a Kokkos reducer that allows you to customize every aspect of that.
- A Functor can have an arbitrary amount of state. So for the actual loop body you can do whatever you want, also for a reduction. We also support reducers which can have state of their own, and which are used to combine the various thread contributions in the runtime. Note: we are not sure yet how to support that in OpenMP Target, and it might be very slow in the end. For the other backends, reducers with state are fine.
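As a rough illustration of the two ideas in this answer, here is a minimal sketch in plain standard C++ rather than Kokkos itself. The names ScaledValue, MinReducer and sketch_reduce are hypothetical and exist only for this example; the init/join shape is loosely modeled on how Kokkos reducers are described, and the driver runs sequentially as a stand-in for the runtime.

```cpp
#include <cassert>
#include <cstddef>
#include <limits>
#include <vector>

// Functor with state: it carries a data pointer and a scale factor.
struct ScaledValue {
  const double* data;
  double scale;
  // Loop body: fold scale * data[i] into the caller-provided private value.
  void operator()(std::size_t i, double& localMin) const {
    const double v = scale * data[i];
    if (v < localMin) localMin = v;
  }
};

// Hypothetical reducer (plain C++, not Kokkos::Min): init sets the identity,
// join combines two partial results.
struct MinReducer {
  void init(double& v) const { v = std::numeric_limits<double>::max(); }
  void join(double& dest, const double& src) const {
    if (src < dest) dest = src;
  }
};

// Hypothetical sequential driver standing in for the runtime: it splits
// [0, n) into `chunks` strided pieces, reduces each piece into a private
// value, then joins the private values into the final result.
template <class Functor, class Reducer>
double sketch_reduce(std::size_t n, std::size_t chunks, const Functor& f,
                     const Reducer& r) {
  double result;
  r.init(result);
  for (std::size_t c = 0; c < chunks; ++c) {
    double local;
    r.init(local);
    for (std::size_t i = c; i < n; i += chunks) f(i, local);
    r.join(result, local);
  }
  return result;
}
```

The functor only ever touches its private value; combining private values is left entirely to the reducer, which is the separation of roles the answer describes.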
In the example that was given on slide 43, the lambda was updating the value valueToUpdate, but you also had to pass totalIntegral to parallel_reduce. When does totalIntegral get the value of valueToUpdate?
- The aggregation of the thread private variable (valueToUpdate) is handled by the runtime and is guaranteed to be in totalIntegral at the end of the loop (for scalars) or at a fence (barrier). For more detailed information look here: https://github.com/kokkos/kokkos/wiki/Kokkos%3A%3Aparallel_reduce
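The ordering described above can be pictured with a small standard C++ analogy. This is a sketch under stated assumptions, not the Kokkos implementation: sketch_parallel_sum is a hypothetical helper, and its "workers" run one after another so that the example needs nothing beyond the standard library.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Hypothetical helper (not a Kokkos API): each simulated worker accumulates
// into its own private valueToUpdate; totalIntegral is written only once,
// after every worker has finished and the partial values were combined.
template <class Body>
void sketch_parallel_sum(std::size_t n, std::size_t numWorkers, Body body,
                         double& totalIntegral) {
  std::vector<double> partial(numWorkers, 0.0);
  for (std::size_t w = 0; w < numWorkers; ++w) {
    double valueToUpdate = 0.0;  // private to this worker
    for (std::size_t i = w; i < n; i += numWorkers) body(i, valueToUpdate);
    partial[w] = valueToUpdate;
  }
  // Combine step: in the analogy this is what the runtime does for you.
  double sum = 0.0;
  for (double p : partial) sum += p;
  totalIntegral = sum;  // the result becomes visible only here
}
```

Inside the loop body, valueToUpdate never refers to totalIntegral directly, which is why the caller should only read totalIntegral after the call has completed.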
- std::vector doesn't work on GPUs and thus shouldn't be used in portable code. But yes, for CPU-only code that will work.
- Kokkos and Kokkos::Experimental are the only public namespaces.
- It doesn't really matter from the Kokkos Core perspective. Most tools should also be fine. But avoiding spaces might make things like bash-based analysis of output easier, since tools like awk are much easier to use that way.
- Some memory isn't deallocated, devices might not be synchronized, sanitizers will complain, your code might crash.
- KOKKOS_LAMBDA is a macro that provides the capture clause and adds the annotations needed for CUDA or HIP.
- We have a new capability which allows you to provide multiple reducers. Basically you just provide multiple reducers or result places, and also multiple thread local variables for your code (the following does two default sum reductions and one min reduction):
double result1, result2, result3;
// lred1 and lred3 use the default sum; lred2 is combined with Kokkos::Min.
Kokkos::parallel_reduce("LABEL", N, KOKKOS_LAMBDA(int i, double& lred1, double& lred2, double& lred3) {
  ...
}, result1, Kokkos::Min<double>(result2), result3);