The Kokkos Lectures: Module 1 Q&A

Build Related

What minimum CMake version is required?

The minimum version is currently 3.10 as of Kokkos 3.1.1 but we will likely move that up soon.

Which part of `Kokkos` needs to be built?

Most of Kokkos is headers. A few simple components are precompiled, but that takes only a few seconds.

Can we use GPUs on Google Colab or Kaggle?

We have not tried, but if they can mount a Docker image it might work. The reason we have the AWS instances is that we can exert pretty fine-grained control over the configuration so that our tutorial users don’t run into any unnecessarily complex hardware issues.

Any plans to have `Kokkos` available in conda-forge?

There is currently no plan to add Kokkos to conda-forge.

General Kokkos Questions

How “heavy” is `Kokkos`? How much extra am I going to have to deal with if I adopt `Kokkos` into my project?

Kokkos is mostly header files, but you do need to build it and add it to your CMake or other build system. Compilation times will likely increase, but by how much really depends on your code. You can't get something for nothing, of course, so you have to weigh the developer time needed to write portable code without Kokkos against the extra compilation time of code written with Kokkos. Most of the extra cost is comparable to using standard algorithms and other template-based abstraction layers.

Is there PETSc support for `Kokkos`?

PETSc can perform some operations on GPUs and, in some modes, use 'user' pointers to data on the GPU that may come from a different GPU backend (e.g., Kokkos). Our data structures allow you to get to the raw pointers, and thus you can generally achieve full interoperability with any non-Kokkos library which is implemented directly in one of our backend models (CUDA, HIP, OpenMP). For details on GPU support in PETSc, see https://www.mcs.anl.gov/petsc/features/gpus.html

For optimizations like vectorization, does `Kokkos` still depend on compiler auto-vectorization?

There are two aspects to this. 1) Yes, we rely on auto-vectorization of loops, but help it along with things like `#pragma omp simd` or `#pragma ivdep`. 2) We have a SIMD package coming up (currently available standalone at https://github.com/kokkos/simd-math) which introduces portable vector types. These map directly to vector intrinsics and help, for example, with writing outer-loop vectorization code. We will cover this later in the lecture series.

How many GPUs can a single process running a `Kokkos` program make use of?

A single process can only use one GPU, but you can run multiple processes that each use a different GPU. Note that this answer may change in the not too distant future.

If HIP is enabled on CUDA, then will it create an executable which is a HIP version running on CUDA?

HIP can target NVIDIA GPUs via CUDA but we do not support that. If you target NVIDIA GPUs, use the CUDA backend. If you want to run on AMD, use HIP.

Does `Kokkos` support AMD GPU (backend with HIP) now? If not fully, what kind of capabilities does `Kokkos` have now?

More or less everything except for tasking and three-level parallelism should work.

Code Questions

How does `parallel_reduce` know what operator to use on the thread-local results?

You tell it, but it defaults to sum. We have a concept of a Kokkos reducer that allows you to customize every aspect of that.
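To make the idea of a customizable combine step concrete, here is a minimal sketch in plain standard C++. It is a serial model of the concept, not Kokkos code: `SumSketch`, `MaxSketch`, `reduce_sketch`, `sum_demo` and `max_demo` are hypothetical names invented for this illustration, and the chunks merely stand in for threads. Each chunk accumulates into its own local value, and the "reducer" supplies the identity value and the join operation used to combine those locals.

```cpp
#include <algorithm>
#include <cassert>
#include <limits>
#include <vector>

// Hypothetical stand-ins for reducers: an identity value plus a join operation.
struct SumSketch {
  static double identity() { return 0.0; }
  static double join(double a, double b) { return a + b; }
};

struct MaxSketch {
  static double identity() { return std::numeric_limits<double>::lowest(); }
  static double join(double a, double b) { return std::max(a, b); }
};

// Serial model of a parallel reduction (an assumption for illustration only):
// the index range is split into chunks that stand in for threads, each chunk
// updates its own local value, and the locals are joined at the end.
template <class Reducer, class Body>
double reduce_sketch(int n, int num_chunks, Body body) {
  assert(n >= 0 && num_chunks > 0);
  std::vector<double> partial(num_chunks, Reducer::identity());
  for (int c = 0; c < num_chunks; ++c) {
    const int begin = c * n / num_chunks;
    const int end = (c + 1) * n / num_chunks;
    for (int i = begin; i < end; ++i) body(i, partial[c]);
  }
  double result = Reducer::identity();
  for (double p : partial) result = Reducer::join(result, p);
  return result;
}

// Default-style sum: adds the indices 0..9.
double sum_demo() {
  return reduce_sketch<SumSketch>(10, 4, [](int i, double& local) { local += i; });
}

// Max reduction: largest value of i * (9 - i) for i in 0..9.
double max_demo() {
  return reduce_sketch<MaxSketch>(10, 4, [](int i, double& local) {
    const double v = i * (9 - i);
    if (v > local) local = v;
  });
}
```

In this sketch `sum_demo()` returns 45 and `max_demo()` returns 20. The loop body only ever touches its local value; which operator combines the locals is decided by the reducer type, which is the role a Kokkos reducer plays in the answer above.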

Can `Kokkos` see “within” the Functor? E.g. can I use a Functor with state to do a reduction?

A Functor can have an arbitrary amount of state, so in the actual loop body you can do whatever you want, including for a reduction. We also support reducers which can have state of their own; these are used by the runtime to combine the contributions of the various threads. Note: we are not sure yet how to support that in OpenMP Target, and it might be very slow in the end. For the other backends, reducers with state are fine.
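As an illustration of a functor that carries state, the following plain C++ sketch uses a call operator of the same shape as a reduction functor (an index plus a reference to the local accumulator). The names `ScaledSum`, `serial_reduce_sketch` and `scaled_sum_demo` are hypothetical, and the serial driver is only an assumed stand-in for the runtime, not Kokkos itself.

```cpp
#include <cassert>
#include <vector>

// A functor with state: a pointer to the input data and a scale factor.
struct ScaledSum {
  const double* data;
  double scale;

  // The loop body: reads the functor's state and updates the accumulator.
  void operator()(int i, double& update) const { update += scale * data[i]; }
};

// Hypothetical serial driver standing in for the runtime: calls the functor
// once per index and returns the accumulated value.
template <class Functor>
double serial_reduce_sketch(int n, const Functor& f) {
  assert(n >= 0);
  double result = 0.0;
  for (int i = 0; i < n; ++i) f(i, result);
  return result;
}

// Usage: sums 2 * x[i] over a three-element vector.
double scaled_sum_demo() {
  std::vector<double> x = {1.0, 2.0, 3.0};
  ScaledSum f{x.data(), 2.0};
  return serial_reduce_sketch(3, f);
}
```

Here `scaled_sum_demo()` returns 12. The driver never needs to look inside the functor; it only calls the operator, so the functor is free to hold whatever state the loop body needs.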

In the example that was given on slide 43, the lambda was updating the value `valueToUpdate`, but you also had to pass `totalIntegral` to `parallel_reduce`. When does `totalIntegral` get the value of `valueToUpdate`?

The aggregation of the thread-private variable (`valueToUpdate`) is handled by the runtime. The result is guaranteed to be in `totalIntegral` at the end of the loop (for scalars) or at a fence (barrier). For more detailed information, see https://github.com/kokkos/kokkos/wiki/Kokkos%3A%3Aparallel_reduce

Should I use `[&]` with `std::vector`?

`std::vector` doesn't work on GPUs and thus shouldn't be used in portable code. For CPU-only code, however, capturing it with `[&]` will work.
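For the CPU-only case, the difference between the two capture modes can be seen in a short standard C++ sketch. The function names `fill_by_reference` and `read_from_copy` are hypothetical and no Kokkos is involved; the sketch only shows what the capture clause does to a `std::vector`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// [&] captures the vector by reference: the lambda writes directly into the
// caller's vector, and no copy is made.
double fill_by_reference(std::size_t n) {
  assert(n > 0);
  std::vector<double> v(n, 0.0);
  auto body = [&](std::size_t i) { v[i] = 2.0 * i; };
  for (std::size_t i = 0; i < n; ++i) body(i);
  return v.back();
}

// [=] copies the whole vector into the lambda when the lambda is created.
// Later changes to the original are not seen by the lambda's copy.
double read_from_copy() {
  std::vector<double> v = {1.0, 2.0, 3.0};
  auto body = [=](std::size_t i) { return v[i]; };
  v[0] = 100.0;  // modifies the original only
  return body(0);
}
```

In this sketch `fill_by_reference(4)` returns 6 and `read_from_copy()` returns 1. With `[=]` the entire vector is deep-copied into the lambda, which is one reason why `[&]` is the natural choice when a `std::vector` is used in CPU-only code.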

What namespaces are in `Kokkos`?

Kokkos and Kokkos::Experimental are the only public namespaces.

Should spaces be avoided in the kernel names?

From the Kokkos Core perspective it doesn't really matter, and most tools should also be fine. Avoiding spaces may make bash-based analysis of output easier, though, since tools like awk are much simpler to use that way.

What happens if you don’t call finalize()?

Some memory isn't deallocated, devices might not be synchronized, sanitizers will complain, your code might crash.

What is the difference between `KOKKOS_LAMBDA` and the standard C++ lambda notation?

`KOKKOS_LAMBDA` is a macro that supplies the capture clause and adds the necessary annotations for CUDA or HIP.
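The mechanism can be sketched with a hypothetical macro in plain C++. `SKETCH_LAMBDA` and `apply_offset_demo` are names made up for this illustration. The sketch assumes a host-only build, where the macro contributes nothing beyond a by-value capture clause; in a CUDA or HIP build the real macro would additionally attach the backend's function annotations.

```cpp
#include <cassert>

// Hypothetical macro for illustration only: it expands to a by-value capture
// clause. A GPU build would append backend-specific annotations here.
#define SKETCH_LAMBDA [=]

// The macro takes the place of the capture clause; the parameter list and
// body are written as for any other lambda.
int apply_offset_demo() {
  int offset = 5;
  auto f = SKETCH_LAMBDA(int i) { return i + offset; };
  return f(10);
}
```

Here `apply_offset_demo()` returns 15. Because the macro takes the place of the capture clause, the same source line can expand differently depending on which backend is enabled.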

How to handle multiple reduction variables, like reduction(+:val1, val2) [from OpenMP]?

We have a new capability which allows you to provide multiple reducers. You provide multiple reducers or result locations, along with a matching thread-local variable for each in your lambda. The following does two default sum reductions and one min reduction:

double result1, result2, result3;
Kokkos::parallel_reduce("LABEL", N, KOKKOS_LAMBDA(int i, double& lred1, double& lred2, double& lred3) {
   ...
},result1,Kokkos::Min<double>(result2),result3);
