
The Kokkos Lectures: Module 2 Q&A


Exercise 1

The actual solution file used a KOKKOS_LAMBDA macro instead of [=]; is that a standard feature of Kokkos?

  • Yes, it adds the necessary function markup to the [=] capture clause so that device code is generated for CUDA and HIP. KOKKOS_LAMBDA is introduced in Module 2.
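A minimal sketch of the macro in use (the label, N, and the View are illustrative):

#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  Kokkos::initialize(argc, argv);
  {
    const int N = 100;
    Kokkos::View<double*> x("x", N);
    // KOKKOS_LAMBDA expands to [=] plus the host/device markup
    // required by CUDA and HIP builds.
    Kokkos::parallel_for("fill", N, KOKKOS_LAMBDA(const int i) {
      x(i) = 1.0 * i;
    });
  }
  Kokkos::finalize();
}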

Must the loop variables M and N be different for distinct Kokkos::parallel_for calls, or is that just my misunderstanding?

  • The two arrays have different lengths, so the parallel loops need different iteration counts. In general, parallel loops are independent of each other.

Could one instruct Kokkos to use UVM?

  • Yes. Kokkos can use UVM and does so by default if you compile with -DKokkos_ENABLE_CUDA_UVM=ON. That changes the default memory space of the Kokkos::Cuda execution space. Independent of that setting, both UVM and non-UVM allocations are available via CudaUVMSpace and CudaSpace, respectively. See the Module 2 lecture.
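For example, independent of that default, the memory space can be selected explicitly (a minimal sketch assuming a CUDA-enabled build; N is illustrative):

Kokkos::View<double*, Kokkos::CudaUVMSpace> u("u", N);  // UVM allocation
Kokkos::View<double*, Kokkos::CudaSpace>    d("d", N);  // device-only allocation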

Kokkos::kokkos_malloc seems redundant.

  • It is, but it resolves some otherwise annoying ambiguous-overload issues, both within Kokkos itself and for the naughty people who type using namespace Kokkos.
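A minimal sketch of the malloc-style interface (the label "p" and the size are illustrative), paired with Kokkos::kokkos_free:

double* p = static_cast<double*>(
    Kokkos::kokkos_malloc<Kokkos::HostSpace>("p", N * sizeof(double)));
// ... use p like a raw allocation ...
Kokkos::kokkos_free<Kokkos::HostSpace>(p);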

Is it also possible to instruct Kokkos to prefetch data?

  • Kokkos does some prefetching for you, but no, we don’t have a performance-portable wrapper for prefetching. Generally we try to focus on not moving data whenever possible, so prefetching might not be as important as you think. Prefetching largely occurs for some operations involving UVM Views (like setting all elements to a value) and for operations with the DualView class we will introduce in Module 3.

Views

Why is "double***" chosen instead of two parameters such as "double,3" to specify scalar type and rank of a View

  • The primary reason is that such a notation doesn't allow differentiating runtime and compile-time dimensions. In mdspan for C++23 we have a different notation via two arguments, so that View<double**[5][7]> becomes basic_mdspan<double, extents<dynamic_extent, dynamic_extent, 5, 7>>. While this allows, for example, an arbitrary order of runtime and compile-time dimensions, it is much more verbose.
  • A second reason is that this notation imitates C-array notation and is thus familiar. Note: we never create pointers to pointers, though!
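For example, two runtime extents followed by the two compile-time extents from the View<double**[5][7]> notation above (the runtime extents 10 and 20 are illustrative):

Kokkos::View<double**[5][7]> v("v", 10, 20);  // a 10 x 20 x 5 x 7 array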

How do I create a View from an existing array? Say I want to create a rank-1 View from my int* (that already has elements). Can the View just point to my int*?

  • Yes, you can create an unmanaged View by providing a pointer instead of a label. No reference counting will be done, but you should make sure the MemorySpace argument matches the pointer you provide:
int* ptr = new int[N];
// Unmanaged View: wraps the existing allocation; Kokkos does no
// reference counting and will not free ptr.
Kokkos::View<int*, Kokkos::HostSpace> a(ptr, N);

Normally I pass the address of an array as &a_kokkos_array to write out the full array. However, what is the best way to extract a chunk of a multi-dimensional Kokkos array and collectively write it out to a file without a temporary copy?

  • Kokkos::subview will be covered in Module 4 next week.

Can I set a different precision for a host View than for its device View?

  • In the sense of mirroring: no. In the sense of general Views: kind of, but it will likely prevent you from doing a deep_copy. deep_copy requires either that a bit-wise memcpy is valid, or that an execution space exists which can run a kernel accessing both Views. So you can do a converting deep_copy from HostSpace to CudaUVMSpace, or from CudaSpace to CudaHostPinnedSpace, but you can't do a converting deep_copy from CudaSpace to HostSpace, since neither the GPU nor the CPU can access both. If your endpoints are those two, you need to allocate a third buffer View and do two deep_copy calls: one crossing the device barrier with a memcpy, and the other doing the conversion, as sketched below.
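A sketch of that staging pattern, assuming double data in CudaSpace and float data in HostSpace (names and N are illustrative):

Kokkos::View<double*, Kokkos::CudaSpace> dev("dev", N);
Kokkos::View<float*, Kokkos::HostSpace>  host("host", N);
Kokkos::View<float*, Kokkos::CudaSpace>  staging("staging", N);

Kokkos::deep_copy(staging, dev);   // converting copy, runs as a GPU kernel
Kokkos::deep_copy(host, staging);  // same type and layout: bit-wise memcpy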

Layout Questions

Would the compile time view dimension specification work in the same way for both left and right memory layout types?

  • Yes. Layouts are independent of the rank description.

With LayoutLeft, is it still required that the runtime-sized dimensions come first in the View?

  • Yes
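For example (a sketch; N is the runtime extent, and the trailing extent 3 is fixed at compile time for both layouts):

Kokkos::View<double*[3], Kokkos::LayoutLeft>  a("a", N);  // N x 3, column-major strides
Kokkos::View<double*[3], Kokkos::LayoutRight> b("b", N);  // N x 3, row-major strides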

Why does LayoutRight jump up for GPU at large sizes in the plot comparing layouts?

  • The row length gets really short, so the stride between elements gets small and thus you are getting close to coalesced access again. Furthermore, if it is short enough, caching sets in, so you do not throw away bandwidth to global memory.

Could I change the inner and outer iteration order in the MDRangePolicy to achieve comparably high performance on the GPU even when using LayoutRight, for example?

  • Yes. MDRangePolicy will be covered next week.

What is the default data layout with CudaUVMSpace?

  • LayoutLeft

General Kokkos Questions

Does Kokkos compile with MSVC on Windows?

  • Yes, since 3.1. It is now part of our pull request testing, actually.
  • We are also starting to support CUDA on Windows in Visual Studio, but the build setup is iffy, and it's not well tested yet.

Why is initialize not an object that would be destroyed when going out of scope (and thus no need for finalize)?

  • We have Kokkos::ScopeGuard, which does exactly that.
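A minimal sketch of its use:

#include <Kokkos_Core.hpp>

int main(int argc, char* argv[]) {
  // The guard calls Kokkos::initialize here and Kokkos::finalize
  // automatically when it goes out of scope, even on early return.
  Kokkos::ScopeGuard guard(argc, argv);
  Kokkos::parallel_for(10, KOKKOS_LAMBDA(const int i) { (void)i; });
  return 0;
}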

In principle, could the Kokkos function/lambda portability macros be policy-type template parameters to ParallelFunctor/parallel_for? (I'm guessing there's some kind of semantic gotcha in doing it that way.)

  • Short answer: No.
  • Longer answer: it might be possible to contrive something with Clang which makes host/device part of the overload resolution, but NVCC doesn't. Likely this would require a customization mechanism for overload resolution in C++, so it would probably never be done in a standardized way.

Is there any support for CUDA tensor cores? And does the memory layout need to be adjusted?

  • Not in Kokkos Core directly. Remember, tensor cores are NOT general-purpose cores; they only support very specific math operations. The way these will be exposed is through KokkosKernels. If you do, for example, GEMM calls with the right compile-time-sized Views, of the right data type and the right layout, the call could directly map to tensor cores. This support doesn't exist right now, but I believe folks are looking at it.

Are all calls to parallel_X synchronous?

  • No. parallel_for is asynchronous. For parallel_reduce it depends on what you are reducing into (reducing into a scalar blocks). For the most part, you should think of parallel_X calls as asynchronous. We will discuss more of this in Module 5.
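A sketch of the two behaviors (loop bodies are placeholders; N is illustrative):

// Asynchronous: may return before the kernel finishes; fence to be sure.
Kokkos::parallel_for("work", N, KOKKOS_LAMBDA(const int i) { /* ... */ });
Kokkos::fence();

// Blocking: reducing into a plain scalar waits for the result.
double sum = 0.0;
Kokkos::parallel_reduce("sum", N,
    KOKKOS_LAMBDA(const int i, double& lsum) { lsum += 1.0; }, sum);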

Does Kokkos only use the default stream on CUDA?

  • By default, yes, but you can specify a stream when you construct a Cuda execution space instance, which allows you to use other streams. You then pass the instance to the policy constructor to use it in your parallel algorithm. We will discuss this in Module 5.
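A sketch of that pattern, assuming a CUDA-enabled build (the stream handle remains owned by the caller):

cudaStream_t stream;
cudaStreamCreate(&stream);

Kokkos::Cuda instance(stream);  // execution space instance on this stream
Kokkos::RangePolicy<Kokkos::Cuda> policy(instance, 0, N);
Kokkos::parallel_for("on_stream", policy, KOKKOS_LAMBDA(const int i) { /* ... */ });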

If I use a View for a host scalar, will the parallel_reduce become non-blocking?

  • Yes/maybe. Semantically yes; we are not 100% sure we have tested that it truly is.
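A sketch of reducing into a host-space View instead of a plain scalar:

Kokkos::View<double, Kokkos::HostSpace> result("result");
Kokkos::parallel_reduce("sum", N,
    KOKKOS_LAMBDA(const int i, double& lsum) { lsum += 1.0; }, result);
// Semantically non-blocking: fence before reading the result on the host.
Kokkos::fence();
double value = result();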

Does the new reduction feature also allow the possibility to reduce an array of variables (using the same reduction, e.g. sum)?

  • No, but there is already a way to do that with explicit functors. See https://github.com/kokkos/kokkos/wiki/Kokkos%3A%3Aparallel_reduce under Requirements, with the ReducerArgument being a pointer or array. It lists what you need to put into your functor: largely a value_count member and a value_type typedef.
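A minimal sketch of such a functor (names are illustrative; newer Kokkos versions accept the non-volatile join shown here, while older versions required volatile-qualified signatures):

struct ColumnSums {
  typedef double value_type[];  // marks the reduction result as an array
  const int value_count;        // runtime length of the result array
  Kokkos::View<const double**> data;

  ColumnSums(Kokkos::View<const double**> d)
      : value_count(static_cast<int>(d.extent(1))), data(d) {}

  KOKKOS_INLINE_FUNCTION
  void operator()(const int i, value_type sums) const {
    for (int j = 0; j < value_count; ++j) sums[j] += data(i, j);
  }

  KOKKOS_INLINE_FUNCTION
  void join(value_type dst, const value_type src) const {
    for (int j = 0; j < value_count; ++j) dst[j] += src[j];
  }

  KOKKOS_INLINE_FUNCTION
  void init(value_type sums) const {
    for (int j = 0; j < value_count; ++j) sums[j] = 0.0;
  }
};

// Usage: reduce into a raw array of length data.extent(1), e.g.
// std::vector<double> sums(data.extent(1));
// Kokkos::parallel_reduce(data.extent(0), ColumnSums(data), sums.data());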