The Kokkos Lectures: Module 2 Q&A
Christian Trott edited this page Jul 25, 2020 · 3 revisions
The actual solution file used a `KOKKOS_LAMBDA` macro instead of `[=]`, is that a standard feature of Kokkos?
- Yes, it adds the necessary function markup to the `[=]` capture clause so the compiler can generate device code under CUDA and HIP. `KOKKOS_LAMBDA` is introduced in Module 2.
Must the loop variables M and N be different for distinct `Kokkos::parallel_for` directives, or is that just my misunderstanding?
- The two arrays have different lengths so the parallel loops need different iteration counts. In general parallel loops are independent of each other.
- Yes, Kokkos can use UVM and does so by default if you compile with `-DKokkos_ENABLE_CUDA_UVM=ON`. That changes the default memory space of the `Kokkos::Cuda` execution space. Independent of that setting, both UVM and non-UVM allocations are available via `CudaUVMSpace` and `CudaSpace`, respectively. See the Module 2 lecture.
- It is, but it resolves some otherwise annoying ambiguous-overload issues, both within Kokkos itself and for the naughty people who type `using Kokkos`.
- Kokkos does some prefetching for you, but no, we don’t have a performance portable wrapper for prefetching. Generally we try to focus on not moving data whenever possible, so it might not be as important to do prefetching as you think. Prefetching largely occurs for some operations involving UVM views (like setting all elements to a value) and operations with the DualView class we will introduce in Module 3.
Why is `double***` chosen instead of two parameters such as `double, 3` to specify the scalar type and rank of a View?
- The primary reason is that such a notation doesn't allow for differentiation of runtime and compile-time dimensions. In mdspan for C++23 we have a different notation via two arguments, so that `View<double**[5][7]>` becomes `basic_mdspan<double, extents<dynamic_extent, dynamic_extent, 5, 7>>`. While this allows, for example, arbitrary ordering of runtime and compile-time dimensions, it is much more verbose.
- A second reason is that this notation imitates C-array notation and thus is familiar to people. Note: we never actually create pointers to pointers, though!
How do I create a View from an existing array? Say I want to create a rank-1 View from my `int*` (that already has elements). Can the View just point to my `int*`?
- Yes, you can create an unmanaged View by providing a pointer instead of the label. No reference counting will be done, but you should make sure the memory-space argument matches the pointer you provide:
```c++
int* ptr = new int[N];
Kokkos::View<int*, Kokkos::HostSpace> a(ptr, N);
```
Normally I pass the address of an array as `&an_kokkos_array` to write out the full array. However, what is the best way to extract a chunk of a multi-dimensional Kokkos array and collectively write it out to a file without a temporary copy?
- Kokkos::subview will be covered in module 4 next week.
- In the sense of mirroring: no. In the sense of general Views: kind of. It will likely prevent you from doing a `deep_copy`. `deep_copy` requires either that a bitwise memcpy is valid, or that an execution space exists which can run a kernel accessing both Views. So you can do a converting `deep_copy` from `HostSpace` to `CudaUVMSpace`, or from `CudaSpace` to `CudaHostPinnedSpace`, but you can't do a converting `deep_copy` from `CudaSpace` to `HostSpace`, since neither the GPU nor the CPU can access both. So if your endpoints are those two, you need to allocate a third buffer View and do two `deep_copy` operations: one crossing the device barrier with a memcpy, and the other doing the conversion.
Would the compile time view dimension specification work in the same way for both left and right memory layout types?
- Yes. Layouts are independent of the rank description.
- Yes
- The row length gets really short, so the stride between elements gets small and thus you are getting close to coalesced access again. Furthermore, if it is short enough, caching sets in, so you do not throw away bandwidth to global memory.
Could I change the inner and outer iteration order in the MDRangePolicy to achieve comparably high performance on the GPU even using the right layout, for example?
- Yes. MDRangePolicy will be covered next week.
- LayoutLeft
- Yes since 3.1. It is now part of our pull request testing actually.
- We are also starting to support CUDA on Windows in Visual Studio, but the build setup is iffy and it's not well tested yet.
Why is initialize not an object that would be destroyed when going out of scope (and thus no need for finalize)?
- We have Kokkos::ScopeGuard that does exactly that.
In principle, could Kokkos function/lambda portability macros be policy-type template parameters to ParallelFunctor/parallel_for? (I'm guessing there's some kind of semantic gotcha in doing it that way.)
- Short answer: No.
- Longer answer: it might be possible to contrive something with clang which makes host/device part of the overload resolution. But NVCC doesn't. Likely this would require essentially a customization mechanism of the overload resolution mechanism in C++ - so probably it would never be done in a standardized way.
- Not in Kokkos Core directly. Remember, tensor cores are NOT general-purpose cores; they only support very specific math operations. The way these will be exposed is through KokkosKernels. If you do, for example, GEMM calls with the right compile-time-sized Views, of the right data type and the right layout, the call could map directly to tensor cores. This support doesn't exist right now, but I believe folks are looking at it.
- No. parallel_for is async. For parallel_reduce it depends what you are reducing into (reducing into a scalar blocks). For the most part, you should think of parallel_X calls and such as asynchronous. We will discuss more of this in Module 5.
- The default is yes, but you can specify a stream when you build a Cuda execution space instance, which allows you to use other streams. You pass it to the policy constructor to use it in your parallel algorithm then. We will discuss this in Module 5.
- Yes/maybe. Semantically yes, not 100% sure we tested that it truly is.
Does the new reduction feature also allow the possibility to reduce an array of variables (using the same reduction, e.g. sum)?
- There is already a way to do that, but only with explicit functors. See https://github.com/kokkos/kokkos/wiki/Kokkos%3A%3Aparallel_reduce under Requirements, with the ReducerArgument being a pointer or array. It lists what you need to put into your functor: chiefly a `value_count` member and a `value_type` typedef.