The Kokkos Lectures: Module 3 Q&A
Christian Trott edited this page Aug 7, 2020 · 3 revisions
- Kokkos is Greek for "seed" or "grain". Kokkos originally started as a linear algebra portability layer, where the linear kernels would make up the "grains" of the solvers. Now we consider Kokkos just another word for "Performance Portability" ;-)
It was mentioned earlier that a View can be used as a reduction argument. Is there an example of that somewhere?
- Kokkos::View<int> result("R"); Kokkos::parallel_reduce("Reduction", N, my_functor, result);
- Note that this is potentially asynchronous, so you need a fence before you read the result. More examples / conditions here: https://github.com/kokkos/kokkos/wiki/Kokkos%3A%3Aparallel_reduce
If I create two host mirror views of a view which is allocated on CudaSpace, do the two host mirror views point to the same memory address on the HostSpace, or they are able to store different data?
- They will be two different allocations. create_mirror_view etc. do NOT imply any permanent connection. They just create a new View, either by copy construction or by allocating in the respective memory space.
- If you create the second one from the first host mirror view, they’ll point to the same data though.
- No. For various reasons this is hard to do; allocation on a GPU is very hard.
- Also, if you want push_back in a concurrent data structure, you must acquire a lock for every read or write access too, since you must make sure that a data access doesn't race with a reallocation. Thus even pure use of the data structure, without ever calling push_back, would be horribly slow.
- Our typical approach for this is Count, Allocate, Fill, often combined with an opportunistic Count/Fill combination. I.e. we run a kernel where we fill and count how many we filled; if we run out of space, we continue to count. After the kernel, if the count is larger than the data structure, we reallocate and fill again from scratch. This costs you at most 2x over a successful fill where you always know the size in advance. And 2x isn't all that bad compared to the performance impact dynamic data structures would have.
- We do have applications that definitely need parallel push_back-like patterns no matter how you look at it, and it basically comes down to this: those applications are really hard to write. It’s possible and we have some patterns to handle it, but basically you want to try everything else first.
- Hierarchical parallelism is discussed in Module 4.
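The Count, Allocate, Fill approach from the push_back answer above can be sketched roughly as follows. This is a minimal illustration in plain C++, not Kokkos code: the function names, the "keep multiples of three" predicate, and the use of `std::vector` in place of a View are all assumptions made for the example. The loop runs serially here to keep the sketch dependency-free, but the slot is claimed with an atomic `fetch_add`, which is the part that would let the same body run from many threads.

```cpp
#include <atomic>
#include <cassert>
#include <cstddef>
#include <vector>

// One opportunistic Count/Fill pass over [0, n).
// Every kept item claims a slot atomically; items that do not fit are
// still counted, but not written. Returns the total number of kept items.
std::size_t count_and_fill(std::vector<int>& out, int n) {
  std::atomic<std::size_t> count{0};
  for (int i = 0; i < n; ++i) {           // stand-in for a parallel loop
    if (i % 3 != 0) continue;             // hypothetical "keep" predicate
    const std::size_t slot = count.fetch_add(1);
    if (slot < out.size()) out[slot] = i; // out of space: count only
  }
  return count.load();
}

// Count, Allocate, Fill: try with a guessed capacity, and if the count
// shows it was too small, reallocate once and fill again from scratch.
std::vector<int> collect_multiples_of_three(int n, std::size_t guessed_capacity) {
  assert(n >= 0);
  std::vector<int> out(guessed_capacity);
  std::size_t needed = count_and_fill(out, n);
  if (needed > out.size()) {
    out.assign(needed, 0);                // allocate the exact size
    needed = count_and_fill(out, n);      // second (and last) pass
  }
  out.resize(needed);                     // trim unused slots
  return out;
}
```

With a good guess the data is produced in one pass; with a bad guess the work is done twice, which is where the "at most 2x" figure comes from. In a truly parallel run the order of the stored items would not be deterministic.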
For tiling size, what do you mean “fail”? Failing in getting an optimal performance, or getting a runtime error?
- Failing at run time.
Do these parallel_for with work tags still need a fence() in between or are they guaranteed to be called one after the other?
- Work tags don't change anything about synchronization. All parallel_fors are guaranteed to be called in order as long as they’re on the same execution space instance.
- Yes.
- No, we don’t have strided slicing.
For libraries that rely heavily on explicit template instantiation, what do you recommend doing about the type of Kokkos::subview?
- You can use LayoutStride, or you can do a using somewhere that has the decltype of the relevant subview.
- If you have ETI you can still have deduced types. You just put them in the templates before you instantiate them, and trigger the type deduction when you explicitly instantiate the templates, just like any other deduced type.
- LayoutStride will be less efficient in many cases than the actual deduced type of the subview
- Yes, and arbitrary types (with certain restrictions that are probably less strict than you might think).
- Yes, we actually support arbitrary atomic operations. They’re not always cheap, but they’re supported.
- No, we need to fix that in the slide. They return bool.
In run_a/run_c - must we call "modify_device()" prior to modification? Or is it sufficient to just call it before the next "sync_host()" call?
- The latter.
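To see why the order does not matter, here is a minimal sketch of the flag logic, assuming the modified flag is only consulted inside sync_host(), as the answer implies. `TinyDualView` is a hypothetical stand-in written for this example; it is not Kokkos::DualView and it models "device" memory with an ordinary `std::vector`.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Stand-in for illustration only; this is NOT Kokkos::DualView.
class TinyDualView {
 public:
  explicit TinyDualView(std::size_t n) : host_(n, 0), device_(n, 0) {
    assert(n > 0);
  }
  std::vector<int>& device() { return device_; }
  const std::vector<int>& host() const { return host_; }

  // Only records that the device copy is newer; it moves no data.
  void modify_device() { device_modified_ = true; }

  // The flag is read here, and only here: copy if the device is newer.
  void sync_host() {
    if (device_modified_) {
      host_ = device_;
      device_modified_ = false;
    }
  }

 private:
  std::vector<int> host_, device_;
  bool device_modified_ = false;
};

// Write first, mark afterwards: the host still sees the new value.
int mark_after_write() {
  TinyDualView dv(4);
  dv.device()[2] = 7;
  dv.modify_device();
  dv.sync_host();
  return dv.host()[2];
}

// Never marking means sync_host() has nothing to do; the host stays stale.
int forget_to_mark() {
  TinyDualView dv(4);
  dv.device()[2] = 7;
  dv.sync_host();
  return dv.host()[2];
}
```

In this model `mark_after_write()` yields the updated value and `forget_to_mark()` yields the stale one, so what matters is that the mark happens at some point before the sync, not that it precedes the write.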
Could the DualView interface be reduced to a single call that both syncs and returns either a const or non-const view?
- Yes, we probably could design an interface like that.
It seems like DualView and the UVM memory space can serve similar roles. What are the use cases where each would be most advantageous?
- DualView works on platforms where UVM-like automatic page migration is not available (AMD doesn't have that right now).
- Furthermore, the transfer rates of DualView are determined by bandwidth, not by page fault latency as in UVM (which can reduce effective bandwidth even on NVLink-based systems to less than 10 GB/s, as opposed to the 100 GB/s you can get with memcpy).
- UVM can have significant advantages if you only need to access a few elements of a View though, since it can transfer only that one page.
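As a rough back-of-the-envelope check of the bandwidth figures above, the helper below converts a sustained rate into a transfer time. The function name and the 8 GB View size used in the note are made up for illustration; only the 100 GB/s and 10 GB/s rates come from the answer.

```cpp
#include <cassert>

// Seconds needed to move `bytes` at a sustained rate of `gb_per_s`,
// using 1 GB = 1e9 bytes. Latency and setup costs are ignored.
double transfer_seconds(double bytes, double gb_per_s) {
  assert(gb_per_s > 0.0);
  return bytes / (gb_per_s * 1e9);
}
```

For a hypothetical 8 GB View, `transfer_seconds(8e9, 100.0)` is about 0.08 s for a bulk copy, while `transfer_seconds(8e9, 10.0)` is about 0.8 s at the page-fault-limited rate, a tenfold difference when the whole View is needed on the other side.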