Upcoming CUB/RDC changes #573
Replies: 2 comments
-
Note that this is documented in the CUDA Programming Guide
-
Addition: unless there's another way to address this, we should make the new behaviour opt-in so that it's possible to work around compilation issues. Meanwhile, we'll tune CUB for the case of RDC without |
-
Relocatable Device Code (RDC) doesn't work well with __launch_bounds__ in some cases:
When the code above is compiled, we get the following issues:
indicating a mismatch in register allocations between the kernel and the device function. Extensive use of __launch_bounds__ in CUB leads to compilation issues in the presence of RDC. To address this, we are going to create an indirection level that does not specify __launch_bounds__ for CUB kernels in the RDC case. We are also going to clamp the threads per block at 128 when RDC is used. This is needed so as not to exceed the register count available per thread block (see the following PR for details).
The change is expected to degrade performance in the RDC case. If you have alternative suggestions, please feel free to participate in the discussion below.
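For readers who want a concrete picture of the proposed indirection, here is a minimal sketch. It is not CUB's actual implementation: every `SKETCH_`/`sketch_` name is hypothetical, and the only pieces taken from the post are the 128-thread cap and the idea of dropping __launch_bounds__ under RDC. It assumes nvcc defines `__CUDACC_RDC__` when compiling with `-rdc=true`, and it guards the kernel so the file also builds with a plain host compiler.

```cpp
// Assumption: nvcc defines __CUDACC_RDC__ in relocatable device code mode.
#if defined(__CUDACC_RDC__)
#  define SKETCH_RDC 1
#else
#  define SKETCH_RDC 0
#endif

// Cap from the post: at most 128 threads per block when RDC is used.
constexpr int sketch_rdc_max_threads = 128;

// Clamp a tuned block size only when RDC is enabled (hypothetical helper).
constexpr int sketch_block_threads(int tuned, bool rdc = SKETCH_RDC != 0) {
  return (rdc && tuned > sketch_rdc_max_threads) ? sketch_rdc_max_threads
                                                 : tuned;
}

// Indirection level: expands to nothing under RDC (and outside nvcc),
// otherwise forwards to the real __launch_bounds__ qualifier.
#if defined(__CUDACC__) && !SKETCH_RDC
#  define SKETCH_LAUNCH_BOUNDS(threads) __launch_bounds__(threads)
#else
#  define SKETCH_LAUNCH_BOUNDS(threads)
#endif

#if defined(__CUDACC__)
// Hypothetical kernel that uses the indirection instead of a hard-coded
// __launch_bounds__; each thread writes one element.
template <int BlockThreads>
__global__ void SKETCH_LAUNCH_BOUNDS(BlockThreads) sketch_kernel(int *out) {
  out[blockIdx.x * BlockThreads + threadIdx.x] = 1;
}
#endif
```

Under these assumptions, a launch would pass the same clamped value as both the template argument and the block dimension, for example `sketch_kernel<sketch_block_threads(256)><<<grid, sketch_block_threads(256)>>>(out)`, so the kernel and its launch configuration stay in agreement whether or not RDC is enabled.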