This repository has been archived by the owner on Mar 21, 2024. It is now read-only.
# CUB 2.0.0 (#551)

alliepiper announced in Announcements
## Summary

The CUB 2.0.0 major release adds a dependency on libcu++ and contains several breaking changes. These include new diagnostics when inspecting device-only lambdas from the host, an updated method of determining accumulator types for algorithms like Reduce and Scan, and a compile-time replacement for the runtime `debug_synchronous` debugging flags.

This release also includes several new features. `DeviceHistogram` now supports `__half` and better handles various edge cases. `WarpReduce` now performs correctly when restricted to a single-thread “warp”, and will use the `__reduce_add_sync` accelerated intrinsic (introduced with Ampere) when appropriate. `DeviceRadixSort` learned to handle the case where `begin_bit == end_bit`.

Several algorithms also have updated documentation, with a particular focus on clarifying which operations can and cannot be performed in-place.
## Breaking Changes

- `NV_IF_TARGET` ports (#448): Add libcu++ dependency (v1.8.0+).
- `NV_IF_TARGET` ports (#448): The following macros are no longer defined by default. They can be re-enabled by defining `CUB_PROVIDE_LEGACY_ARCH_MACROS`. These will be completely removed in a future release.
  - `CUB_IS_HOST_CODE`: Replace with `NV_IF_TARGET`.
  - `CUB_IS_DEVICE_CODE`: Replace with `NV_IF_TARGET`.
  - `CUB_INCLUDE_HOST_CODE`: Replace with `NV_IF_TARGET`.
  - `CUB_INCLUDE_DEVICE_CODE`: Replace with `NV_IF_TARGET`.
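For code that used the legacy macros, the `NV_IF_TARGET` replacement looks roughly like this. This is an illustrative sketch, assuming a CUDA Toolkit that ships libcu++'s `<nv/target>` header; `where_am_i` is a made-up example function, not a CUB API:

```cuda
#include <cstdio>
#include <nv/target>

// CUB 1.x style (now requires defining CUB_PROVIDE_LEGACY_ARCH_MACROS):
//
//   #if CUB_IS_DEVICE_CODE
//     printf("device\n");
//   #elif CUB_IS_HOST_CODE
//     printf("host\n");
//   #endif

// CUB 2.0 style: a single definition; the branch is selected per
// compilation target rather than by preprocessor state.
__host__ __device__ void where_am_i()
{
  NV_IF_TARGET(NV_IS_DEVICE,
               (printf("device\n");),  // emitted only in device passes
               (printf("host\n");));   // emitted only in host passes
}
```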
- CUB's CUDA Runtime support macros have been updated to support `NV_IF_TARGET`. They are now defined consistently across all host/device compilation passes. This should not affect most usages of these macros, but may require changes for some edge cases.
  - `CUB_RUNTIME_FUNCTION`: Execution space annotations for functions that invoke CUDA Runtime APIs.
    - Old behavior:
      - RDC enabled: defined to `__host__ __device__`.
      - RDC not enabled:
        - NVCC host pass: defined to `__host__ __device__`.
        - NVCC device pass: defined to `__host__`.
    - New behavior:
      - RDC enabled: defined to `__host__ __device__`.
      - RDC not enabled: defined to `__host__`.
  - `CUB_RUNTIME_ENABLED`: No change in behavior, but no longer used in CUB. Provided for legacy support only. Legacy behavior:
    - RDC enabled: defined.
    - RDC not enabled:
      - NVCC host pass: defined.
      - NVCC device pass: not defined.
  - `CUB_RDC_ENABLED`: New macro, may be combined with `NV_IF_TARGET` to replace most usages of `CUB_RUNTIME_ENABLED`. Behavior:
    - RDC enabled: defined.
    - RDC not enabled: not defined.
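A sketch of the intended replacement pattern, assuming CUB 2.0's configuration headers; `dispatch` is a hypothetical function, not part of CUB:

```cuda
#include <cub/config.cuh> // defines CUB_RDC_ENABLED when RDC is on
#include <nv/target>

__host__ __device__ void dispatch()
{
#ifdef CUB_RDC_ENABLED
  // RDC build: device code may invoke CUDA Runtime APIs, so a
  // device-side dispatch path exists alongside the host one.
  NV_IF_TARGET(NV_IS_DEVICE,
               (/* device-side dispatch via the CUDA Runtime */),
               (/* host-side dispatch */));
#else
  // Non-RDC build: only the host path is compiled.
  /* host-side dispatch */
#endif
}
```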
- A compile-time error is now emitted when a `__device__`-only lambda’s return type is queried from host code (requires libcu++ ≥ 1.9.0). Work-arounds:
  - Use a named function object with a `__device__`-only implementation of `operator()`.
  - Use a `__host__ __device__` lambda.
  - Use `cuda::proclaim_return_type` (added in libcu++ 1.9.0).
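The third work-around might look like this. This is a sketch, assuming nvcc's `--extended-lambda` mode and libcu++ ≥ 1.9.0; `build_reduction_op` is a made-up example function:

```cuda
#include <cuda/functional> // cuda::proclaim_return_type

void build_reduction_op()
{
  // Host code cannot reliably query this lambda's return type:
  //   auto op = [] __device__ (int a, int b) { return a + b; };
  // Proclaiming the return type makes it visible to host-side dispatch:
  auto op = cuda::proclaim_return_type<int>(
      [] __device__ (int a, int b) { return a + b; });
  (void)op; // pass `op` to a device algorithm, e.g. a reduction
}
```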
- Use the result type of the binary reduction operator for accumulating intermediate results in the `DeviceReduce` algorithm, following guidance from http://wg21.link/P2322R6.
  - Interfaces in the `Dispatch*Reduce` layer have changed:
    - `DispatchReduce` now accepts `init` as the initial type instead of the output iterator value type.
    - `DispatchSegmentedReduce` has analogous changes.
  - Thread operators now take their parameters by `const&` and return `auto`: `Equality`, `Inequality`, `InequalityWrapper`, `Sum`, `Difference`, `Division`, `Max`, `ArgMax`, `Min`, `ArgMin`.
  - `ThreadReduce` now accepts an accumulator type and uses a different type for `prefix`.
- Use the result type of the binary operator for accumulating intermediate results in the `DeviceScan`, `DeviceScanByKey`, and `DeviceReduceByKey` algorithms, following guidance from http://wg21.link/P2322R6. Interfaces in the `Dispatch` layer have changed:
  - `DispatchScan` now accepts the accumulator type as a template parameter.
  - `DispatchScanByKey` now accepts the accumulator type as a template parameter.
  - `DispatchReduceByKey` now accepts the accumulator type as the last template parameter.
- Deprecated the runtime `debug_synchronous` flags on device algorithms. Define `CUB_DEBUG_SYNC` during compilation to enable these checks.

## New Features
- Support `__half` in `DeviceHistogram`.
- Support single-thread “warp” invocations of `WarpReduce`.
- Use `__reduce_add_sync` hardware acceleration for `WarpReduce` on supported architectures.
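A sketch of the newly supported single-thread logical warp, assuming CUB 2.0 headers; the kernel and sizes are illustrative, not a CUB example:

```cuda
#include <cub/warp/warp_reduce.cuh>

// LOGICAL_WARP_THREADS == 1 is now handled correctly; on supported
// architectures integer sums may use __reduce_add_sync internally.
__global__ void per_thread_reduce(int* out)
{
  using WarpReduce = cub::WarpReduce<int, /*LOGICAL_WARP_THREADS=*/1>;
  __shared__ typename WarpReduce::TempStorage storage[32]; // one per "warp"
  int sum = WarpReduce(storage[threadIdx.x]).Sum(threadIdx.x);
  out[threadIdx.x] = sum; // each single-thread warp reduces only itself
}
```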
## Bug Fixes

- Fixed `DeviceRadixSort` when `begin_bit == end_bit`.
- Fixes for `DeviceHistogram::Even` for a variety of edge cases:
  - Bin ids are now correctly computed when mixing different types for `SampleT` and `LevelT`.
  - Bin ids are now correctly computed when `LevelT` is an integral type and the number of levels does not evenly divide the level range.
- Ensure that `temp_storage_bytes` is properly set in the `AdjacentDifferenceCopy` device algorithms.
- Additional fixes for the `AdjacentDifferenceCopy` device algorithms.

## Other Enhancements
- `NV_IF_TARGET` ports (#448): Removed special-case code for unsupported CUDA architectures.
- `NV_IF_TARGET` ports (#448): Replaced several usages of `__CUDA_ARCH__` with `<nv/target>` to handle host/device code divergence.
- `NV_IF_TARGET` ports (#448): Marked unused PTX arch parameters as legacy.
- Added `CUB_DISABLE_BF16_SUPPORT` to avoid including the `cuda_bf16.h` header or using the `__nv_bfloat16` type.
- Updated documentation, clarifying in-place support for the:
  - `DeviceScan` algorithms.
  - `DeviceHistogram` algorithms.
  - `DevicePartition` algorithms.
  - `Device*Sort` algorithms.
  - `DeviceReduce` algorithms.
  - `DeviceRunLengthEncode` algorithms.
  - `DeviceSelect` algorithms.
- Updated `WarpMergeSort` documentation.

This discussion was created from the release CUB 2.0.0.