
CCCL 2.7.0

@wmaxey wmaxey released this 06 Jan 22:12
· 629 commits to main since this release
v2.7.0
b5fe509

What’s New

C++

Thrust / CUB

  • Inclusive scan now supports initial value #1940
  • Inclusive and exclusive scan now support problem sizes exceeding 2^31 elements #2171
  • New cub::DeviceMerge::MergeKeys and cub::DeviceMerge::MergePairs algorithms #1817
  • New thrust::tabulate_output_iterator fancy iterator #2282
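
To make the first item concrete, here is a minimal host-only sketch of what an inclusive scan with an initial value computes. It uses the C++17 standard library's std::inclusive_scan rather than Thrust or CUB, on the assumption that the new overloads follow the same semantics (the initial value is folded into every output, including the first). The exact Thrust/CUB signatures are not shown in these notes; see #1940 for those. The helper name scan_with_init is hypothetical.

```cpp
#include <cassert>
#include <functional>
#include <numeric>
#include <vector>

// Host-only illustration with the standard library, not Thrust itself.
// scan_with_init is a hypothetical helper name used only for this sketch.
std::vector<int> scan_with_init(const std::vector<int>& in, int init) {
    std::vector<int> out(in.size());
    // out[i] = init + in[0] + ... + in[i]
    std::inclusive_scan(in.begin(), in.end(), out.begin(), std::plus<>{}, init);
    return out;
}
```

With the input {1, 2, 3, 4} and an initial value of 10, every prefix sum is shifted by 10. This differs from an exclusive scan, where the first output would be the initial value alone.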

Libcudacxx

  • Assertions can now be enabled on host and device, depending on the user's choice
  • C++26 inplace_vector has been implemented and backported to C++14
  • Improved support for the extended floating-point types __half and __nv_bfloat16 in both cmath functions and complex
  • cuda::std::tuple is now trivially copyable if the stored types are trivially copyable
  • Reworked our atomics implementation
  • Improved <cuda/std/bit> conformance
  • Implemented <cuda/std/bitset> and backported to C++14
  • Implemented and backported C++20 bit_cast. It is available in all standard modes and is constexpr where the compiler supports it
  • Various backports and constexpr improvements (bool_constant, cuda::std::max)
  • Moved the experimental memory resources from <cuda/memory_resource> into <cuda/experimental/memory_resource.cuh>
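
As background for the bit_cast item, the sketch below shows the classic memcpy-based formulation that makes a backport to older standard modes possible. It is not the libcu++ implementation, and the name bit_cast_sketch is hypothetical. A memcpy-based version cannot be constexpr, which is why constexpr evaluation generally depends on a compiler builtin. The example assumes a platform with 32-bit IEEE-754 float.

```cpp
#include <cassert>
#include <cstdint>
#include <cstring>
#include <type_traits>

// Hypothetical illustration of a pre-C++20 bit_cast; not the libcu++ source.
template <class To, class From>
To bit_cast_sketch(const From& from) noexcept {
    static_assert(sizeof(To) == sizeof(From), "source and destination sizes must match");
    static_assert(std::is_trivially_copyable<To>::value, "To must be trivially copyable");
    static_assert(std::is_trivially_copyable<From>::value, "From must be trivially copyable");
    To to;
    // Copy the object representation byte for byte; this avoids the
    // undefined behavior of type punning through reinterpret_cast.
    std::memcpy(&to, &from, sizeof(To));
    return to;
}

// Convenience wrapper: view the bits of a float as an unsigned integer.
inline std::uint32_t float_bits(float f) {
    return bit_cast_sketch<std::uint32_t>(f);
}
```

Calling float_bits(1.0f) yields 0x3F800000 on IEEE-754 platforms, and casting that value back with bit_cast_sketch<float> recovers 1.0f.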

Python

cuda.cooperative

The CCCL approach to making your CUDA kernels easier to write and faster to execute is now available in Python through the cuda.cooperative module. This module currently supports block- and warp-level algorithms within numba.cuda kernels, offering speed-of-light reductions, prefix sums, radix sort, and merge sort. You can customize cuda.cooperative algorithms with user-defined data types and operators, implemented directly in Python.

  • Block- and warp-level cooperative algorithms are now available in Python #1973
  • Experimental versions of reduce, scan, merge sort, and radix sort are available in numba.cuda kernels

cuda.parallel

Apart from device-side cooperative algorithms, CCCL 2.7 provides an experimental version of host-side parallel algorithms as part of the cuda.parallel module. This release includes parallel reduction.

What's Changed

New Contributors

Full Changelog: v2.6.1...v2.7.0