
Thrust 1.16.0

@alliepiper released this 08 Feb 19:35

Summary

Thrust 1.16.0 provides a new “nosync” hint for the CUDA backend, as well as numerous bugfixes and stability improvements.

New thrust::cuda::par_nosync Execution Policy

Most of Thrust’s parallel algorithms are fully synchronous and will block the calling CPU thread until all work is completed. This design avoids many pitfalls associated with asynchronous GPU programming, resulting in simpler and less error-prone usage for new CUDA developers. Unfortunately, this improvement in user experience comes at a performance cost that often frustrates more experienced CUDA programmers.

Prior to this release, the only synchronous-to-asynchronous migration path for existing Thrust codebases involved significant refactoring: replacing calls to Thrust algorithms with a limited set of future-based thrust::async algorithms or with lower-level CUB kernels. The new thrust::cuda::par_nosync execution policy provides a less invasive entry point for asynchronous computation.

par_nosync is a hint to the Thrust execution engine that any non-essential internal synchronizations should be skipped and that an explicit synchronization will be performed by the caller before accessing results.

While some Thrust algorithms require internal synchronization to safely compute their results, many do not. For example, multiple thrust::for_each invocations can be launched without waiting for earlier calls to complete:

// Queue three `for_each` kernels:
thrust::for_each(thrust::cuda::par_nosync, vec1.begin(), vec1.end(), Op{});
thrust::for_each(thrust::cuda::par_nosync, vec2.begin(), vec2.end(), Op{});
thrust::for_each(thrust::cuda::par_nosync, vec3.begin(), vec3.end(), Op{});

// Do other work while kernels execute:
do_something();

// Must explicitly synchronize before accessing `for_each` results:
cudaDeviceSynchronize();
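par_nosync also composes with Thrust’s existing stream support via .on(...), so the caller can synchronize a single stream instead of the whole device. A minimal sketch (the stream setup and the do_something call are illustrative, not part of this release):

```cuda
// Sketch: run `for_each` asynchronously on a user-provided stream.
cudaStream_t stream;
cudaStreamCreate(&stream);

// `.on(stream)` attaches the policy to the stream; `par_nosync`
// additionally skips the non-essential trailing synchronization.
thrust::for_each(thrust::cuda::par_nosync.on(stream),
                 vec1.begin(), vec1.end(), Op{});

do_something();                 // Overlap host work with the kernel.

cudaStreamSynchronize(stream);  // Wait only for this stream.
cudaStreamDestroy(stream);
```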

Thanks to @fkallen for this contribution.

Deprecation Notices

CUDA Dynamic Parallelism Support

A future version of Thrust will remove support for CUDA Dynamic Parallelism (CDP).

This will only affect calls to Thrust algorithms made from CUDA device-side code that currently launch a kernel; such calls will instead execute sequentially on the calling GPU thread.
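For illustration (the kernel and names here are hypothetical), a device-side call such as the following is affected:

```cuda
__global__ void device_sort(int* data, int n)
{
  // With CDP support, this call may launch a device-wide child kernel.
  // Once CDP support is removed, it will instead run sequentially on
  // this GPU thread, as if thrust::seq had been passed.
  thrust::sort(thrust::device, data, data + n);
}
```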

Breaking Changes

  • Thrust 1.14.0 included a change that aliased the cub namespace to thrust::cub. This has caused issues with ambiguous namespaces for projects that declare using namespace thrust; from the global namespace. We recommend against this practice.
  • #1572: Removed several unnecessary header includes. Downstream projects may need to update their includes if they were relying on this behavior.
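The namespace issue in the first item above can be sketched as follows (illustrative only; the exact symbols involved depend on the project):

```cpp
#include <thrust/version.h>  // Any Thrust header; since 1.14 it exposes thrust::cub.

using namespace thrust;      // Discouraged at global scope.

// With the directive above, an unqualified `cub` is ambiguous between
// ::cub and thrust::cub, so code like the following fails to compile:
//   cub::DeviceReduce::Sum(/* ... */);  // error: 'cub' is ambiguous
// Fully qualifying as ::cub::DeviceReduce::Sum(...) disambiguates.
```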

New Features

  • #1568: Add thrust::cuda::par_nosync policy. Thanks to @fkallen for this contribution.

Enhancements

  • #1511: Use CUB’s new DeviceMergeSort API and remove Thrust’s internal implementation.
  • #1566: Improved performance of thrust::shuffle. Thanks to @djns99 for this contribution.
  • #1584: Support user-defined CMAKE_INSTALL_INCLUDEDIR values in Thrust’s CMake install rules. Thanks to @robertmaynard for this contribution.

Bug Fixes

  • #1496: Fix some issues affecting icc builds.
  • #1552: Fix some collisions with the min/max macros defined in windows.h.
  • #1582: Fix issue with function type alias on 32-bit MSVC builds.
  • #1591: Workaround issue affecting compilation with nvc++.
  • #1597: Fix some collisions with the small macro defined in windows.h.
  • #1599, #1603: Fix some issues with version handling in Thrust’s CMake packages.
  • #1614: Clarify that scan algorithm results are non-deterministic for pseudo-associative operators (e.g. floating-point addition).