Releases: NVIDIA/thrust
Thrust 1.9.7-1 (CUDA Toolkit 10.2 for Tegra)
Thrust 1.9.7-1 is a minor release accompanying the CUDA Toolkit 10.2 release for Tegra. It is nearly identical to 1.9.7.
Bug Fixes
- Remove support for GCC's broken nodiscard-like attribute.
Thrust 1.9.7 (CUDA Toolkit 10.2)
Thrust 1.9.7 is a minor release accompanying the CUDA Toolkit 10.2 release. Unfortunately, although the version and patch numbers are identical, one bug fix present in Thrust 1.9.7 (NVBug 2646034: Fix incorrect dependency handling for stream acquisition in `thrust::future`) was not included in the CUDA Toolkit 10.2 preview release for AArch64 SBSA. The tag `cuda-10.2aarch64sbsa` contains the exact version of Thrust present in the CUDA Toolkit 10.2 preview release for AArch64 SBSA.
Bug Fixes
- #967, NVBug 2448170: Fix the CUDA backend `thrust::for_each` so that it supports large input sizes with 64-bit indices.
- NVBug 2646034: Fix incorrect dependency handling for stream acquisition in `thrust::future`.
  - Not present in the CUDA Toolkit 10.2 preview release for AArch64 SBSA.
- #968, NVBug 2612102: Fix the `thrust::mr::polymorphic_adaptor` to actually use its template parameter.
Thrust 1.9.6-1 (NVIDIA HPC SDK 20.3)
Thrust 1.9.6-1 is a variant of 1.9.6 accompanying the NVIDIA HPC SDK 20.3 release. It contains modifications necessary to serve as the implementation of NVC++'s GPU-accelerated C++17 Parallel Algorithms when using the CUDA Toolkit 10.1 Update 2 release.
Thrust 1.9.6 (CUDA Toolkit 10.1 Update 2)
Thrust 1.9.6 is a minor release accompanying the CUDA Toolkit 10.1 Update 2 release.
Bug Fixes
- NVBug 2509847: Inconsistent alignment of `thrust::complex`.
- NVBug 2586774: Compilation failure with Clang + older libstdc++ that doesn't have `std::is_trivially_copyable`.
- NVBug 200488234: CUDA header files contain Unicode characters, which leads to compilation errors on Windows.
- #949, #973, NVBug 2422333, NVBug 2522259, NVBug 2528822: `thrust::detail::aligned_reinterpret_cast` must be annotated with `__host__ __device__`.
- NVBug 2599629: Missing include in the OpenMP sort implementation.
- NVBug 200513211: Truncation warning in test code under VC142.
Thrust 1.9.5 (CUDA Toolkit 10.1 Update 1)
Thrust 1.9.5 is a minor bugfix release accompanying the CUDA Toolkit 10.1 Update 1 release.
Bug Fixes
- NVBug 2502854: Assignment of complex vector between host and device fails to compile in CUDA >=9.1 with GCC 6.
Thrust 1.9.4 (CUDA Toolkit 10.1)
Thrust 1.9.4 adds asynchronous interfaces for parallel algorithms, a new allocator system including caching allocators and unified memory support, as well as a variety of other enhancements, mostly related to C++11/C++14/C++17/C++20 support. The new asynchronous algorithms in the `thrust::async` namespace return `thrust::event` or `thrust::future` objects, which can be waited upon to synchronize with the completion of the parallel operation.
Breaking API Changes
Synchronous Thrust algorithms now block until all of their operations have completed. Use the new asynchronous Thrust algorithms for non-blocking behavior.
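The difference between a blocking call and a call that returns a waitable handle can be sketched with C++11's `std::async` and `std::future`, which the notes below cite as the loose model for `thrust::future`. This is a standard-library analogy that runs on host threads, not Thrust code; the helper names `blocking_sum`, `async_sum`, and `demo_async_matches_blocking` are hypothetical.

```cpp
#include <cassert>
#include <future>
#include <numeric>
#include <vector>

// Blocking form: the caller cannot proceed until the sum is complete.
inline long blocking_sum(const std::vector<int>& v) {
  return std::accumulate(v.begin(), v.end(), 0L);
}

// Non-blocking form: returns a handle immediately while the work runs on
// another thread. The caller must keep `v` alive until the future is ready.
inline std::future<long> async_sum(const std::vector<int>& v) {
  return std::async(std::launch::async, [&v] {
    return std::accumulate(v.begin(), v.end(), 0L);
  });
}

inline bool demo_async_matches_blocking() {
  const std::vector<int> v(1000, 2);
  std::future<long> pending = async_sum(v);  // launch, do not wait yet
  const long expected = blocking_sum(v);     // unrelated work can overlap here
  pending.wait();                            // synchronize with completion
  return pending.get() == expected;          // retrieve the value
}
```

Code that relied on a synchronous algorithm returning early would, under this analogy, move from the `blocking_sum` shape to the `async_sum` shape and wait explicitly where the result is needed.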
New Features
- `thrust::event` and `thrust::future<T>`, uniquely-owned asynchronous handles consisting of a state (ready or not ready), content (some value; for `thrust::future` only), and an optional set of objects that should be destroyed only when the future's value is ready and has been consumed.
  - The design is loosely based on C++11's `std::future`.
  - They can be `.wait`'d on, and the value of a future can be waited on and retrieved with `.get` or `.extract`.
  - Multiple `thrust::event`s and `thrust::future`s can be combined with `thrust::when_all`.
  - `thrust::future`s can be converted to `thrust::event`s.
  - Currently, these primitives are only implemented for the CUDA backend and are C++11 only.
- New asynchronous algorithms that return `thrust::event`/`thrust::future`s, implemented as C++20 range style customization points:
  - `thrust::async::reduce`.
  - `thrust::async::reduce_into`, which takes a target location to store the reduction result into.
  - `thrust::async::copy`, including a two-policy overload that allows explicit cross system copies which execution policy properties can be attached to.
  - `thrust::async::transform`.
  - `thrust::async::for_each`.
  - `thrust::async::stable_sort`.
  - `thrust::async::sort`.
  - By default the asynchronous algorithms use the new caching allocators. Deallocation of temporary storage is deferred until the destruction of the returned `thrust::future`. The content of `thrust::future`s is stored in either device or universal memory and transferred to the host only upon request to prevent unnecessary data migration.
  - Asynchronous algorithms are currently only implemented for the CUDA system and are C++11 only.
- `exec.after(f, g, ...)`, a new execution policy method that takes a set of `thrust::event`/`thrust::future`s and returns an execution policy; operations launched with that policy depend upon the given events and futures.
- New logic and mindset for the type requirements for cross-system sequence copies (currently only used by `thrust::async::copy`), based on:
  - `thrust::is_contiguous_iterator` and `THRUST_PROCLAIM_CONTIGUOUS_ITERATOR` for detecting/indicating that an iterator points to contiguous storage.
  - `thrust::is_trivially_relocatable` and `THRUST_PROCLAIM_TRIVIALLY_RELOCATABLE` for detecting/indicating that a type is `memcpy`able (based on principles from https://wg21.link/P1144).
  - The new approach reduces buffering, increases performance, and increases correctness.
  - The fast path is now enabled when copying fp16 and CUDA vector types with `thrust::async::copy`.
- All Thrust synchronous algorithms for the CUDA backend now actually synchronize. Previously, any algorithm that did not allocate temporary storage (counterexample: `thrust::sort`) and did not have a computation-dependent result (counterexample: `thrust::reduce`) would actually be launched asynchronously. Additionally, synchronous algorithms that allocated temporary storage would become asynchronous if a custom allocator was supplied that did not synchronize on allocation/deallocation, unlike `cudaMalloc`/`cudaFree`. So, now `thrust::for_each`, `thrust::transform`, `thrust::sort`, etc. are truly synchronous. In some cases this may be a performance regression; if you need asynchrony, use the new asynchronous algorithms.
- Thrust's allocator framework has been rewritten. It now uses a memory resource system, similar to C++17's `std::pmr` but supporting static polymorphism. Memory resources are objects that allocate untyped storage and allocators are cheap handles to memory resources in this new model. The new facilities live in `<thrust/mr/*>`.
  - `thrust::mr::memory_resource<Pointer>`, the memory resource base class, which takes a (possibly tagged) pointer to `void` type as a parameter.
  - `thrust::mr::allocator<T, MemoryResource>`, an allocator backed by a memory resource object.
  - `thrust::mr::polymorphic_adaptor_resource<Pointer>`, a type-erased memory resource adaptor.
  - `thrust::mr::polymorphic_allocator<T>`, a C++17-style polymorphic allocator backed by a type-erased memory resource object.
  - New tunable C++17-style caching memory resources, `thrust::mr::(disjoint_)?(un)?synchronized_pool_resource`, designed to cache both small object allocations and large repetitive temporary allocations. The disjoint variants use separate storage for management of the pool, which is necessary if the memory being allocated cannot be accessed on the host (e.g. device memory).
  - System-specific allocators were rewritten to use the new memory resource framework.
  - New `thrust::device_memory_resource` for allocating device memory.
  - New `thrust::universal_memory_resource` for allocating memory that can be accessed from both the host and device (e.g. `cudaMallocManaged`).
  - New `thrust::universal_host_pinned_memory_resource` for allocating memory that can be accessed from the host and the device but always resides in host memory (e.g. `cudaMallocHost`).
  - `thrust::get_per_device_resource` and `thrust::per_device_allocator`, which lazily create and retrieve a per-device singleton memory resource.
  - Rebinding mechanisms (`rebind_traits` and `rebind_alloc`) for `thrust::allocator_traits`.
  - `thrust::device_make_unique`, a factory function for creating a `std::unique_ptr` to a newly allocated object in device memory.
  - `<thrust/detail/memory_algorithms>`, a C++11 implementation of the C++17 uninitialized memory algorithms.
  - `thrust::allocate_unique` and friends, based on the proposed C++23 `std::allocate_unique` (https://wg21.link/P0211).
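The split between memory resources (which allocate untyped storage) and allocators (cheap handles to a resource) can be illustrated with C++17's `std::pmr`, the model these notes compare the new framework to. The sketch below uses only the standard library and runtime polymorphism; it does not show Thrust's statically polymorphic `thrust::mr` classes, and `counting_resource` and `demo_counting_resource` are hypothetical names.

```cpp
#include <cassert>
#include <cstddef>
#include <memory_resource>
#include <vector>

// A memory resource hands out untyped storage. This one forwards to an
// upstream resource and records the traffic it sees.
class counting_resource : public std::pmr::memory_resource {
public:
  explicit counting_resource(
      std::pmr::memory_resource* upstream = std::pmr::new_delete_resource())
      : upstream_(upstream) {}

  std::size_t allocations() const { return allocations_; }
  std::size_t bytes_outstanding() const { return bytes_outstanding_; }

private:
  void* do_allocate(std::size_t bytes, std::size_t alignment) override {
    void* p = upstream_->allocate(bytes, alignment);
    ++allocations_;
    bytes_outstanding_ += bytes;
    return p;
  }

  void do_deallocate(void* p, std::size_t bytes, std::size_t alignment) override {
    upstream_->deallocate(p, bytes, alignment);
    bytes_outstanding_ -= bytes;
  }

  bool do_is_equal(const std::pmr::memory_resource& other) const noexcept override {
    return this == &other;
  }

  std::pmr::memory_resource* upstream_;
  std::size_t allocations_ = 0;
  std::size_t bytes_outstanding_ = 0;
};

inline bool demo_counting_resource() {
  counting_resource resource;
  {
    // The allocator stored in the vector is just a pointer to `resource`.
    std::pmr::vector<int> v(&resource);
    v.resize(256);
    if (resource.allocations() == 0) return false;
    if (resource.bytes_outstanding() < 256 * sizeof(int)) return false;
  }
  // Destroying the vector returns all storage to the resource.
  return resource.bytes_outstanding() == 0;
}
```

Because the allocator is only a handle, many containers can share one resource, which is the property a caching pool resource relies on.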
- New type traits and metaprogramming facilities. Type traits are slowly being migrated out of `thrust::detail::` and `<thrust/detail/*>`; their new home will be `thrust::` and `<thrust/type_traits/*>`.
  - `thrust::is_execution_policy`.
  - `thrust::is_operator_less_or_greater_function_object`, which detects `thrust::less`, `thrust::greater`, `std::less`, and `std::greater`.
  - `thrust::is_operator_plus_function_object`, which detects `thrust::plus` and `std::plus`.
  - `thrust::remove_cvref(_t)?`, a C++11 implementation of C++20's `std::remove_cvref(_t)?`.
  - `thrust::void_t`, and various other new type traits.
  - `thrust::integer_sequence` and friends, a C++11 implementation of C++14's `std::integer_sequence`.
  - `thrust::conjunction`, `thrust::disjunction`, and `thrust::negation`, a C++11 implementation of C++17's logical metafunctions.
  - Some Thrust type traits (such as `thrust::is_constructible`) have been redefined in terms of C++11's type traits when they are available.
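As an illustration of what a C++11 backport of these later-standard traits can look like, here is a minimal sketch of `remove_cvref`, `void_t`, and `conjunction`. These follow the well-known reference formulations of the standard traits and are placed in a hypothetical `sketch` namespace; they are not copied from Thrust's headers, and `has_value_type` and `with_value_type` are made up for the demonstration.

```cpp
#include <cassert>
#include <type_traits>

namespace sketch {

// remove_cvref: strip the reference first, then top-level const/volatile.
template <class T>
struct remove_cvref {
  using type =
      typename std::remove_cv<typename std::remove_reference<T>::type>::type;
};
template <class T>
using remove_cvref_t = typename remove_cvref<T>::type;

// void_t: maps any well-formed list of types to void, for SFINAE detection.
// The helper struct keeps unused pack arguments from being ignored in C++11.
template <class...>
struct make_void { using type = void; };
template <class... Ts>
using void_t = typename make_void<Ts...>::type;

// conjunction: short-circuiting logical AND over a list of traits.
template <class...>
struct conjunction : std::true_type {};
template <class B>
struct conjunction<B> : B {};
template <class B, class... Bs>
struct conjunction<B, Bs...>
    : std::conditional<bool(B::value), conjunction<Bs...>, B>::type {};

// Detection built from void_t: does T declare a nested value_type?
template <class T, class = void>
struct has_value_type : std::false_type {};
template <class T>
struct has_value_type<T, void_t<typename T::value_type>> : std::true_type {};

}  // namespace sketch

struct with_value_type { using value_type = int; };

static_assert(
    std::is_same<sketch::remove_cvref_t<const volatile int&>, int>::value,
    "reference and cv-qualifiers are removed");
static_assert(sketch::has_value_type<with_value_type>::value, "detected");
static_assert(!sketch::has_value_type<int>::value, "not detected");
```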
- `<thrust/detail/tuple_algorithms.h>`, new `std::tuple` algorithms:
  - `thrust::tuple_transform`.
  - `thrust::tuple_for_each`.
  - `thrust::tuple_subset`.
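To show the general idea behind a for-each over a `std::tuple`, the following C++14 sketch visits each element in order using `std::index_sequence`. The signature, argument order, and return value here are assumptions chosen for the example and are not taken from Thrust's header; `accumulate_as_double` and `demo_tuple_sum` are likewise hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <tuple>
#include <utility>

namespace sketch {

template <class Tuple, class F, std::size_t... Is>
void tuple_for_each_impl(Tuple&& t, F& f, std::index_sequence<Is...>) {
  // A braced initializer list is evaluated left to right, so the elements
  // are visited in index order.
  int unused[] = {0, ((void)f(std::get<Is>(std::forward<Tuple>(t))), 0)...};
  (void)unused;
}

// Calls f on every element of t and returns f so callers can read its state.
template <class Tuple, class F>
F tuple_for_each(Tuple&& t, F f) {
  constexpr std::size_t n =
      std::tuple_size<typename std::decay<Tuple>::type>::value;
  tuple_for_each_impl(std::forward<Tuple>(t), f,
                      std::make_index_sequence<n>{});
  return f;
}

}  // namespace sketch

// A visitor that accepts any arithmetic element type.
struct accumulate_as_double {
  double total = 0.0;
  template <class T>
  void operator()(const T& x) { total += static_cast<double>(x); }
};

inline double demo_tuple_sum() {
  return sketch::tuple_for_each(std::make_tuple(1, 2.5, 4L),
                                accumulate_as_double{}).total;
}
```

A transform or subset algorithm can be built the same way by expanding the index pack into `std::make_tuple(...)` instead of a sequence of calls.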
- Miscellaneous new `std::`-like facilities:
  - `thrust::optional`, a C++11 implementation of C++17's `std::optional`.
  - `thrust::addressof`, an implementation of C++11's `std::addressof`.
  - `thrust::next` and `thrust::prev`, an implementation of C++11's `std::next` and `std::prev`.
  - `thrust::square`, a `<functional>` style unary function object that multiplies its argument by itself.
  - `<thrust/limits.h>` and `thrust::numeric_limits`, a customized version of `<limits>` and `std::numeric_limits`.
- `<thrust/detail/preprocessor.h>`, new general purpose preprocessor facilities:
  - `THRUST_PP_CAT[2-5]`, concatenates two to five tokens.
  - `THRUST_PP_EXPAND(_ARGS)?`, performs double expansion.
  - `THRUST_PP_ARITY` and `THRUST_PP_DISPATCH`, tools for macro overloading.
  - `THRUST_PP_BOOL`, boolean conversion.
  - `THRUST_PP_INC` and `THRUST_PP_DEC`, increment/decrement.
  - `THRUST_PP_HEAD`, a variadic macro that expands to the first argument.
  - `THRUST_PP_TAIL`, a variadic macro that expands to all its arguments after the first.
  - `THRUST_PP_IIF`, bitwise conditional.
  - `THRUST_PP_COMMA_IF` and `THRUST_PP_HAS_COMMA`, facilities for adding and detecting comma tokens.
  - `THRUST_PP_IS_VARIADIC_NULLARY`, returns true if called with a nullary `__VA_ARGS__`.
  - `THRUST_CURRENT_FUNCTION`, expands to the name of the current function.
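Macro overloading by arity, the technique that the concatenation, arity, and dispatch macros above support, can be sketched as follows. The `SKETCH_*` macros are hypothetical stand-ins written for this example; they handle one to three arguments only and are not Thrust's definitions. The sketch assumes a conforming preprocessor such as the ones in GCC and Clang.

```cpp
#include <cassert>

// Two-level concatenation so that arguments are expanded before pasting.
#define SKETCH_PP_CAT2_IMPL(a, b) a##b
#define SKETCH_PP_CAT2(a, b) SKETCH_PP_CAT2_IMPL(a, b)

// Counts 1 to 3 arguments: the arguments shift the trailing number list so
// that the count lands in the N slot.
#define SKETCH_PP_ARITY_IMPL(_1, _2, _3, N, ...) N
#define SKETCH_PP_ARITY(...) SKETCH_PP_ARITY_IMPL(__VA_ARGS__, 3, 2, 1, 0)

// Selects name1, name2, or name3 based on the number of arguments.
#define SKETCH_PP_DISPATCH(name, ...) \
  SKETCH_PP_CAT2(name, SKETCH_PP_ARITY(__VA_ARGS__))(__VA_ARGS__)

// An "overloaded" macro built on the dispatcher.
#define SKETCH_SUM1(a) (a)
#define SKETCH_SUM2(a, b) ((a) + (b))
#define SKETCH_SUM3(a, b, c) ((a) + (b) + (c))
#define SKETCH_SUM(...) SKETCH_PP_DISPATCH(SKETCH_SUM, __VA_ARGS__)

static_assert(SKETCH_PP_ARITY(x, y) == 2, "two arguments counted");
static_assert(SKETCH_SUM(1, 2, 3) == 6, "dispatches to SKETCH_SUM3");
```

The extra `SKETCH_PP_CAT2_IMPL` layer is what lets `SKETCH_PP_ARITY(...)` expand to a number before the `##` operator pastes it onto the macro name.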
- New C++11 compatibility macros:
  - `THRUST_NODISCARD`, expands to `[[nodiscard]]` when available and the best equivalent otherwise.
  - `THRUST_CONSTEXPR`, expands to `constexpr` when available and the best equivalent otherwise.
  - `THRUST_OVERRIDE`, expands to `override` when available and the best equivalent otherwise.
  - `THRUST_DEFAULT`, expands to `= default;` when available and the best equivalent otherwise.
  - `THRUST_NOEXCEPT`, expands to `noexcept` when available and the best equivalent otherwise.
  - `THRUST_FINAL`, expands to `final` when available and the best equivalent otherwise.
  - `THRUST_INLINE_CONSTANT`, expands to `inline constexpr` when available and the best equivalent otherwise.
- `<thrust/detail/type_deduction.h>`, new C++11-only type deduction helpers:
  - `THRUST_DECLTYPE_RETURNS*`, expand to function definitions with suitable conditional `noexcept` qualifiers and trailing return types.
  - `THRUST_FWD(x)`, expands to `::std::forward<decltype(x)>(x)`.
  - `THRUST_MVCAP`, expands to a lambda move capture.
  - `THRUST_RETOF`, expands to a decltype computing the return type of an invocable.
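The forwarding and return-type helpers described above can be illustrated with two small macros. `SKETCH_FWD` uses the expansion the notes give for `THRUST_FWD`; `SKETCH_DECLTYPE_RETURNS` is only one plausible shape for a macro that produces a conditional `noexcept`, a trailing return type, and a body from a single expression, and is an assumption rather than Thrust's definition. `call_with` and `length_of` are hypothetical.

```cpp
#include <cassert>
#include <cstddef>
#include <string>
#include <utility>

// Perfect-forward a named parameter without spelling out its type.
#define SKETCH_FWD(x) ::std::forward<decltype(x)>(x)

// Write the expression once; derive noexcept, return type, and body from it.
#define SKETCH_DECLTYPE_RETURNS(...)                     \
  noexcept(noexcept(__VA_ARGS__))->decltype(__VA_ARGS__) \
  { return (__VA_ARGS__); }

// Invokes f on t, preserving value categories and the callee's noexcept-ness.
template <class F, class T>
auto call_with(F&& f, T&& t)
  SKETCH_DECLTYPE_RETURNS(SKETCH_FWD(f)(SKETCH_FWD(t)))

struct length_of {
  std::size_t operator()(const std::string& s) const noexcept {
    return s.size();
  }
};

inline std::size_t demo_call_with() {
  return call_with(length_of{}, std::string("abc"));
}
```

Deriving all three parts from one expression avoids the usual C++11 problem of repeating it in the `noexcept` clause, the `decltype`, and the `return` statement, where the copies can drift apart.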
New ...
Thrust 1.9.3 (CUDA Toolkit 10.0)
Thrust 1.9.3 unifies and integrates CUDA Thrust and GitHub Thrust.
Bug Fixes
- #725, #850, #855, #859, #860: Unify the `thrust::iter_swap` interface and fix `thrust::device_reference` swapping.
- NVBug 2004663: Add a `data` method to `thrust::detail::temporary_array` and refactor temporary memory allocation in the CUDA backend to be exception and leak safe.
- #886, #894, #914: Various documentation typo fixes.
- #724: Provide `NVVMIR_LIBRARY_DIR` environment variable to NVCC.
- #878: Optimize `thrust::min/max_element` to only use `thrust::detail::get_iterator_value` for non-numeric types.
- #899: Make `thrust::cuda::experimental::pinned_allocator`'s comparison operators `const`.
- NVBug 2092152: Remove all includes of `<cuda.h>`.
- #911: Fix default comparator element type for `thrust::merge_by_key`.
Acknowledgments
- Thanks to Andrew Corrigan for contributing fixes for swapping interfaces.
- Thanks to Francisco Facioni for contributing optimizations for `thrust::min/max_element`.
Thrust 1.9.2 (CUDA Toolkit 9.2)
Thrust 1.9.2 brings a variety of performance enhancements, bug fixes and test improvements. CUB 1.7.5 was integrated, enhancing the performance of `thrust::sort` on small data types and `thrust::reduce`. Changes were applied to `complex` to optimize memory access. Thrust now compiles with compiler warnings enabled and treated as errors. Additionally, the unit test suite and framework were enhanced to increase coverage.
Breaking Changes
- The `fallback_allocator` example was removed, as it was buggy and difficult to support.
New Features
- `<thrust/detail/alignment.h>`, utilities for memory alignment:
  - `thrust::aligned_reinterpret_cast`.
  - `thrust::aligned_storage_size`, which computes the amount of storage needed for an object of a particular size and alignment.
  - `thrust::alignment_of`, a C++03 implementation of C++11's `std::alignment_of`.
  - `thrust::aligned_storage`, a C++03 implementation of C++11's `std::aligned_storage`.
  - `thrust::max_align_t`, a C++03 implementation of C++11's `std::max_align_t`.
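One common way to compute the storage needed for an object of a given size and alignment is to round the size up to the next multiple of the alignment. The sketch below shows that arithmetic under the assumption that this is the intended meaning; it is a plain `constexpr` function in a hypothetical `sketch` namespace, not Thrust's metafunction, and it requires a nonzero alignment.

```cpp
#include <cassert>
#include <cstddef>

namespace sketch {

// Rounds `size` up to the next multiple of `alignment` (must be nonzero).
constexpr std::size_t aligned_storage_size(std::size_t size,
                                           std::size_t alignment) {
  return ((size + alignment - 1) / alignment) * alignment;
}

}  // namespace sketch

static_assert(sketch::aligned_storage_size(5, 4) == 8, "rounded up");
static_assert(sketch::aligned_storage_size(8, 4) == 8, "already a multiple");
static_assert(sketch::aligned_storage_size(1, 16) == 16, "at least one unit");
```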
Bug Fixes
- NVBug 200385527, NVBug 200385119, NVBug 200385113, NVBug 200349350, NVBug 2058778: Various compiler warning issues.
- NVBug 200355591: `thrust::reduce` performance issues.
- NVBug 2053727: Fixed an ADL bug that caused user-supplied `allocate` to be overlooked but `deallocate` to be called with GCC <= 4.3.
- NVBug 1777043: Fixed `thrust::complex` to work with `thrust::sequence`.
Thrust 1.9.1-2 (CUDA Toolkit 9.1)
Thrust 1.9.1 integrates version 1.7.4 of CUB and introduces a new CUDA backend for `thrust::reduce` based on CUB.
Bug Fixes
- NVBug 1965743: Remove unnecessary static qualifiers.
- NVBug 1940974: Fix regression causing a compilation error when using `thrust::merge_by_key` with `thrust::constant_iterator`s.
- NVBug 1904217: Allow callables that take non-const refs to be used with `thrust::reduce` and `thrust::*_scan`.
Thrust 1.9.0-5 (CUDA Toolkit 9.0)
Thrust 1.9.0 replaces the original CUDA backend (bulk) with a new one written using CUB, a high performance CUDA collectives library. This brings a substantial performance improvement to the CUDA backend across the board.
Breaking Changes
- Any code depending on CUDA backend implementation details will likely be broken.
New Features
- New CUDA backend based on CUB which delivers substantially higher performance.
- `thrust::transform_output_iterator`, a fancy iterator that applies a function to the output before storing the result.
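The idea of an output iterator that applies a function before storing can be sketched in standard C++ as below. This is a host-only illustration with hypothetical names (`transforming_output`, `times_ten`, `demo_transforming_output`); it is not the implementation or interface of `thrust::transform_output_iterator`.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <iterator>
#include <vector>

namespace sketch {

template <class F, class OutIt>
class transforming_output {
  // Assignment target returned by operator*: applies f, then stores.
  struct proxy {
    OutIt out;
    F f;
    template <class T>
    void operator=(const T& value) { *out = f(value); }
  };

public:
  using iterator_category = std::output_iterator_tag;
  using value_type = void;
  using difference_type = std::ptrdiff_t;
  using pointer = void;
  using reference = void;

  transforming_output(OutIt out, F f) : out_(out), f_(f) {}

  proxy operator*() const { return proxy{out_, f_}; }
  transforming_output& operator++() { ++out_; return *this; }
  transforming_output operator++(int) {
    transforming_output old = *this;
    ++out_;
    return old;
  }

private:
  OutIt out_;  // where transformed values are written
  F f_;        // applied to each value before it is stored
};

}  // namespace sketch

struct times_ten {
  int operator()(int x) const { return 10 * x; }
};

inline std::vector<int> demo_transforming_output() {
  const std::vector<int> in{1, 2, 3};
  std::vector<int> out(in.size());
  using out_iter = sketch::transforming_output<times_ten, std::vector<int>::iterator>;
  // A plain copy becomes a transform because the destination applies f.
  std::copy(in.begin(), in.end(), out_iter(out.begin(), times_ten{}));
  return out;
}
```

Placing the function on the output side lets any algorithm that writes through an iterator apply a final transformation without an intermediate buffer.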
New Examples
- `transform_output_iterator` demonstrates use of the new fancy iterator `thrust::transform_output_iterator`.
Other Enhancements
- When C++11 is enabled, functors do not have to inherit from `thrust::(unary|binary)_function` anymore to be used with `thrust::transform_iterator`.
- Added C++11 only move constructors and move assignment operators for `thrust::detail::vector_base`-based classes, e.g. `thrust::host_vector`, `thrust::device_vector`, and friends.
Bug Fixes
- `sin(thrust::complex<double>)` no longer has precision loss to float.
Acknowledgments
- Thanks to Manuel Schiller for contributing a C++11 based enhancement regarding the deduction of functor return types, improving the performance of `thrust::unique` and implementing `thrust::transform_output_iterator`.
- Thanks to Thibault Notargiacomo for the implementation of move semantics for the `thrust::vector_base`-based classes.
- Thanks to Duane Merrill for developing CUB and helping to integrate it into Thrust's backend.