Releases: NVIDIA/thrust
Thrust 1.5.1 (CUDA Toolkit 4.1)
Thrust 1.5.1 is a minor bug fix release.
Bug Fixes
- Sorting data referenced by permutation_iterators on CUDA produces invalid results
Thrust 1.5.0
Thrust 1.5.0 introduces new programmer productivity and performance enhancements. New functionality for creating anonymous "lambda" functions has been added. A faster host sort provides 2-10x faster performance for sorting arithmetic types on (single-threaded) CPUs. A new OpenMP sort provides 2.5x-3.0x speedup over the host sort using a quad-core CPU. When sorting arithmetic types with the OpenMP backend the combined performance improvement is 5.9x for 32-bit integers and ranges from 3.0x (64-bit types) to 14.2x (8-bit types). A new CUDA reduce_by_key implementation provides 2-3x faster performance.
Breaking Changes
- device_ptr<void> no longer unsafely converts to device_ptr<T> without an explicit cast. Use the expression device_pointer_cast(static_cast<int*>(void_ptr.get())) to convert, for example, device_ptr<void> to device_ptr<int>.
New Features
- Algorithms:
  - Stencil-less thrust::transform_if.
- Lambda placeholders
New Examples
- lambda
Other Enhancements
- Host sort is 2-10x faster for arithmetic types
- OMP sort provides speedup over host sort
- reduce_by_key is 2-3x faster
- reduce_by_key no longer requires O(N) temporary storage
- CUDA scan algorithms are 10-40% faster
- host_vector and device_vector are now documented
- out-of-memory exceptions now provide detailed information from CUDART
- improved histogram example
- device_reference now has a specialized swap
- reduce_by_key and scan algorithms are compatible with discard_iterator
Bug Fixes
- #44 allow host_vector to compile when value_type uses __align__
- #198 allow adjacent_difference to permit safe in-situ operation
- #303 make thrust thread-safe
- #313 avoid race conditions in device_vector::insert
- #314 avoid unintended ADL invocation when dispatching copy
- #365 fix merge and set operation failures
Known Issues
- None
Acknowledgments
- Thanks to Manjunath Kudlur for contributing his Carbon library, from which the lambda functionality is derived.
- Thanks to Jean-Francois Bastien for suggesting a fix for #303.
Thrust 1.4.0 (CUDA Toolkit 4.0)
Thrust 1.4.0 is the first release of Thrust to be included in the CUDA Toolkit. Additionally, it brings many feature and performance improvements. New set theoretic algorithms operating on sorted sequences have been added. Additionally, a new fancy iterator allows discarding redundant or otherwise unnecessary output from algorithms, conserving memory storage and bandwidth.
Breaking Changes
- Eliminations
  - thrust/is_sorted.h
  - thrust/utility.h
  - thrust/set_intersection.h
  - thrust/experimental/cuda/ogl_interop_allocator.h and the functionality therein
  - thrust::deprecated::copy_when
  - thrust::deprecated::absolute_value
- thrust::gather and thrust::scatter from host to device and vice versa are no longer supported.
- Operations which modify the elements of a thrust::device_vector are no longer available from source code compiled without nvcc when the device backend is CUDA. Instead, use the idiom from the cpp_interop example.
New Features
- Algorithms:
  - thrust::copy_n
  - thrust::merge
  - thrust::set_difference
  - thrust::set_symmetric_difference
  - thrust::set_union
- Types:
  - thrust::discard_iterator
- Device Support:
  - Compute Capability 2.1 GPUs.
New Examples
- run_length_decoding
Other Enhancements
- Compilation warnings are substantially reduced in various contexts.
- The compilation time of thrust::sort, thrust::stable_sort, thrust::sort_by_key, and thrust::stable_sort_by_key is substantially reduced.
- A fast sort implementation is used when sorting primitive types with thrust::greater.
- The performance of thrust::set_intersection is improved.
- The performance of thrust::fill is improved on SM 1.x devices.
- A code example is now provided in each algorithm's documentation.
- thrust::reverse now operates in-place
Bug Fixes
- #212: thrust::set_intersection works correctly for large input sizes.
- #275: thrust::counting_iterator and thrust::constant_iterator work correctly with OpenMP as the backend when compiling with optimization.
- #256: min and max correctly return their first argument as a tie-breaker
- #248: NDEBUG is interpreted incorrectly
Known Issues
- NVCC may generate code containing warnings when compiling some Thrust algorithms.
- When compiling with -arch=sm_1x, some Thrust algorithms may cause NVCC to issue benign pointer advisories.
- When compiling with -arch=sm_1x and -G, some Thrust algorithms may fail to execute correctly.
- thrust::inclusive_scan, thrust::exclusive_scan, thrust::inclusive_scan_by_key, and thrust::exclusive_scan_by_key are currently incompatible with thrust::discard_iterator.
Acknowledgments
- Thanks to David Tarjan for improving the performance of set_intersection.
- Thanks to Duane Merrill for continued help with sort.
- Thanks to Nathan Whitehead for help with CUDA Toolkit integration.
Thrust 1.3.0
Thrust 1.3.0 provides support for CUDA Toolkit 3.2 in addition to many feature and performance enhancements. Performance of the sort and sort_by_key algorithms is improved by as much as 3x in certain situations. The performance of stream compaction algorithms, such as copy_if, is improved by as much as 2x. CUDA errors are now converted to runtime exceptions using the system_error interface. Combined with a debug mode, also new in 1.3, runtime errors can be located with greater precision. Lastly, a few header files have been consolidated or renamed for clarity. See the deprecations section below for additional details.
Breaking Changes
- Promotions
- thrust::experimental::inclusive_segmented_scan has been renamed thrust::inclusive_scan_by_key and exposes a different interface
- thrust::experimental::exclusive_segmented_scan has been renamed thrust::exclusive_scan_by_key and exposes a different interface
- thrust::experimental::partition_copy has been renamed thrust::partition_copy and exposes a different interface
- thrust::next::gather has been renamed thrust::gather
- thrust::next::gather_if has been renamed thrust::gather_if
- thrust::unique_copy_by_key has been renamed thrust::unique_by_key_copy
- Deprecations
- thrust::copy_when has been renamed thrust::deprecated::copy_when
- thrust::absolute_value has been renamed thrust::deprecated::absolute_value
- The header thrust/set_intersection.h is now deprecated; use thrust/set_operations.h instead
- The header thrust/utility.h is now deprecated; use thrust/swap.h instead
- The header thrust/swap_ranges.h is now deprecated; use thrust/swap.h instead
- Eliminations
- thrust::deprecated::gather
- thrust::deprecated::gather_if
- thrust/experimental/arch.h and the functions therein
- thrust/sorting/merge_sort.h
- thrust/sorting/radix_sort.h
- NVCC 2.3 is no longer supported
New Features
- Algorithms:
  - thrust::exclusive_scan_by_key
  - thrust::find
  - thrust::find_if
  - thrust::find_if_not
  - thrust::inclusive_scan_by_key
  - thrust::is_partitioned
  - thrust::is_sorted_until
  - thrust::mismatch
  - thrust::partition_point
  - thrust::reverse
  - thrust::reverse_copy
  - thrust::stable_partition_copy
- Types:
  - thrust::system_error and related types.
  - thrust::experimental::cuda::ogl_interop_allocator.
  - thrust::bit_and, thrust::bit_or, and thrust::bit_xor.
- Device Support:
  - GF104-based GPUs.
New Examples
- opengl_interop.cu
- repeated_range.cu
- simple_moving_average.cu
- sparse_vector.cu
- strided_range.cu
Other Enhancements
- Performance of thrust::sort and thrust::sort_by_key is substantially improved for primitive key types
- Performance of thrust::copy_if is substantially improved
- Performance of thrust::reduce and related reductions is improved
- THRUST_DEBUG mode added
- Callers of Thrust functions may detect error conditions by catching thrust::system_error, which derives from std::runtime_error
- The number of compiler warnings generated by Thrust has been substantially reduced
- Comparison sort now works correctly for input sizes > 32M
- min & max usage no longer collides with <windows.h> definitions
- Compiling against the OpenMP backend no longer requires nvcc
- Performance of device_vector initialized in .cpp files is substantially improved in common cases
- Performance of thrust::sort_by_key on the host is substantially improved
Bug Fixes
- Debug device code now compiles correctly
- thrust::uninitialized_copy and thrust::uninitialized_fill now dispatch constructors on the device rather than the host
Known Issues
- #212 set_intersection is known to fail for large input sizes
- partition_point is known to fail for 64b types with nvcc 3.2
Acknowledgments
- Thanks to Duane Merrill for contributing a fast CUDA radix sort implementation
- Thanks to Erich Elsen for contributing an implementation of find_if
- Thanks to Andrew Corrigan for contributing changes which allow the OpenMP backend to compile in the absence of nvcc
- Thanks to Andrew Corrigan, Cliff Woolley, David Coeurjolly, Janick Martinez Esturo, John Bowers, Maxim Naumov, Michael Garland, and Ryuta Suzuki for bug reports
- Thanks to Cliff Woolley for help with testing
Thrust 1.2.1
Thrust 1.2.1 is a small bug fix release that is compatible with the CUDA Toolkit 3.1 release.
Known Issues
- thrust::inclusive_scan and thrust::exclusive_scan may fail with very large types.
- MSVC may fail to compile code using both sort and binary search algorithms.
- thrust::uninitialized_fill and thrust::uninitialized_copy dispatch constructors on the host rather than the device.
- #109: Some algorithms may exhibit poor performance with the OpenMP backend with large numbers (>= 6) of CPU threads.
- thrust::default_random_engine::discard is not accelerated with NVCC 2.3
- NVCC 3.1 may fail to compile code using types derived from thrust::subtract_with_carry_engine, such as thrust::ranlux24 and thrust::ranlux48.
Thrust 1.2.0
Thrust 1.2.0 introduces support for compilation to multicore CPUs and the Ocelot virtual machine, and several new facilities for pseudo-random number generation. New algorithms such as set intersection and segmented reduction have also been added. Lastly, improvements to the robustness of the CUDA backend ensure correctness across a broad set of (uncommon) use cases.
Breaking Changes
- thrust::gather's interface was incorrect and has been removed. The old interface is deprecated but will be preserved for Thrust version 1.2 at thrust::deprecated::gather & thrust::deprecated::gather_if. The new interface is provided at thrust::next::gather & thrust::next::gather_if. The new interface will be promoted to thrust:: in Thrust version 1.3. For more details, please refer to this thread.
- The thrust::sorting namespace has been deprecated in favor of the top-level sorting functions, such as thrust::sort and thrust::sort_by_key.
- Removed support for thrust::equal between host & device sequences.
- Removed support for thrust::scatter between host & device sequences.
New Features
- Algorithms:
  - thrust::reduce_by_key
  - thrust::set_intersection
  - thrust::unique_copy
  - thrust::unique_by_key
  - thrust::unique_copy_by_key
- Types
- Random Number Generation:
  - thrust::discard_block_engine
  - thrust::default_random_engine
  - thrust::linear_congruential_engine
  - thrust::linear_feedback_shift_engine
  - thrust::subtract_with_carry_engine
  - thrust::xor_combine_engine
  - thrust::minstd_rand
  - thrust::minstd_rand0
  - thrust::ranlux24
  - thrust::ranlux48
  - thrust::ranlux24_base
  - thrust::ranlux48_base
  - thrust::taus88
  - thrust::uniform_int_distribution
  - thrust::uniform_real_distribution
  - thrust::normal_distribution (experimental)
- Function Objects:
  - thrust::project1st
  - thrust::project2nd
  - thrust::tie
- Fancy Iterators:
  - thrust::permutation_iterator
  - thrust::reverse_iterator
- Vector Functions:
  - operator!=
  - rbegin
  - crbegin
  - rend
  - crend
  - data
  - shrink_to_fit
- Device Support:
- Multicore CPUs via OpenMP.
- Fermi-class GPUs.
- Ocelot virtual machines.
- Support for NVCC 3.0.
New Examples
- cpp_integration
- histogram
- mode
- monte_carlo
- monte_carlo_disjoint_sequences
- padded_grid_reduction
- permutation_iterator
- row_sum
- run_length_encoding
- segmented_scan
- stream_compaction
- summary_statistics
- transform_iterator
- word_count
Other Enhancements
- Integer sorting performance is improved when max is large but (max - min) is small and when min is negative
- Performance of thrust::inclusive_scan and thrust::exclusive_scan is improved by 20-25% for primitive types.
Bug Fixes
- #8 cause a compiler error if the required compiler is not found rather than a mysterious error at link time
- #42 device_ptr & device_reference are classes rather than structs, eliminating warnings on certain platforms
- #46 gather & scatter handle any space iterators correctly
- #51 thrust::experimental::arch functions gracefully handle unrecognized GPUs
- #52 avoid collisions with common user macros such as BLOCK_SIZE
- #62 provide better documentation for device_reference
- #68 allow built-in CUDA vector types to work with device_vector in pure C++ mode
- #102 eliminated a race condition in device_vector::erase
- various compilation warnings eliminated
Known Issues
- inclusive_scan & exclusive_scan may fail with very large types
- the Microsoft compiler may fail to compile code using both sort and binary search algorithms
- uninitialized_fill & uninitialized_copy dispatch constructors on the host rather than the device
- #109 some algorithms may exhibit poor performance with the OpenMP backend with large numbers (>= 6) of CPU threads
- default_random_engine::discard is not accelerated with nvcc 2.3
Acknowledgments
- Thanks to Gregory Diamos for contributing a CUDA implementation of set_intersection
- Thanks to Ryuta Suzuki & Gregory Diamos for rigorously testing Thrust's unit tests and examples against Ocelot
- Thanks to Tom Bradley for contributing an implementation of normal_distribution
- Thanks to Joseph Rhoads for contributing the example summary_statistics
Thrust 1.1.1
Thrust 1.1.1 is a small bug fix release that is compatible with the CUDA Toolkit 2.3a release and Mac OS X Snow Leopard.
Thrust 1.1.0
Thrust 1.1.0 introduces fancy iterators, binary search functions, and several specialized reduction functions. Experimental support for segmented scans has also been added.
Breaking Changes
- thrust::counting_iterator has been moved into the thrust namespace (previously thrust::experimental).
New Features
- Algorithms:
  - thrust::copy_if
  - thrust::lower_bound
  - thrust::upper_bound
  - vectorized thrust::lower_bound
  - vectorized thrust::upper_bound
  - thrust::equal_range
  - thrust::binary_search
  - vectorized thrust::binary_search
  - thrust::all_of
  - thrust::any_of
  - thrust::none_of
  - thrust::minmax_element
  - thrust::advance
  - thrust::inclusive_segmented_scan (experimental)
  - thrust::exclusive_segmented_scan (experimental)
- Types:
  - thrust::pair
  - thrust::tuple
  - thrust::device_malloc_allocator
- Fancy Iterators:
  - thrust::constant_iterator
  - thrust::counting_iterator
  - thrust::transform_iterator
  - thrust::zip_iterator
New Examples
- Computing the maximum absolute difference between vectors.
- Computing the bounding box of a two-dimensional point set.
- Sorting multiple arrays together (lexicographical sorting).
- Constructing a summed area table.
- Using thrust::zip_iterator to mimic an array of structs.
- Using thrust::constant_iterator to increment array values.
Other Enhancements
- Added pinned memory allocator (experimental).
- Added more methods to host_vector & device_vector (issue #4).
- Added variant of remove_if with a stencil argument (issue #29).
- Scan and reduce use cudaFuncGetAttributes to determine grid size.
- Exceptions are reported when temporary device arrays cannot be allocated.
Bug Fixes
Thrust 1.0.0
First production release of Thrust.
Breaking Changes
- Rename top level namespace komrade to thrust.
- Move thrust::partition_copy & thrust::stable_partition_copy into thrust::experimental namespace until we can easily provide the standard interface.
- Rename thrust::range to thrust::sequence to avoid collision with Boost.Range.
- Rename thrust::copy_if to thrust::copy_when due to semantic differences with C++0x copy_if.
New Features
- Add C++0x style cbegin & cend methods to thrust::host_vector and thrust::device_vector.
- Add thrust::transform_if function.
- Add stencil versions of thrust::replace_if & thrust::replace_copy_if.
- Allow counting_iterator to work with thrust::for_each.
- Allow types with constructors in comparison thrust::sort and thrust::reduce.
Other Enhancements
- thrust::merge_sort and thrust::stable_merge_sort are now 2x to 5x faster when executed on the parallel device.
Bug Fixes
- Komrade 6: Work around an issue where an incremented iterator causes NVCC to crash.
- Komrade 7: Fix an issue where const_iterators could not be passed to thrust::transform.