This repository has been archived by the owner on Mar 21, 2024. It is now read-only.

Releases: NVIDIA/thrust

Thrust 1.5.1 (CUDA Toolkit 4.1)

16 May 09:48

Thrust 1.5.1 is a minor bug fix release.

Bug Fixes

  • Sorting data referenced by permutation_iterators on CUDA produces invalid results

Thrust 1.5.0

16 May 09:44

Thrust 1.5.0 introduces new programmer productivity and performance enhancements. New functionality for creating anonymous "lambda" functions has been added. A faster host sort provides 2-10x faster performance for sorting arithmetic types on (single-threaded) CPUs. A new OpenMP sort provides a 2.5x-3.0x speedup over the host sort on a quad-core CPU. When sorting arithmetic types with the OpenMP backend, the combined performance improvement is 5.9x for 32-bit integers and ranges from 3.0x (64-bit types) to 14.2x (8-bit types). A new CUDA reduce_by_key implementation provides 2-3x faster performance.

Breaking Changes

  • device_ptr<void> no longer unsafely converts to device_ptr<T> without an explicit cast. Use the expression device_pointer_cast(static_cast<int*>(void_ptr.get())) to convert, for example, device_ptr<void> to device_ptr<int>.

New Features

  • Algorithms:
    • Stencil-less thrust::transform_if.
  • Lambda placeholders
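
A minimal sequential sketch of what the stencil-less form of transform_if computes, assuming the predicate is evaluated on the input element itself rather than on a separate stencil sequence. It is written in plain standard C++ so it runs without CUDA; transform_if_ref and negate_odds are hypothetical names used for illustration and are not Thrust's implementation.

```cpp
#include <vector>

// Sequential reference: apply `op` only where `pred` holds on the input
// element; output positions whose input fails `pred` are left untouched.
template <typename InputIt, typename OutputIt, typename UnaryOp, typename Pred>
OutputIt transform_if_ref(InputIt first, InputIt last, OutputIt out,
                          UnaryOp op, Pred pred) {
    for (; first != last; ++first, ++out) {
        if (pred(*first)) {
            *out = op(*first);
        }
    }
    return out;
}

// Hypothetical helper: negate the odd entries of a vector in place.
std::vector<int> negate_odds(std::vector<int> v) {
    transform_if_ref(v.begin(), v.end(), v.begin(),
                     [](int x) { return -x; },
                     [](int x) { return x % 2 != 0; });
    return v;
}
```

With the input {1, 2, 3, 4, -5}, only the odd entries are rewritten and the even ones keep their original values.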

New Examples

  • lambda

Other Enhancements

  • Host sort is 2-10x faster for arithmetic types
  • OpenMP sort provides a 2.5x-3.0x speedup over host sort on a quad-core CPU
  • reduce_by_key is 2-3x faster
  • reduce_by_key no longer requires O(N) temporary storage
  • CUDA scan algorithms are 10-40% faster
  • host_vector and device_vector are now documented
  • out-of-memory exceptions now provide detailed information from CUDART
  • improved histogram example
  • device_reference now has a specialized swap
  • reduce_by_key and scan algorithms are compatible with discard_iterator

Bug Fixes

  • #44 allow host_vector to compile when value_type uses __align__
  • #198 allow adjacent_difference to permit safe in-situ operation
  • #303 make thrust thread-safe
  • #313 avoid race conditions in device_vector::insert
  • #314 avoid unintended adl invocation when dispatching copy
  • #365 fix merge and set operation failures

Known Issues

  • None

Acknowledgments

  • Thanks to Manjunath Kudlur for contributing his Carbon library, from which the lambda functionality is derived.
  • Thanks to Jean-Francois Bastien for suggesting a fix for #303.

Thrust 1.4.0 (CUDA Toolkit 4.0)

16 May 09:42

Thrust 1.4.0 is the first release of Thrust to be included in the CUDA Toolkit. It also brings many feature and performance improvements. New set theoretic algorithms operating on sorted sequences have been added, and a new fancy iterator allows discarding redundant or otherwise unnecessary output from algorithms, conserving memory storage and bandwidth.
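
To illustrate what "set theoretic algorithms operating on sorted sequences" means, here is a small sketch using the standard library counterparts (std::set_union, std::set_difference, std::set_symmetric_difference). It assumes the Thrust algorithms of the same names follow the same input contract, namely that both ranges are already sorted. The wrapper names are hypothetical.

```cpp
#include <algorithm>
#include <iterator>
#include <vector>

// Each wrapper expects both inputs to be sorted in ascending order.
std::vector<int> sorted_union(const std::vector<int>& a, const std::vector<int>& b) {
    std::vector<int> out;
    std::set_union(a.begin(), a.end(), b.begin(), b.end(), std::back_inserter(out));
    return out;
}

std::vector<int> sorted_difference(const std::vector<int>& a, const std::vector<int>& b) {
    std::vector<int> out;
    std::set_difference(a.begin(), a.end(), b.begin(), b.end(), std::back_inserter(out));
    return out;
}

std::vector<int> sorted_symmetric_difference(const std::vector<int>& a,
                                             const std::vector<int>& b) {
    std::vector<int> out;
    std::set_symmetric_difference(a.begin(), a.end(), b.begin(), b.end(),
                                  std::back_inserter(out));
    return out;
}
```

For a = {1, 3, 5, 7} and b = {3, 4, 7, 9}, the union is {1, 3, 4, 5, 7, 9}, the difference a minus b is {1, 5}, and the symmetric difference is {1, 4, 5, 9}.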

Breaking Changes

  • Eliminations
    • thrust/is_sorted.h
    • thrust/utility.h
    • thrust/set_intersection.h
    • thrust/experimental/cuda/ogl_interop_allocator.h and the functionality therein
    • thrust::deprecated::copy_when
    • thrust::deprecated::absolute_value
  • thrust::gather and thrust::scatter from host to device and vice versa are no longer supported.
  • Operations which modify the elements of a thrust::device_vector are no longer available from source code compiled without nvcc when the device backend is CUDA. Instead, use the idiom from the cpp_interop example.

New Features

  • Algorithms:

    • thrust::copy_n
    • thrust::merge
    • thrust::set_difference
    • thrust::set_symmetric_difference
    • thrust::set_union
  • Types

    • thrust::discard_iterator
  • Device Support:

    • Compute Capability 2.1 GPUs.
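
The idea behind a discarding output iterator can be sketched in standard C++: an output iterator that accepts writes and drops them, so an algorithm can run without any output storage. The class below is a hypothetical stand-in written for illustration, not thrust::discard_iterator itself; the count() member is an assumption added so the example has something observable to return.

```cpp
#include <algorithm>
#include <cstddef>
#include <iterator>
#include <vector>

// Hypothetical stand-in for a discarding output iterator: every assigned value
// is ignored, and only the number of increments is remembered.
class counting_discard_iterator {
public:
    using iterator_category = std::output_iterator_tag;
    using value_type = void;
    using difference_type = std::ptrdiff_t;
    using pointer = void;
    using reference = void;

    // Proxy returned by operator*: assignment to it does nothing.
    struct sink {
        template <typename T>
        const sink& operator=(const T&) const { return *this; }
    };

    sink operator*() const { return sink{}; }
    counting_discard_iterator& operator++() { ++count_; return *this; }
    counting_discard_iterator operator++(int) {
        counting_discard_iterator old = *this;
        ++count_;
        return old;
    }
    std::size_t count() const { return count_; }

private:
    std::size_t count_ = 0;
};

// Count distinct values in a sorted range without storing the unique elements.
std::size_t count_unique(const std::vector<int>& sorted) {
    return std::unique_copy(sorted.begin(), sorted.end(),
                            counting_discard_iterator{}).count();
}
```

The unique elements are never written anywhere; only the position of the returned iterator is used.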

New Examples

  • run_length_decoding

Other Enhancements

  • Compilation warnings are substantially reduced in various contexts.
  • The compilation times of thrust::sort, thrust::stable_sort, thrust::sort_by_key, and thrust::stable_sort_by_key are substantially reduced.
  • A fast sort implementation is used when sorting primitive types with thrust::greater.
  • The performance of thrust::set_intersection is improved.
  • The performance of thrust::fill is improved on SM 1.x devices.
  • A code example is now provided in each algorithm's documentation.
  • thrust::reverse now operates in-place

Bug Fixes

  • #212: thrust::set_intersection works correctly for large input sizes.
  • #275: thrust::counting_iterator and thrust::constant_iterator work correctly with OpenMP as the backend when compiling with optimization.
  • #256: min and max correctly return their first argument as a tie-breaker
  • #248: NDEBUG is now interpreted correctly

Known Issues

  • NVCC may generate code containing warnings when compiling some Thrust algorithms.
  • When compiling with -arch=sm_1x, some Thrust algorithms may cause NVCC to issue benign pointer advisories.
  • When compiling with -arch=sm_1x and -G, some Thrust algorithms may fail to execute correctly.
  • thrust::inclusive_scan, thrust::exclusive_scan, thrust::inclusive_scan_by_key, and thrust::exclusive_scan_by_key are currently incompatible with thrust::discard_iterator.

Acknowledgments

  • Thanks to David Tarjan for improving the performance of set_intersection.
  • Thanks to Duane Merrill for continued help with sort.
  • Thanks to Nathan Whitehead for help with CUDA Toolkit integration.

Thrust 1.3.0

16 May 09:41

Thrust 1.3.0 provides support for CUDA Toolkit 3.2 in addition to many feature and performance enhancements. Performance of the sort and sort_by_key algorithms is improved by as much as 3x in certain situations. The performance of stream compaction algorithms, such as copy_if, is improved by as much as 2x. CUDA errors are now converted to runtime exceptions using the system_error interface. Combined with a debug mode, also new in 1.3, runtime errors can be located with greater precision. Lastly, a few header files have been consolidated or renamed for clarity. See the deprecations section below for additional details.

Breaking Changes

  • Promotions
    • thrust::experimental::inclusive_segmented_scan has been renamed thrust::inclusive_scan_by_key and exposes a different interface
    • thrust::experimental::exclusive_segmented_scan has been renamed thrust::exclusive_scan_by_key and exposes a different interface
    • thrust::experimental::partition_copy has been renamed thrust::partition_copy and exposes a different interface
    • thrust::next::gather has been renamed thrust::gather
    • thrust::next::gather_if has been renamed thrust::gather_if
    • thrust::unique_copy_by_key has been renamed thrust::unique_by_key_copy
  • Deprecations
    • thrust::copy_when has been renamed thrust::deprecated::copy_when
    • thrust::absolute_value has been renamed thrust::deprecated::absolute_value
    • The header thrust/set_intersection.h is now deprecated; use thrust/set_operations.h instead
    • The header thrust/utility.h is now deprecated; use thrust/swap.h instead
    • The header thrust/swap_ranges.h is now deprecated; use thrust/swap.h instead
  • Eliminations
    • thrust::deprecated::gather
    • thrust::deprecated::gather_if
    • thrust/experimental/arch.h and the functions therein
    • thrust/sorting/merge_sort.h
    • thrust/sorting/radix_sort.h
  • NVCC 2.3 is no longer supported

New Features

  • Algorithms:

    • thrust::exclusive_scan_by_key
    • thrust::find
    • thrust::find_if
    • thrust::find_if_not
    • thrust::inclusive_scan_by_key
    • thrust::is_partitioned
    • thrust::is_sorted_until
    • thrust::mismatch
    • thrust::partition_point
    • thrust::reverse
    • thrust::reverse_copy
    • thrust::stable_partition_copy
  • Types:

    • thrust::system_error and related types.
    • thrust::experimental::cuda::ogl_interop_allocator.
    • thrust::bit_and, thrust::bit_or, and thrust::bit_xor.
  • Device Support:

    • GF104-based GPUs.

New Examples

  • opengl_interop.cu
  • repeated_range.cu
  • simple_moving_average.cu
  • sparse_vector.cu
  • strided_range.cu

Other Enhancements

  • Performance of thrust::sort and thrust::sort_by_key is substantially improved for primitive key types
  • Performance of thrust::copy_if is substantially improved
  • Performance of thrust::reduce and related reductions is improved
  • THRUST_DEBUG mode added
  • Callers of Thrust functions may detect error conditions by catching thrust::system_error, which derives from std::runtime_error
  • The number of compiler warnings generated by Thrust has been substantially reduced
  • Comparison sort now works correctly for input sizes > 32M
  • min & max usage no longer collides with <windows.h> definitions
  • Compiling against the OpenMP backend no longer requires nvcc
  • Performance of device_vector initialized in .cpp files is substantially improved in common cases
  • Performance of thrust::sort_by_key on the host is substantially improved
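
The error-handling pattern described above can be sketched with the standard library's std::system_error, which also derives from std::runtime_error. This assumes thrust::system_error can be caught in the same two ways. failing_operation is a hypothetical function that simulates a failure, and the error code it throws is chosen only for illustration.

```cpp
#include <stdexcept>
#include <system_error>

// Hypothetical operation that reports failure through a system_error.
void failing_operation() {
    throw std::system_error(std::make_error_code(std::errc::not_enough_memory),
                            "allocation failed");
}

// Because system_error derives from runtime_error, a generic handler catches it.
bool caught_as_runtime_error() {
    try {
        failing_operation();
    } catch (const std::runtime_error&) {
        return true;
    }
    return false;
}

// Catching the specific type gives access to the underlying error code.
std::error_code failure_code() {
    try {
        failing_operation();
    } catch (const std::system_error& e) {
        return e.code();
    }
    return std::error_code();
}
```

Callers that only need to know that something failed can catch the base class; callers that need to branch on the cause can catch the specific type and inspect code().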

Bug Fixes

  • Debug device code now compiles correctly
  • thrust::uninitialized_copy and thrust::uninitialized_fill now dispatch constructors on the device rather than the host

Known Issues

  • #212 set_intersection is known to fail for large input sizes
  • partition_point is known to fail for 64-bit types with nvcc 3.2

Acknowledgments

  • Thanks to Duane Merrill for contributing a fast CUDA radix sort implementation
  • Thanks to Erich Elsen for contributing an implementation of find_if
  • Thanks to Andrew Corrigan for contributing changes which allow the OpenMP backend to compile in the absence of nvcc
  • Thanks to Andrew Corrigan, Cliff Woolley, David Coeurjolly, Janick Martinez Esturo, John Bowers, Maxim Naumov, Michael Garland, and Ryuta Suzuki for bug reports
  • Thanks to Cliff Woolley for help with testing

Thrust 1.2.1

16 May 09:32

Thrust 1.2.1 is a small bug fix release that is compatible with the CUDA Toolkit 3.1 release.

Known Issues

  • thrust::inclusive_scan and thrust::exclusive_scan may fail with very large types.
  • MSVC may fail to compile code using both sort and binary search algorithms.
  • thrust::uninitialized_fill and thrust::uninitialized_copy dispatch constructors on the host rather than the device.
  • #109: Some algorithms may exhibit poor performance with the OpenMP backend with large numbers (>= 6) of CPU threads.
  • thrust::default_random_engine::discard is not accelerated with NVCC 2.3
  • NVCC 3.1 may fail to compile code using types derived from thrust::subtract_with_carry_engine, such as thrust::ranlux24 and thrust::ranlux48.

Thrust 1.2.0

16 May 09:17

Thrust 1.2.0 introduces support for compilation to multicore CPUs and the Ocelot virtual machine, and several new facilities for pseudo-random number generation. New algorithms such as set intersection and segmented reduction have also been added. Lastly, improvements to the robustness of the CUDA backend ensure correctness across a broad set of (uncommon) use cases.
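
One common use of an engine's discard member is to give each parallel worker its own non-overlapping slice of a single random stream. The sketch below shows the idea with std::minstd_rand from the standard library, assuming the Thrust engine of the same name behaves the same way; draw_block is a hypothetical helper.

```cpp
#include <cstddef>
#include <random>
#include <vector>

// Draw `count` values starting `offset` steps into the stream produced by a
// fixed seed. Workers using disjoint [offset, offset + count) windows of the
// same seed see disjoint parts of one sequence.
std::vector<std::minstd_rand::result_type>
draw_block(unsigned seed, unsigned long long offset, std::size_t count) {
    std::minstd_rand rng(seed);
    rng.discard(offset);  // skip the values that belong to earlier workers
    std::vector<std::minstd_rand::result_type> out(count);
    for (auto& x : out) {
        x = rng();
    }
    return out;
}
```

Drawing four values at offset 4 yields the same numbers as the last four of eight values drawn at offset 0 from the same seed.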

Breaking Changes

  • thrust::gather's interface was incorrect and has been removed. The old interface is deprecated but will be preserved for Thrust version 1.2 at thrust::deprecated::gather & thrust::deprecated::gather_if. The new interface is provided at thrust::next::gather & thrust::next::gather_if. The new interface will be promoted to thrust:: in Thrust version 1.3. For more details, please refer to this thread.
  • The thrust::sorting namespace has been deprecated in favor of the top-level sorting functions, such as thrust::sort and thrust::sort_by_key.
  • Removed support for thrust::equal between host & device sequences.
  • Removed support for thrust::scatter between host & device sequences.

New Features

  • Algorithms:
    • thrust::reduce_by_key
    • thrust::set_intersection
    • thrust::unique_copy
    • thrust::unique_by_key
    • thrust::unique_copy_by_key
  • Types
  • Random Number Generation:
    • thrust::discard_block_engine
    • thrust::default_random_engine
    • thrust::linear_congruential_engine
    • thrust::linear_feedback_shift_engine
    • thrust::subtract_with_carry_engine
    • thrust::xor_combine_engine
    • thrust::minstd_rand
    • thrust::minstd_rand0
    • thrust::ranlux24
    • thrust::ranlux48
    • thrust::ranlux24_base
    • thrust::ranlux48_base
    • thrust::taus88
    • thrust::uniform_int_distribution
    • thrust::uniform_real_distribution
    • thrust::normal_distribution (experimental)
  • Function Objects:
    • thrust::project1st
    • thrust::project2nd
  • thrust::tie
  • Fancy Iterators:
    • thrust::permutation_iterator
    • thrust::reverse_iterator
  • Vector Functions:
    • operator!=
    • rbegin
    • crbegin
    • rend
    • crend
    • data
    • shrink_to_fit
  • Device Support:
    • Multicore CPUs via OpenMP.
    • Fermi-class GPUs.
    • Ocelot virtual machines.
  • Support for NVCC 3.0.
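
A sequential sketch of segmented reduction in the style of reduce_by_key, assuming that each run of consecutive equal keys is collapsed into one output key and the sum of its values. The function name is hypothetical and the code is a plain C++ reference, not Thrust's implementation. It assumes keys and values have the same length.

```cpp
#include <cstddef>
#include <utility>
#include <vector>

// Sequential reference: collapse each run of equal adjacent keys into a single
// (key, sum) pair. Precondition (assumed): keys.size() == values.size().
std::pair<std::vector<int>, std::vector<int>>
reduce_by_key_ref(const std::vector<int>& keys, const std::vector<int>& values) {
    std::vector<int> out_keys;
    std::vector<int> out_sums;
    for (std::size_t i = 0; i < keys.size(); ++i) {
        if (i > 0 && keys[i] == keys[i - 1]) {
            out_sums.back() += values[i];  // same run: accumulate
        } else {
            out_keys.push_back(keys[i]);   // new run: start a fresh sum
            out_sums.push_back(values[i]);
        }
    }
    return {out_keys, out_sums};
}
```

With keys {1, 3, 3, 3, 2, 2, 1} and values {9, 8, 7, 6, 5, 4, 3}, the output keys are {1, 3, 2, 1} and the output sums are {9, 21, 9, 3}. The trailing key 1 is not merged with the leading one because the two are not adjacent.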

New Examples

  • cpp_integration
  • histogram
  • mode
  • monte_carlo
  • monte_carlo_disjoint_sequences
  • padded_grid_reduction
  • permutation_iterator
  • row_sum
  • run_length_encoding
  • segmented_scan
  • stream_compaction
  • summary_statistics
  • transform_iterator
  • word_count

Other Enhancements

  • Integer sorting performance is improved when max is large but (max - min) is small and when min is negative
  • Performance of thrust::inclusive_scan and thrust::exclusive_scan is improved by 20-25% for primitive types.

Bug Fixes

  • #8 cause a compiler error if the required compiler is not found rather than a mysterious error at link time
  • #42 device_ptr & device_reference are classes rather than structs, eliminating warnings on certain platforms
  • #46 gather & scatter handle any space iterators correctly
  • #51 thrust::experimental::arch functions gracefully handle unrecognized GPUs
  • #52 avoid collisions with common user macros such as BLOCK_SIZE
  • #62 provide better documentation for device_reference
  • #68 allow built-in CUDA vector types to work with device_vector in pure C++ mode
  • #102 eliminated a race condition in device_vector::erase
  • various compilation warnings eliminated

Known Issues

  • inclusive_scan & exclusive_scan may fail with very large types
  • the Microsoft compiler may fail to compile code using both sort and binary search algorithms
  • uninitialized_fill & uninitialized_copy dispatch constructors on the host rather than the device
  • #109 some algorithms may exhibit poor performance with the OpenMP backend with large numbers (>= 6) of CPU threads
  • default_random_engine::discard is not accelerated with nvcc 2.3

Acknowledgments

  • Thanks to Gregory Diamos for contributing a CUDA implementation of set_intersection
  • Thanks to Ryuta Suzuki & Gregory Diamos for rigorously testing Thrust's unit tests and examples against Ocelot
  • Thanks to Tom Bradley for contributing an implementation of normal_distribution
  • Thanks to Joseph Rhoads for contributing the example summary_statistics

Thrust 1.1.1

16 May 09:15

Thrust 1.1.1 is a small bug fix release that is compatible with the CUDA Toolkit 2.3a release and Mac OS X Snow Leopard.

Thrust 1.1.0

16 May 09:14

Thrust 1.1.0 introduces fancy iterators, binary search functions, and several specialized reduction functions. Experimental support for segmented scans has also been added.

Breaking Changes

  • thrust::counting_iterator has been moved into the thrust namespace (previously thrust::experimental).

New Features

  • Algorithms:
    • thrust::copy_if
    • thrust::lower_bound
    • thrust::upper_bound
    • vectorized thrust::lower_bound
    • vectorized thrust::upper_bound
    • thrust::equal_range
    • thrust::binary_search
    • vectorized thrust::binary_search
    • thrust::all_of
    • thrust::any_of
    • thrust::none_of
    • thrust::minmax_element
    • thrust::advance
    • thrust::inclusive_segmented_scan (experimental)
    • thrust::exclusive_segmented_scan (experimental)
  • Types:
    • thrust::pair
    • thrust::tuple
    • thrust::device_malloc_allocator
  • Fancy Iterators:
    • thrust::constant_iterator
    • thrust::counting_iterator
    • thrust::transform_iterator
    • thrust::zip_iterator
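
A "vectorized" search answers many queries against one sorted range in a single call. The sketch below models that sequentially with std::lower_bound, under the assumption that the result for each query is the index of the first element not less than it. The function name is hypothetical and this is not Thrust's implementation.

```cpp
#include <algorithm>
#include <cstddef>
#include <vector>

// For every query, record the index of the first element in `sorted` that is
// not less than the query (sorted.size() if there is none).
std::vector<std::size_t> vectorized_lower_bound_ref(const std::vector<int>& sorted,
                                                    const std::vector<int>& queries) {
    std::vector<std::size_t> out;
    out.reserve(queries.size());
    for (int q : queries) {
        auto it = std::lower_bound(sorted.begin(), sorted.end(), q);
        out.push_back(static_cast<std::size_t>(it - sorted.begin()));
    }
    return out;
}
```

Searching {0, 2, 5, 7, 8} for the queries {0, 1, 2, 3, 8, 9} gives the indices {0, 1, 1, 2, 4, 5}. On a parallel backend each query is independent, which is what makes the batched form attractive.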

New Examples

  • Computing the maximum absolute difference between vectors.
  • Computing the bounding box of a two-dimensional point set.
  • Sorting multiple arrays together (lexicographical sorting).
  • Constructing a summed area table.
  • Using thrust::zip_iterator to mimic an array of structs.
  • Using thrust::constant_iterator to increment array values.

Other Enhancements

  • Added pinned memory allocator (experimental).
  • Added more methods to host_vector & device_vector (issue #4).
  • Added variant of remove_if with a stencil argument (issue #29).
  • Scan and reduce use cudaFuncGetAttributes to determine grid size.
  • Exceptions are reported when temporary device arrays cannot be allocated.

Bug Fixes

  • #5: Make vector work for larger data types
  • #9: stable_partition_copy doesn't respect OutputIterator concept semantics
  • #10: scans should return OutputIterator
  • #16: make algorithms work for larger data types
  • #27: Dispatch radix_sort even when comp=less is explicitly provided

Thrust 1.0.0

16 May 09:11

First production release of Thrust.

Breaking Changes

  • Rename top level namespace komrade to thrust.
  • Move thrust::partition_copy & thrust::stable_partition_copy into thrust::experimental namespace until we can easily provide the standard interface.
  • Rename thrust::range to thrust::sequence to avoid collision with Boost.Range.
  • Rename thrust::copy_if to thrust::copy_when due to semantic differences with C++0x copy_if.

New Features

  • Add C++0x style cbegin & cend methods to thrust::host_vector and thrust::device_vector.
  • Add thrust::transform_if function.
  • Add stencil versions of thrust::replace_if & thrust::replace_copy_if.
  • Allow counting_iterator to work with thrust::for_each.
  • Allow types with constructors in comparison thrust::sort and thrust::reduce.

Other Enhancements

  • thrust::merge_sort and thrust::stable_merge_sort are now 2x to 5x faster when executed on the parallel device.

Bug Fixes

  • Komrade 6: Work around an issue where an incremented iterator causes NVCC to crash.
  • Komrade 7: Fix an issue where const_iterators could not be passed to thrust::transform.