[NIGHTLY] v23.02.00
Pre-release
rapids-bot released this 09 Feb 16:17 · 4368 commits to branch-25.04 since this release
🚨 Breaking Changes
- Pin `dask` and `distributed` for release (#12695) @galipremsagar
- Change ways to access `ptr` in `Buffer` (#12587) @galipremsagar
- Remove column names (#12578) @vuule
- Default `cudf::io::read_json` to nested JSON parser (#12544) @vuule
- Switch `engine=cudf` to the new `JSON` reader (#12509) @galipremsagar
- Add trailing comma support for nested JSON reader (#12448) @karthikeyann
- Upgrade to `arrow-10.0.1` (#12327) @galipremsagar
- Fail loudly to avoid data corruption with unsupported input in `read_orc` (#12325) @vuule
- CSV, JSON reader to infer integer column with nulls as int64 instead of float64 (#12309) @karthikeyann
- Remove deprecated code for 23.02 (#12281) @vyasr
- Null element for parsing error in numeric types in JSON, CSV reader (#12272) @karthikeyann
- Purge non-empty nulls for `superimpose_nulls` and `push_down_nulls` (#12239) @ttnghia
- Rename `cudf::structs::detail::superimpose_parent_nulls` APIs (#12230) @ttnghia
- Remove JIT type names, refactor id_to_type. (#12158) @bdice
- Floor division uses integer division for integral arguments (#12131) @wence-
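As an illustration of why the floor-division change (#12131) matters, the following minimal sketch contrasts integer floor division with a computation that goes through a float64 intermediate. It uses only Python built-ins, not cuDF, and the helper name `floordiv_via_float` is hypothetical; nothing here is taken from the cuDF source.

```python
import math


def floordiv_via_float(a: int, b: int) -> int:
    """Hypothetical helper: floor division routed through a float64 intermediate."""
    return math.floor(a / b)


# 2**53 + 1 is the smallest positive integer that float64 cannot represent exactly.
big = 2**53 + 1

exact = big // 1                    # pure integer division keeps every bit
lossy = floordiv_via_float(big, 1)  # true division rounds the result to float64 first

print(exact, lossy, exact == lossy)
```

For large integral operands the two results differ, which is the kind of discrepancy that integer division for integral arguments avoids.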
🐛 Bug Fixes
- Fix update-version.sh (#12745) @raydouglass
- Fix a mask data corruption in UDF (#12647) @galipremsagar
- pre-commit: Update isort version to 5.12.0 (#12645) @wence-
- tests: Skip cuInit tests if cuda-gdb is not found or not working (#12644) @wence-
- Revert regex program java APIs and tests (#12639) @cindyyuanjiang
- Fix leaks in ColumnVectorTest (#12625) @jlowe
- Handle when spillable buffers own each other (#12607) @madsbk
- Fix incorrect null counts for sliced columns in JCudfSerialization (#12589) @jlowe
- lists: Transfer dtypes correctly through list.get (#12586) @wence-
- timedelta: Don't go via float intermediates for floordiv (#12585) @wence-
- Fixing BUG, `get_next_chunk()` should use the blocking function `device_read()` (#12584) @madsbk
- Make JNI QuoteStyle accessible outside ai.rapids.cudf (#12572) @mythrocks
- `partition_by_hash()`: support index (#12554) @madsbk
- Mixed Join benchmark bug due to wrong conditional column (#12553) @divyegala
- Update List Lexicographical Comparator (#12538) @divyegala
- Dynamically read PTX version (#12534) @brandon-b-miller
- build.sh switch to use `RAPIDS` magic value (#12525) @robertmaynard
- Loosen runtime arrow pinning (#12522) @vyasr
- Enable metadata transfer for complex types in transpose (#12491) @galipremsagar
- Fix issues with parquet chunked reader (#12488) @nvdbaranec
- Fix missing metadata transfer in concat for `ListColumn` (#12487) @galipremsagar
- Rename libcudf substring source files to slice (#12484) @davidwendt
- Fix compile issue with arrow 10 (#12465) @ttnghia
- Fix List offsets bug in mixed type list column in nested JSON reader (#12447) @karthikeyann
- Fix xfail incompatibilities (#12423) @vyasr
- Fix bug in Parquet column index encoding (#12404) @etseidl
- When building Arrow shared look for a shared OpenSSL (#12396) @robertmaynard
- Fix get_json_object to return empty column on empty input (#12384) @davidwendt
- Pin arrow 9 in testing dependencies to prevent conda solve issues (#12377) @vyasr
- Fix reductions any/all return value for empty input (#12374) @davidwendt
- Fix debug compile errors in parquet.hpp (#12372) @davidwendt
- Purge non-empty nulls in `cudf::make_lists_column` (#12370) @ttnghia
- Use correct memory resource in io::make_column (#12364) @vyasr
- Add code to detect possible malformed page data in parquet files. (#12360) @nvdbaranec
- Fail loudly to avoid data corruption with unsupported input in `read_orc` (#12325) @vuule
- Fix NumericPairIteratorTest for float values (#12306) @davidwendt
- Fixes memory allocation in nested JSON tokenizer (#12300) @elstehle
- Reconstruct dtypes correctly for list aggs of struct columns (#12290) @wence-
- Fix regex \A and \Z to strictly match string begin/end (#12282) @davidwendt
- Fix compile issue in `json_chunked_reader.cpp` (#12280) @ttnghia
- Change reductions any/all to return valid values for empty input (#12279) @davidwendt
- Only exclude join keys that are indices from key columns (#12271) @wence-
- Fix spill to device limit (#12252) @madsbk
- Correct behaviour of sort in `concat` for singleton concatenations (#12247) @wence-
- Purge non-empty nulls for `superimpose_nulls` and `push_down_nulls` (#12239) @ttnghia
- Patch CUB DeviceSegmentedSort and remove workaround (#12234) @davidwendt
- Fix memory leak in udf_string::assign(&&) function (#12206) @davidwendt
- Workaround thrust-copy-if limit in json get_tree_representation (#12190) @davidwendt
- Fix page size calculation in Parquet writer (#12182) @etseidl
- Add cudf::detail::sizes_to_offsets_iterator to allow checking overflow in offsets (#12180) @davidwendt
- Workaround thrust-copy-if limit in wordpiece-tokenizer (#12168) @davidwendt
- Floor division uses integer division for integral arguments (#12131) @wence-
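The regex entry above (#12282) concerns `\A` and `\Z` matching strictly at the beginning and end of a string. The sketch below shows that distinction using Python's standard `re` module purely as an analogy. libcudf has its own regex engine, so treating its strict behaviour as equivalent to Python's is an assumption here, not something the release notes state.

```python
import re

text = "abc\n"

# `$` also matches just before a trailing newline; `\Z` matches only at the true end.
dollar_match = re.search(r"abc$", text) is not None
strict_end_match = re.search(r"abc\Z", text) is not None

multi = "x\nabc"

# With re.MULTILINE, `^` matches after every newline; `\A` matches only at the true start.
caret_match = re.search(r"^abc", multi, flags=re.MULTILINE) is not None
strict_begin_match = re.search(r"\Aabc", multi, flags=re.MULTILINE) is not None

print(dollar_match, strict_end_match, caret_match, strict_begin_match)
```

In Python, the line anchors (`$`, `^`) succeed in both cases while the string anchors (`\Z`, `\A`) do not.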
📖 Documentation
- Fix link to NVTX (#12598) @sameerz
- Include missing groupby functions in documentation (#12580) @quasiben
- Fix documentation author (#12527) @bdice
- Update libcudf reduction docs for casting output types (#12526) @davidwendt
- Add JSON reader page in user guide (#12499) @GregoryKimball
- Link unsupported iteration API docstrings (#12482) @galipremsagar
- `strings_udf` doc update (#12469) @brandon-b-miller
- Update cudf_assert docs with correct NDEBUG behavior (#12464) @robertmaynard
- Update pre-commit hooks guide (#12395) @bdice
- Update test docs to not use detail comparison utilities (#12332) @PointKernel
- Fix doxygen description for regex_program::compute_working_memory_size (#12329) @davidwendt
- Add eval to docs. (#12322) @vyasr
- Turn on xfail_strict=true (#12244) @wence-
- Update 10 minutes to cuDF (#12114) @wence-
🚀 New Features
- Use kvikIO as the default IO backend (#12574) @vuule
- Use `has_nonempty_nulls` instead of `may_contain_non_empty_nulls` in `superimpose_nulls` and `push_down_nulls` (#12560) @ttnghia
- Add strings methods removeprefix and removesuffix (#12557) @davidwendt
- Add `regex_program` java APIs and unit tests (#12548) @cindyyuanjiang
- Default `cudf::io::read_json` to nested JSON parser (#12544) @vuule
- Make string quoting optional on CSV write (#12539) @mythrocks
- Use new nvCOMP API to optimize the compression temp memory size (#12533) @vuule
- Support "values" orient (array of arrays) in Nested JSON reader (#12498) @karthikeyann
- `one_hot_encode` to use experimental row comparators (#12478) @divyegala
- Support %W and %w format specifiers in cudf::strings::to_timestamps (#12475) @davidwendt
- Add JSON Writer (#12474) @karthikeyann
- Refactor `thrust_copy_if` into `cudf::detail::copy_if_safe` (#12455) @ttnghia
- Add trailing comma support for nested JSON reader (#12448) @karthikeyann
- Extract `tokenize_json.hpp` detail header from `src/io/json/nested_json.hpp` (#12432) @ttnghia
- JNI bindings to write CSV (#12425) @mythrocks
- Nested JSON depth benchmark (#12371) @karthikeyann
- Implement `lists::reverse` (#12336) @ttnghia
- Use `device_read` in experimental `read_json` (#12314) @vuule
- Implement JNI for `strings::reverse` (#12283) @ttnghia
- Null element for parsing error in numeric types in JSON, CSV reader (#12272) @karthikeyann
- Add cudf::strings:like function with multiple patterns (#12269) @davidwendt
- Add environment variable to control host memory allocation in `hostdevice_vector` (#12251) @vuule
- Add cudf::strings::reverse function (#12227) @davidwendt
- Selectively use dictionary encoding in Parquet writer (#12211) @etseidl
- Support `replace` in `strings_udf` (#12207) @brandon-b-miller
- Add support to read binary encoded decimals in parquet (#12205) @PointKernel
- Support regex EOL where the string ends with a new-line character (#12181) @davidwendt
- Updating `stream_compaction/unique` to use new row comparators (#12159) @divyegala
- Add device buffer datasource (#12024) @PointKernel
- Implement groupby apply with JIT (#11452) @bwyogatama
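The `removeprefix` and `removesuffix` entry above (#12557) names strings methods that share their names with Python's built-in `str` methods (Python 3.9+). The sketch below shows the per-element semantics of those built-ins on a plain list. That the cuDF column-level methods behave the same way element by element is an assumption, and no cuDF API is called here.

```python
words = ["prefix_value", "value_suffix", "plain"]

# Only an exact leading (or trailing) match is removed; other strings pass through unchanged.
no_prefix = [w.removeprefix("prefix_") for w in words]
no_suffix = [w.removesuffix("_suffix") for w in words]

print(no_prefix)
print(no_suffix)
```

Unlike `str.lstrip` and `str.rstrip`, which strip any run of the given characters, these methods remove the whole affix at most once.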
🛠️ Improvements
- Update shared workflow branches (#12696) @ajschmidt8
- Pin `dask` and `distributed` for release (#12695) @galipremsagar
- Don't upload `libcudf-example` to Anaconda.org (#12671) @ajschmidt8
- Pin wheel dependencies to same RAPIDS release (#12659) @sevagh
- Use CTK 118/cp310 branch of wheel workflows (#12602) @sevagh
- Change ways to access `ptr` in `Buffer` (#12587) @galipremsagar
- Version a parquet writer xfail (#12579) @galipremsagar
- Remove column names (#12578) @vuule
- Parquet reader optimization to address V100 regression. (#12577) @nvdbaranec
- Add support for `category` dtypes in CSV reader (#12571) @galipremsagar
- Remove `spill_lock` parameter from `SpillableBuffer.get_ptr()` (#12564) @madsbk
- Optimize `cudf::make_lists_column` (#12547) @ttnghia
- Remove `cudf::strings::repeat_strings_output_sizes` from Java and JNI (#12546) @ttnghia
- Test that cuInit is not called when RAPIDS_NO_INITIALIZE is set (#12545) @wence-
- Rework repeat_strings to use sizes-to-offsets utility (#12543) @davidwendt
- Replace exclusive_scan with sizes_to_offsets in cudf::lists::sequences (#12541) @davidwendt
- Rework nvtext::ngrams_tokenize to use sizes-to-offsets utility (#12540) @davidwendt
- Fix binary-ops gtests coded in namespace cudf::test (#12536) @davidwendt
- More `@acquire_spill_lock()` and `as_buffer(..., exposed=False)` (#12535) @madsbk
- Guard CUDA runtime APIs with error checking (#12531) @PointKernel
- Update TODOs from issue 10432. (#12528) @bdice
- Update rapids-cmake definitions version in GitHub Actions style checks. (#12511) @bdice
- Switch `engine=cudf` to the new `JSON` reader (#12509) @galipremsagar
- Fix SUM/MEAN aggregation type support. (#12503) @bdice
- Stop using pandas._testing (#12492) @vyasr
- Fix ROLLING_TEST gtests coded in namespace cudf::test (#12490) @davidwendt
- Fix erroneously skipped ORC ZSTD test (#12486) @vuule
- Rework nvtext::generate_character_ngrams to use make_strings_children (#12480) @davidwendt
- Raise warnings as errors in the test suite (#12468) @vyasr
- Remove `int32` hard-coding in python (#12467) @galipremsagar
- Use cudaMemcpyDefault. (#12466) @bdice
- Update workflows for nightly tests (#12462) @ajschmidt8
- Build CUDA `11.8` and Python `3.10` Packages (#12457) @ajschmidt8
- JNI build image default as cuda11.8 (#12441) @pxLi
- Re-enable `Recently Updated` Check (#12435) @ajschmidt8
- Rework remaining cudf::strings::from_xyz functions to use make_strings_children (#12434) @vuule
- Build wheels alongside conda CI (#12427) @sevagh
- Remove arguments for checking exception messages in Python (#12424) @vyasr
- Clean up cuco usage (#12421) @PointKernel
- Fix warnings in remaining modules (#12406) @vyasr
- Update `ops-bot.yaml` (#12402) @ajschmidt8
- Rework cudf::strings::integers_to_ipv4 to use make_strings_children utility (#12401) @davidwendt
- Use `numpy.empty()` instead of `bytearray` to allocate host memory for spilling (#12399) @madsbk
- Deprecate chunksize from dask_cudf.read_csv (#12394) @rjzamora
- Expose the RMM pool size in JNI (#12390) @revans2
- Fix COPYING_TEST: gtests coded in namespace cudf::test (#12387) @davidwendt
- Rework cudf::strings::url_encode to use make_strings_children utility (#12385) @davidwendt
- Use make_strings_children in parse_data nested json reader (#12382) @karthikeyann
- Fix warnings in test_datetime.py (#12381) @vyasr
- Mixed Join Benchmarks (#12375) @divyegala
- Fix warnings in dataframe.py (#12369) @vyasr
- Update conda recipes. (#12368) @bdice
- Use gpu-latest-1 runner tag (#12366) @bdice
- Rework cudf::strings::from_booleans to use make_strings_children (#12365) @vuule
- Fix warnings in test modules up to test_dataframe.py (#12355) @vyasr
- JSON column performance optimization - struct column nulls (#12354) @karthikeyann
- Accelerate stable-segmented-sort with CUB segmented sort (#12347) @davidwendt
- Add size check to make_offsets_child_column utility (#12345) @davidwendt
- Enable max compression ratio small block optimization for ZSTD (#12338) @vuule
- Fix warnings in test_monotonic.py (#12334) @vyasr
- Improve JSON column creation performance (list offsets) (#12330) @karthikeyann
- Upgrade to `arrow-10.0.1` (#12327) @galipremsagar
- Fix warnings in test_orc.py (#12326) @vyasr
- Fix warnings in test_groupby.py (#12324) @vyasr
- Fix `test_notebooks.sh` (#12323) @ajschmidt8
- Fix transform gtests coded in namespace cudf::test (#12321) @davidwendt
- Fix `check_style.sh` script (#12320) @ajschmidt8
- Rework cudf::strings::from_timestamps to use make_strings_children (#12317) @davidwendt
- Fix warnings in test_index.py (#12313) @vyasr
- Fix warnings in test_multiindex.py (#12310) @vyasr
- CSV, JSON reader to infer integer column with nulls as int64 instead of float64 (#12309) @karthikeyann
- Fix warnings in test_indexing.py (#12305) @vyasr
- Fix warnings in test_joining.py (#12304) @vyasr
- Unpin `dask` and `distributed` for development (#12302) @galipremsagar
- Re-enable `sccache` for Jenkins builds (#12297) @ajschmidt8
- Define needs for pr-builder workflow. (#12296) @bdice
- Forward merge 22.12 into 23.02 (#12294) @vyasr
- Fix warnings in test_stats.py (#12293) @vyasr
- Fix table gtests coded in namespace cudf::test (#12292) @davidwendt
- Change cython for regex calls to use cudf::strings::regex_program (#12289) @davidwendt
- Improved error reporting when reading multiple JSON files (#12285) @vuule
- Deprecate Frame.sum_of_squares (#12284) @vyasr
- Remove deprecated code for 23.02 (#12281) @vyasr
- Clean up handling of max_page_size_bytes in Parquet writer (#12277) @etseidl
- Fix replace gtests coded in namespace cudf::test (#12270) @davidwendt
- Add pandas nullable type support in `Index.to_pandas` (#12268) @galipremsagar
- Rework nvtext::detokenize to use indexalator for row indices (#12267) @davidwendt
- Fix reduction gtests coded in namespace cudf::test (#12257) @davidwendt
- Remove default parameters from cudf::detail::sort function declarations (#12254) @davidwendt
- Add `duplicated` support for `Series`, `DataFrame` and `Index` (#12246) @galipremsagar
- Replace column/table test utilities with macros (#12242) @PointKernel
- Rework cudf::strings::pad and zfill to use make_strings_children (#12238) @davidwendt
- Fix sort gtests coded in namespace cudf::test (#12237) @davidwendt
- Wrapping concat and file writes in `@acquire_spill_lock()` (#12232) @madsbk
- Rename `cudf::structs::detail::superimpose_parent_nulls` APIs (#12230) @ttnghia
- Cover parsing to decimal types in `read_json` tests (#12229) @vuule
- Spill Statistics (#12223) @madsbk
- Use CUDF_JNI_ENABLE_PROFILING to conditionally enable profiling support. (#12221) @bdice
- Clean up of `test_spilling.py` (#12220) @madsbk
- Simplify repetitive boolean logic (#12218) @vuule
- Add `Series.hasnans` and `Index.hasnans` (#12214) @galipremsagar
- Add cudf::strings:udf::replace function (#12210) @davidwendt
- Adds in new java APIs for appending byte arrays to host columnar data (#12208) @revans2
- Remove Python dependencies from Java CI. (#12193) @bdice
- Fix null order in sort-based groupby and improve groupby tests (#12191) @divyegala
- Move strings children functions from cudf/strings/detail/utilities.cuh to new header (#12185) @davidwendt
- Clean up existing JNI scalar to column code (#12173) @revans2
- Remove JIT type names, refactor id_to_type. (#12158) @bdice
- Update JNI version to 23.02.0-SNAPSHOT (#12129) @pxLi
- Minor refactor of cpp/src/io/parquet/page_data.cu (#12126) @etseidl
- Add codespell as a linter (#12097) @benfred
- Enable specifying exceptions in error macros (#12078) @vyasr
- Move `_label_encoding` from Series to Column (#12040) @shwina
- Add GitHub Actions Workflows (#12002) @ajschmidt8
- Consolidate dask-cudf `groupby_agg` calls in one place (#10835) @charlesbluca
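For readers unfamiliar with the `duplicated` operation mentioned in the list above (#12246), here is a minimal pure-Python sketch of the "mark every repeat after the first occurrence" semantics that pandas uses by default (`keep="first"`). The helper `duplicated_first` is hypothetical, runs on the CPU, and is not cuDF's implementation; that cuDF follows the same default is an assumption.

```python
from typing import Hashable, Iterable, List


def duplicated_first(values: Iterable[Hashable]) -> List[bool]:
    """Hypothetical helper: True for each value that has already been seen."""
    seen = set()
    flags = []
    for value in values:
        flags.append(value in seen)  # first occurrence -> False, repeats -> True
        seen.add(value)
    return flags


flags = duplicated_first(["a", "b", "a", "c", "b"])
print(flags)
```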