Release [NIGHTLY] v23.02.00 · rapidsai/cudf

🚨 Breaking Changes

Pin dask and distributed for release (#12695) @galipremsagar
Change ways to access ptr in Buffer (#12587) @galipremsagar
Remove column names (#12578) @vuule
Default cudf::io::read_json to nested JSON parser (#12544) @vuule
Switch engine=cudf to the new JSON reader (#12509) @galipremsagar
Add trailing comma support for nested JSON reader (#12448) @karthikeyann
Upgrade to arrow-10.0.1 (#12327) @galipremsagar
Fail loudly to avoid data corruption with unsupported input in read_orc (#12325) @vuule
CSV, JSON reader to infer integer column with nulls as int64 instead of float64 (#12309) @karthikeyann
Remove deprecated code for 23.02 (#12281) @vyasr
Null element for parsing error in numeric types in JSON, CSV reader (#12272) @karthikeyann
Purge non-empty nulls for superimpose_nulls and push_down_nulls (#12239) @ttnghia
Rename cudf::structs::detail::superimpose_parent_nulls APIs (#12230) @ttnghia
Remove JIT type names, refactor id_to_type. (#12158) @bdice
Floor division uses integer division for integral arguments (#12131) @wence-
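As a hedged illustration of the floor-division entry above (#12131): the sketch below uses plain Python, not cuDF, to show the general difference between floor division routed through a floating-point intermediate and floor division done purely in integer arithmetic. The helper names are hypothetical and this is not cuDF's implementation; it only suggests why results for large integral inputs can differ between the two approaches.

```python
import math


def floordiv_via_float(a: int, b: int) -> int:
    """Hypothetical helper: true division to a float, then floor.

    Integers above 2**53 cannot all be represented exactly as a
    64-bit float, so this path can lose precision.
    """
    return math.floor(a / b)


def floordiv_integer(a: int, b: int) -> int:
    """Hypothetical helper: floor division in integer arithmetic only."""
    return a // b


big = 2**53 + 1  # smallest positive integer a float64 cannot hold exactly

via_float = floordiv_via_float(big, 1)
via_int = floordiv_integer(big, 1)

print(via_float == big)  # False: the float intermediate rounded the value
print(via_int == big)    # True: integer arithmetic is exact
```

For small operands the two helpers agree (for example, both give -4 for -7 divided by 2), so the difference only shows up once values exceed what a float can represent exactly.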

🐛 Bug Fixes

Fix update-version.sh (#12745) @raydouglass
Fix a mask data corruption in UDF (#12647) @galipremsagar
pre-commit: Update isort version to 5.12.0 (#12645) @wence-
tests: Skip cuInit tests if cuda-gdb is not found or not working (#12644) @wence-
Revert regex program java APIs and tests (#12639) @cindyyuanjiang
Fix leaks in ColumnVectorTest (#12625) @jlowe
Handle when spillable buffers own each other (#12607) @madsbk
Fix incorrect null counts for sliced columns in JCudfSerialization (#12589) @jlowe
lists: Transfer dtypes correctly through list.get (#12586) @wence-
timedelta: Don't go via float intermediates for floordiv (#12585) @wence-
Fixing BUG, get_next_chunk() should use the blocking function device_read() (#12584) @madsbk
Make JNI QuoteStyle accessible outside ai.rapids.cudf (#12572) @mythrocks
partition_by_hash(): support index (#12554) @madsbk
Mixed Join benchmark bug due to wrong conditional column (#12553) @divyegala
Update List Lexicographical Comparator (#12538) @divyegala
Dynamically read PTX version (#12534) @brandon-b-miller
build.sh switch to use RAPIDS magic value (#12525) @robertmaynard
Loosen runtime arrow pinning (#12522) @vyasr
Enable metadata transfer for complex types in transpose (#12491) @galipremsagar
Fix issues with parquet chunked reader (#12488) @nvdbaranec
Fix missing metadata transfer in concat for ListColumn (#12487) @galipremsagar
Rename libcudf substring source files to slice (#12484) @davidwendt
Fix compile issue with arrow 10 (#12465) @ttnghia
Fix List offsets bug in mixed type list column in nested JSON reader (#12447) @karthikeyann
Fix xfail incompatibilities (#12423) @vyasr
Fix bug in Parquet column index encoding (#12404) @etseidl
When building Arrow shared look for a shared OpenSSL (#12396) @robertmaynard
Fix get_json_object to return empty column on empty input (#12384) @davidwendt
Pin arrow 9 in testing dependencies to prevent conda solve issues (#12377) @vyasr
Fix reductions any/all return value for empty input (#12374) @davidwendt
Fix debug compile errors in parquet.hpp (#12372) @davidwendt
Purge non-empty nulls in cudf::make_lists_column (#12370) @ttnghia
Use correct memory resource in io::make_column (#12364) @vyasr
Add code to detect possible malformed page data in parquet files. (#12360) @nvdbaranec
Fail loudly to avoid data corruption with unsupported input in read_orc (#12325) @vuule
Fix NumericPairIteratorTest for float values (#12306) @davidwendt
Fixes memory allocation in nested JSON tokenizer (#12300) @elstehle
Reconstruct dtypes correctly for list aggs of struct columns (#12290) @wence-
Fix regex \A and \Z to strictly match string begin/end (#12282) @davidwendt
Fix compile issue in json_chunked_reader.cpp (#12280) @ttnghia
Change reductions any/all to return valid values for empty input (#12279) @davidwendt
Only exclude join keys that are indices from key columns (#12271) @wence-
Fix spill to device limit (#12252) @madsbk
Correct behaviour of sort in concat for singleton concatenations (#12247) @wence-
Purge non-empty nulls for superimpose_nulls and push_down_nulls (#12239) @ttnghia
Patch CUB DeviceSegmentedSort and remove workaround (#12234) @davidwendt
Fix memory leak in udf_string::assign(&&) function (#12206) @davidwendt
Workaround thrust-copy-if limit in json get_tree_representation (#12190) @davidwendt
Fix page size calculation in Parquet writer (#12182) @etseidl
Add cudf::detail::sizes_to_offsets_iterator to allow checking overflow in offsets (#12180) @davidwendt
Workaround thrust-copy-if limit in wordpiece-tokenizer (#12168) @davidwendt
Floor division uses integer division for integral arguments (#12131) @wence-
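One fix in the list above (#12282) makes regex \A and \Z match strictly at the beginning and end of a string. The sketch below uses Python's standard `re` module, not cuDF, to show what strict anchoring conventionally means compared with `^` and `$`. Whether cuDF's regex engine matches Python's behavior in every detail is an assumption, not something these notes state.

```python
import re

text = "abc\n"

# `$` also matches just before a trailing newline, so this succeeds.
loose_end = re.search(r"abc$", text)

# `\Z` matches only at the very end of the string, so this fails
# because a newline follows "abc".
strict_end = re.search(r"abc\Z", text)

multi = "a\nb"

# With MULTILINE, `^` matches at the start of every line.
loose_begin = re.search(r"^b", multi, flags=re.MULTILINE)

# `\A` matches only at the very start of the string, regardless of flags.
strict_begin = re.search(r"\Ab", multi, flags=re.MULTILINE)

print(loose_end is not None, strict_end is None)
print(loose_begin is not None, strict_begin is None)
```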

📖 Documentation

Fix link to NVTX (#12598) @sameerz
Include missing groupby functions in documentation (#12580) @quasiben
Fix documentation author (#12527) @bdice
Update libcudf reduction docs for casting output types (#12526) @davidwendt
Add JSON reader page in user guide (#12499) @GregoryKimball
Link unsupported iteration API docstrings (#12482) @galipremsagar
strings_udf doc update (#12469) @brandon-b-miller
Update cudf_assert docs with correct NDEBUG behavior (#12464) @robertmaynard
Update pre-commit hooks guide (#12395) @bdice
Update test docs to not use detail comparison utilities (#12332) @PointKernel
Fix doxygen description for regex_program::compute_working_memory_size (#12329) @davidwendt
Add eval to docs. (#12322) @vyasr
Turn on xfail_strict=true (#12244) @wence-
Update 10 minutes to cuDF (#12114) @wence-

🚀 New Features

Use kvikIO as the default IO backend (#12574) @vuule
Use has_nonempty_nulls instead of may_contain_non_empty_nulls in superimpose_nulls and push_down_nulls (#12560) @ttnghia
Add strings methods removeprefix and removesuffix (#12557) @davidwendt
Add regex_program java APIs and unit tests (#12548) @cindyyuanjiang
Default cudf::io::read_json to nested JSON parser (#12544) @vuule
Make string quoting optional on CSV write (#12539) @mythrocks
Use new nvCOMP API to optimize the compression temp memory size (#12533) @vuule
Support "values" orient (array of arrays) in Nested JSON reader (#12498) @karthikeyann
one_hot_encode to use experimental row comparators (#12478) @divyegala
Support %W and %w format specifiers in cudf::strings::to_timestamps (#12475) @davidwendt
Add JSON Writer (#12474) @karthikeyann
Refactor thrust_copy_if into cudf::detail::copy_if_safe (#12455) @ttnghia
Add trailing comma support for nested JSON reader (#12448) @karthikeyann
Extract tokenize_json.hpp detail header from src/io/json/nested_json.hpp (#12432) @ttnghia
JNI bindings to write CSV (#12425) @mythrocks
Nested JSON depth benchmark (#12371) @karthikeyann
Implement lists::reverse (#12336) @ttnghia
Use device_read in experimental read_json (#12314) @vuule
Implement JNI for strings::reverse (#12283) @ttnghia
Null element for parsing error in numeric types in JSON, CSV reader (#12272) @karthikeyann
Add cudf::strings::like function with multiple patterns (#12269) @davidwendt
Add environment variable to control host memory allocation in hostdevice_vector (#12251) @vuule
Add cudf::strings::reverse function (#12227) @davidwendt
Selectively use dictionary encoding in Parquet writer (#12211) @etseidl
Support replace in strings_udf (#12207) @brandon-b-miller
Add support to read binary encoded decimals in parquet (#12205) @PointKernel
Support regex EOL where the string ends with a new-line character (#12181) @davidwendt
Updating stream_compaction/unique to use new row comparators (#12159) @divyegala
Add device buffer datasource (#12024) @PointKernel
Implement groupby apply with JIT (#11452) @bwyogatama
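Among the features listed above, removeprefix and removesuffix (#12557) share their names with Python's built-in `str` methods. Assuming the cuDF methods follow the same semantics (an assumption; these notes do not spell it out), the plain-Python sketch below shows the intended behavior and how it differs from the character-set stripping done by `lstrip`. The file names are made-up sample data.

```python
# Made-up sample data for illustration only.
names = ["test_orc.py", "test_stats.py", "conftest.py"]

# removeprefix/removesuffix drop one exact substring, and only if present.
stripped = [n.removeprefix("test_").removesuffix(".py") for n in names]

# lstrip treats its argument as a set of characters and keeps stripping,
# which can remove more than the intended prefix.
overstripped = [n.lstrip("test_") for n in names]

print(stripped)
print(overstripped)
```

Note how `lstrip` also eats the leading "st" of "stats", while `removeprefix` leaves a string untouched when the prefix is absent.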

🛠️ Improvements

Update shared workflow branches (#12696) @ajschmidt8
Pin dask and distributed for release (#12695) @galipremsagar
Don't upload libcudf-example to Anaconda.org (#12671) @ajschmidt8
Pin wheel dependencies to same RAPIDS release (#12659) @sevagh
Use CTK 118/cp310 branch of wheel workflows (#12602) @sevagh
Change ways to access ptr in Buffer (#12587) @galipremsagar
Version a parquet writer xfail (#12579) @galipremsagar
Remove column names (#12578) @vuule
Parquet reader optimization to address V100 regression. (#12577) @nvdbaranec
Add support for category dtypes in CSV reader (#12571) @galipremsagar
Remove spill_lock parameter from SpillableBuffer.get_ptr() (#12564) @madsbk
Optimize cudf::make_lists_column (#12547) @ttnghia
Remove cudf::strings::repeat_strings_output_sizes from Java and JNI (#12546) @ttnghia
Test that cuInit is not called when RAPIDS_NO_INITIALIZE is set (#12545) @wence-
Rework repeat_strings to use sizes-to-offsets utility (#12543) @davidwendt
Replace exclusive_scan with sizes_to_offsets in cudf::lists::sequences (#12541) @davidwendt
Rework nvtext::ngrams_tokenize to use sizes-to-offsets utility (#12540) @davidwendt
Fix binary-ops gtests coded in namespace cudf::test (#12536) @davidwendt
More @acquire_spill_lock() and as_buffer(..., exposed=False) (#12535) @madsbk
Guard CUDA runtime APIs with error checking (#12531) @PointKernel
Update TODOs from issue 10432. (#12528) @bdice
Update rapids-cmake definitions version in GitHub Actions style checks. (#12511) @bdice
Switch engine=cudf to the new JSON reader (#12509) @galipremsagar
Fix SUM/MEAN aggregation type support. (#12503) @bdice
Stop using pandas._testing (#12492) @vyasr
Fix ROLLING_TEST gtests coded in namespace cudf::test (#12490) @davidwendt
Fix erroneously skipped ORC ZSTD test (#12486) @vuule
Rework nvtext::generate_character_ngrams to use make_strings_children (#12480) @davidwendt
Raise warnings as errors in the test suite (#12468) @vyasr
Remove int32 hard-coding in python (#12467) @galipremsagar
Use cudaMemcpyDefault. (#12466) @bdice
Update workflows for nightly tests (#12462) @ajschmidt8
Build CUDA 11.8 and Python 3.10 Packages (#12457) @ajschmidt8
JNI build image default as cuda11.8 (#12441) @pxLi
Re-enable Recently Updated Check (#12435) @ajschmidt8
Rework remaining cudf::strings::from_xyz functions to use make_strings_children (#12434) @vuule
Build wheels alongside conda CI (#12427) @sevagh
Remove arguments for checking exception messages in Python (#12424) @vyasr
Clean up cuco usage (#12421) @PointKernel
Fix warnings in remaining modules (#12406) @vyasr
Update ops-bot.yaml (#12402) @ajschmidt8
Rework cudf::strings::integers_to_ipv4 to use make_strings_children utility (#12401) @davidwendt
Use numpy.empty() instead of bytearray to allocate host memory for spilling (#12399) @madsbk
Deprecate chunksize from dask_cudf.read_csv (#12394) @rjzamora
Expose the RMM pool size in JNI (#12390) @revans2
Fix COPYING_TEST: gtests coded in namespace cudf::test (#12387) @davidwendt
Rework cudf::strings::url_encode to use make_strings_children utility (#12385) @davidwendt
Use make_strings_children in parse_data nested json reader (#12382) @karthikeyann
Fix warnings in test_datetime.py (#12381) @vyasr
Mixed Join Benchmarks (#12375) @divyegala
Fix warnings in dataframe.py (#12369) @vyasr
Update conda recipes. (#12368) @bdice
Use gpu-latest-1 runner tag (#12366) @bdice
Rework cudf::strings::from_booleans to use make_strings_children (#12365) @vuule
Fix warnings in test modules up to test_dataframe.py (#12355) @vyasr
JSON column performance optimization - struct column nulls (#12354) @karthikeyann
Accelerate stable-segmented-sort with CUB segmented sort (#12347) @davidwendt
Add size check to make_offsets_child_column utility (#12345) @davidwendt
Enable max compression ratio small block optimization for ZSTD (#12338) @vuule
Fix warnings in test_monotonic.py (#12334) @vyasr
Improve JSON column creation performance (list offsets) (#12330) @karthikeyann
Upgrade to arrow-10.0.1 (#12327) @galipremsagar
Fix warnings in test_orc.py (#12326) @vyasr
Fix warnings in test_groupby.py (#12324) @vyasr
Fix test_notebooks.sh (#12323) @ajschmidt8
Fix transform gtests coded in namespace cudf::test (#12321) @davidwendt
Fix check_style.sh script (#12320) @ajschmidt8
Rework cudf::strings::from_timestamps to use make_strings_children (#12317) @davidwendt
Fix warnings in test_index.py (#12313) @vyasr
Fix warnings in test_multiindex.py (#12310) @vyasr
CSV, JSON reader to infer integer column with nulls as int64 instead of float64 (#12309) @karthikeyann
Fix warnings in test_indexing.py (#12305) @vyasr
Fix warnings in test_joining.py (#12304) @vyasr
Unpin dask and distributed for development (#12302) @galipremsagar
Re-enable sccache for Jenkins builds (#12297) @ajschmidt8
Define needs for pr-builder workflow. (#12296) @bdice
Forward merge 22.12 into 23.02 (#12294) @vyasr
Fix warnings in test_stats.py (#12293) @vyasr
Fix table gtests coded in namespace cudf::test (#12292) @davidwendt
Change cython for regex calls to use cudf::strings::regex_program (#12289) @davidwendt
Improved error reporting when reading multiple JSON files (#12285) @vuule
Deprecate Frame.sum_of_squares (#12284) @vyasr
Remove deprecated code for 23.02 (#12281) @vyasr
Clean up handling of max_page_size_bytes in Parquet writer (#12277) @etseidl
Fix replace gtests coded in namespace cudf::test (#12270) @davidwendt
Add pandas nullable type support in Index.to_pandas (#12268) @galipremsagar
Rework nvtext::detokenize to use indexalator for row indices (#12267) @davidwendt
Fix reduction gtests coded in namespace cudf::test (#12257) @davidwendt
Remove default parameters from cudf::detail::sort function declarations (#12254) @davidwendt
Add duplicated support for Series, DataFrame and Index (#12246) @galipremsagar
Replace column/table test utilities with macros (#12242) @PointKernel
Rework cudf::strings::pad and zfill to use make_strings_children (#12238) @davidwendt
Fix sort gtests coded in namespace cudf::test (#12237) @davidwendt
Wrapping concat and file writes in @acquire_spill_lock() (#12232) @madsbk
Rename cudf::structs::detail::superimpose_parent_nulls APIs (#12230) @ttnghia
Cover parsing to decimal types in read_json tests (#12229) @vuule
Spill Statistics (#12223) @madsbk
Use CUDF_JNI_ENABLE_PROFILING to conditionally enable profiling support. (#12221) @bdice
Clean up of test_spilling.py (#12220) @madsbk
Simplify repetitive boolean logic (#12218) @vuule
Add Series.hasnans and Index.hasnans (#12214) @galipremsagar
Add cudf::strings::udf::replace function (#12210) @davidwendt
Adds in new java APIs for appending byte arrays to host columnar data (#12208) @revans2
Remove Python dependencies from Java CI. (#12193) @bdice
Fix null order in sort-based groupby and improve groupby tests (#12191) @divyegala
Move strings children functions from cudf/strings/detail/utilities.cuh to new header (#12185) @davidwendt
Clean up existing JNI scalar to column code (#12173) @revans2
Remove JIT type names, refactor id_to_type. (#12158) @bdice
Update JNI version to 23.02.0-SNAPSHOT (#12129) @pxLi
Minor refactor of cpp/src/io/parquet/page_data.cu (#12126) @etseidl
Add codespell as a linter (#12097) @benfred
Enable specifying exceptions in error macros (#12078) @vyasr
Move _label_encoding from Series to Column (#12040) @shwina
Add GitHub Actions Workflows (#12002) @ajschmidt8
Consolidate dask-cudf groupby_agg calls in one place (#10835) @charlesbluca
