Skip to content

Latest commit

 

History

History
1906 lines (1875 loc) · 209 KB

CHANGELOG_22.02_to_22.12.md

File metadata and controls

1906 lines (1875 loc) · 209 KB

Change log

Generated on 2023-02-08

Release 22.12

Features

#7275 [FEA] Support SaveIntoDataSourceCommand for Delta Lake
#5225 [FEA] Support array_remove
#6781 [FEA] Create demo notebook on Databricks for qualification tool usage
#6782 [FEA] Create demo notebook on Databricks for profiler tool usage
#6024 [FEA] Add support for Spark 3.2.3 SNAPSHOT
#6887 [FEA] support expressions parameter in substr function
#7078 [FEA] Add shims for Spark 3.2.3
#3037 [FEA] Support ZSTD compression with Parquet and Orc
#6916 [FEA] Support Coalesce on map column(s)
#6902 [FEA] Add shims for Spark 3.3.2
#6896 [FEA] Support Apache Spark 3.3.1
#6884 [FEA] Support instr
#6313 [FEA] Support mapInArrow introduced by pyspark 3.3.0+
#6064 [FEA] Qualification tool support parsing expressions (part 2)
#6645 [FEA] Qualification Tool: Print timestamp related functions.

Performance

#6794 Investigate other compression codecs and other serializers.
#6528 [FEA] Identify additional opportunities for using tiered projections
#6430 [FEA] look into using the new CUDF like operator
#7020 Fallback to CPU for Delta lake delta_log parquet checkpoint files
#6254 [FEA] Support z-ordering acceleration
#6524 [FEA] Improve tiered project by eliminating eclipsed columns in each tier
#6130 [FEA] More efficient bound check for GpuCast

Bugs Fixed

#6455 [BUG] Rapids tools test on Databricks fail
#6890 [BUG] RUN_DIR change fail some CI pipelines
#7085 [BUG] GPU Hive Text reader fails to read floating point input as integral types
#7271 [BUG] failed to build in Databricks runtime due to alluxio utils
#6636 [BUG] casting to string and list, and concat can cause overflow issues
#7234 [BUG] Integration test script failed on: '/tmp/20221204/python/lib': No such file or directory
#7198 [BUG] RapidsShuffleManager fails to unregister UCX-mode shuffle
#7168 [BUG] mismatch cpu and gpu result in test_aqe_join_reused_exchange_inequality_condition failed
#7066 [SPARK-39432][BUG] The test test_array_element_at_zero_index_fail fails on Spark 3.4
#7179 [BUG] Executors killed for out of memory with multithreaded RapidsShuffleManager
#7054 [BUG] Some tests in the AdaptiveQueryExecSuite fail on Spark 340
#7037 [BUG] AQE on Databricks failed the query with error "UnsupportedOperationException: ColumnarToRow does not implement doExecuteBroadcast"
#7150 [BUG] Spark 3.4 build fails
#7092 [BUG] java gateway crashed due to hash_aggregate_test case intermittently
#7140 [BUG] failed to echo PROJECT_VERSION in premerge CI
#7111 [BUG] Multithreaded shuffle keeps files around after RDDs are GCed
#7059 [BUG] Qualification - Incorrect parsing of conditional expressions
#6983 [BUG] query95 @ 30TB negative allocation from BaseHashJoinIterator.countGroups with default 200 partitions
#7036 [BUG] 30TB query95 fails on the join with illegal memory access with 200 partitions
#7065 [SPARK-38976][SPARK-40066][BUG] Some tests in the array_test.py fail on Spark 3.4 because the conf strictIndexOperator has been removed
#7044 [BUG] Qualification tool skips applications due to failure in expression parsing
#7026 [BUG] AnsiCastOpSuite 340 failures
#7039 [BUG] nz timestamp (MILLIS AND MICROS) fails on Spark 3.4
#7033 [BUG] GPU and CPU substring output different rows when pos + len < 0 && len >= 0
#7041 [BUG] regexp_test and many other test failures
#6425 [BUG] Host column leak detected in ParquetCachedBatchSerializer tests
#6906 [FEAT] Add tests for parquet reader code for all possible types
#6963 [BUG] Dynamic partition writer prevents GPU memory from being freed during write
#7014 [BUG] The unit test avg literals bools fail fails in Spark 340
#7003 [BUG] Alluxio config pathsToReplace does not overwrite automount config.
#6779 [BUG] Always read old data from alluxio regardless of S3 changes when using CONVERT_TIME replacement algorithm
#7010 [BUG] Parquet multi-threaded reader bufferTime is wrong
#6949 [BUG] Negative allocation error while stress testing with NDSv2 Query 9
#6995 [BUG] HostToGpuCoalesceIterator can sometimes close input batches
#4884 [BUG] Split by regular expressions with ? and * repetition are not consistent with Spark
#6452 [BUG] GPU writes more records than maxRecordsPerFile limit while CPU performs well
#6951 [BUG] cast_test.py::test_cast_float_to_timestamp_ansi_for_nan_inf failed in spark 3.3.0+
#6880 [BUG] Regular expressions should support escaped forward slash \/ (and any other "invalid" escape chars)
#6537 [BUG] per-sql unit-tests need to be added to the test generator
#6933 [BUG] Tools run with filter arguments should handle corrupted log that doesn't have SparkListenerApplicationStart event
#3143 [BUG] DPP is not working in Databricks env
#6895 [BUG] Profile tool fails in getMaxTaskInputSizeBytes
#6871 [BUG] Parquet reader - Found no metadata for schema index
#6883 [BUG] integration test fail in CDH env due us trying to change permissions on /tmp/hive
#6752 [BUG] StringOperatorsSuite failed when building with JDK17
#6671 [Audit][BUG] Handle updated messageParameters for any thrown Spark exceptions in Spark 3.4.x
#6865 [BUG] parquet_write_test is failing when reading on the CPU parquet that was written on the GPU
#6856 [BUG] Can not switch Alluxio auto-mount option on the fly
#6869 [BUG] Building databricks failed
#6848 [BUG] github workflow actions use deprecated API "to be removed soon"
#6825 [BUG] pytests should configure hive.scratch.dir under RUN_DIR
#6818 [BUG] RapidsShuffleThreadedReader is not found when building the plugin with Spark 340
#6718 [BUG] test_iceberg_parquet_read_round_trip FAILED "TypeError: object of type 'NoneType' has no len()"
#6762 [BUG] The concurrent writer throws a class casting error when enabling AQE.
#6146 [BUG] intermittent orc test_read_round_trip failed due to /tmp/hive location
#2654 [BUG] --help at the end does not print out help for tools

PRs

#7337 Update 22.12 changelog to latest [skip ci]
#7316 Update jni version 22.12.0
#7237 [Doc]update download docs for v22.12 release[skip ci]
#7330 xfail all delta-write fallback cases [skip ci]
#7288 Add support for SaveIntoDataSource for Delta Lake 2.x
#7306 Cherry pick #7293 to 22.12 [skip ci]
#7270 Update 22.12 changelog to latest [skip ci]
#7264 Update columnar stats tracker API to pass file path for new batches
#7273 Fix AlluxioUtilsSuite build on Databricks for 22.12
#7250 Change tools hadoop version to 3.3.4
#7172 Add a document for how to view Alluxio metrics on UI [skip ci]
#7238 Add branch-specific premerge jenkinsfile
#7243 [Doc]fix broken links[skip ci]
#7155 Add unit tests for alluxio utils
#7080 [Doc] Document Alluxio does not sync metadata from S3 by default [skip ci]
#7235 Create tmp path to make python path explicit [skip ci]
#7084 [Doc]Update databricks doc for 22.12[skip ci]
#7166 Sync up spark2 explain code
#6903 Support projectV2 for changelog tooling [skip ci]
#7203 [Doc]add a Contact Us page at the top-level menu[skip ci]
#7174 Fix dependencies in jenkins-test script to support DB11.3
#7034 Read directly from S3 instead of reading from Alluxio caches if files are large and disk is slow
#7199 Fixes unregisterShuffle bugs in the driver and a missed match for the GpuResolver
#7156 Add scripts to run integration test on Databricks by leveraging Jenkins parallelism [skip ci]
#7195 Fix non-deterministic query in test_aqe_join_reused_exchange_inequality_condition
#7176 Copying common ThreadFactoryBuilder to tools to remove dependency
#7188 Remove "SNAPSHOT" for 323 shim
#7189 specify shim versions to build [skip ci]
#7165 Try/catch cudf file scan exceptions and re-throw with file metadata in message
#7164 Search for CudaFatalException in causes of failureReason in function onTaskFailed
#7169 Remove snapshot shims build in premerge script
#7180 multithreaded RapidsShuffleManager change when we release memory
#7115 Support array_remove operator
#7171 Add tests for 331 and 332
#7099 Update AQE tests to support Spark 3.4
#7110 Add GpuBroadcastToRowExec to handle columnar broadcast in cpu broadcast join with AQE enabled
#7153 Add SchemaUtilsShims
#7142 Restore hash aggregate tests after cub segmented sort fix
#7141 Get PROJECT_VERSION from version-def.sh [skip ci]
#7123 Reduce the duplication of RegExpShim and getFileScanRDD
#7145 Remove inaccurate warnings about fallbacks when using multithreaded shuffle
#7135 Revert "Suffix artifactId with amd64/arm64 for the dist jars [skip ci]] (#7070)(#7120)"
#7103 Add support to DB 11.3 ML LTS in databricks build script
#7125 Add missing cleanup of shuffle data when using multi-threaded shuffle
#6934 Add support for chunked parquet reading
#7120 Build noSnapshots without cdh shims on arm CPU [skip ci]
#7013 Hive delimited textfile read support
#7077 Add shims for Spark 3.2.3
#7070 Suffix artifactId with amd64/arm64 for the dist jars [skip ci]
#7088 Fix ConditionalExpr parser in Qualification tool
#7097 Use Databricks instance Spark version as default
#7107 Skip test_hash_groupby_collect_with_single_distinct [skip ci]
#7102 Skip test_hash_groupby_collect_partial_replace_with_distinct_fallback for #7092
#7051 Support non literal position and length for substring
#7067 Update the tests in array_test.py to adapt the removal of strictIndexOperator in Spark 3.4
#7071 Exception in SQLParser should not cause Qualification tool to skip app
#7025 Spark-3.4 - Fix cast unit tests
#7045 Fix parquet test for nztimestamp on spark 3.4.0
#7048 Enable tiered projections for GpuProjectExec
#7055 [Doc]update a typo for iceberg readme[skip ci]
#7049 add parenthesis around delta_log check to short circuit
#7052 Enable automerge 22.12 to 23.02 [skip ci]
#7040 Fix a substring issue for a corner case
#6960 Use cudf like operator in GpuLike operator
#7027 Include unit tests and integration tests in mvn-verify-check
#7022 Fallback to CPU when reading Delta delta_log parquet checkpoint files
#7031 Add skip test options [skip ci]
#7002 ParquetCachedBatchSerializer: Close the hostBatch in ColumnBatchToCachedBatchIterator when the iterator has exhausted
#6914 Add in tests to verify corner cases in parquet
#7016 Parse out positive and negative lookahead explicitly to fallback to GPU
#6999 Enable snapshot builds as optional PR checks
#6977 Close the batch in the writeBatch function of GpuDynamicPartitionDataSingleWriter
#7015 Fix the test failure of avg literals bools fail on Spark 3.4.0
#6362 [FEA] Add support for using nvcomp ZSTD compression
#7004 Alluxio pathsToReplace should has higher priority
#6806 Fix read old data from alluxio regardless of S3 changes when using CONVERT_TIME replacement algorithm
#7006 Revert "Fix a minor potential issue when rebatching for GpuArrowEvalP…
#7011 Fix buffertime for multi-threaded reader
#6950 Throw when onAllocFailure is invoked with invalid arguments
#7009 Work around column vectors reporting incorrect data type
#6996 Fix HostToGpuCoalesceIterator sometimes closing input batches
#6998 Make shim revision check opt-out
#6976 Update the docs of write and writebatch of ColumnOutputWriter
#7000 Spark-3.4: Update DecimalArithmeticOverrides to object
#6937 Removed PromotePrecision for Spark 3.4
#6959 Allow *, ?, and {0,...} variants in StringSplit in non-empty match situations
#6972 Add regular expression support for \d inside character classes on the GPU
#6922 Fix CastBase issues not related to PromotePrecision and CheckOverflow
#6966 Extract pre/post projections from columnar transitions
#6974 Add doc for mapInArrow [skip ci]
#6931 mergeSort late batch materialization and free already merged batches eagerly
#6971 Spark-3.4 : Fix build error in DataSourceV2ScanExec
#6901 Add JDK11 to mvn-verify-check
#6801 Enable the config MaxRecordsPerFile on the GpuDynamicDirectoryConcurrentWriter
#6952 Fix the failOnError not found error when building Spark 3.4.0
#6962 Stop using deprecated JDK API javax.xml.bind
#6957 Fix leak in GpuBroadcastExchangeExec
#6924 Shim for shaded protobuf orc-core
#6943 Mechanism to reduce redundancy in Maven profiles for shims
#6956 Throw SparkDateTimeException for invalid cast in Spark3.3+ versions
#6948 Pass through escaped punctuation in Regular Expression Transpiler
#6953 Remove unsupported format when converting dates/timestamps to strings [skip ci]
#6944 Update to a valid cuda docker image for k8s run [skip ci]
#6938 Spark-3.4 - Fix build errors in DataSourceStrategy and SparkDateTimeException
#6925 Only warn when hive scratch creation fails
#6923 [BUG] Fix qualification-test-result generators and update csv files
#6939 Support Coalesce on map column
#6936 Fixing exception when appStartInfo isn't available due to incomplete event log
#6824 Use alluxio Java API to mount instead of cmd
#6918 Added shim for Spark 3.3.2
#6919 Enable DPP and DPP+AQE on
#6920 Support Spark 3.3.1
#6905 Fix Spark 340 build error related to checkForNumericExpr
#6899 Add ApplicationSummaryInfo wrapper to allow mock tests
#6910 [FEA] Support string Instr function
#6913 [BUG] GpuPartitioning should close CVs before releasing semaphore
#6833 Flatten simple 4+ nesting of withResource
#6757 Add startupOnly tag to configs
#6893 Add different codepoint for unicode 13.0
#6892 Fix the Spark340 build error related to mapKeyNotExistError
#6897 Avoid coalescing files with mismatched schemas
#6889 Create target folder before attempting to add unique RUN_DIR
#6891 Remove invalid members from allow list [skip ci]
#6827 Follow on from recent regexp fixes to reject patterns that cuDF no longer rejects
#6876 Fix Spark 3.4 build issues
#6866 Use a unique run directory for each run when testing in run_pyspark_from_build
#6877 Plugin fixes after cuDF removed INT8 for binary columns in parquet writer
#6873 Add in support for zorder operators on databricks
#6857 Fix bug that can not switch Alluxio auto-mount option on the fly
#6860 Adjust to cudf removal of checks in scatter and repeat
#6823 Support columnar processing for mapInArrow
#6813 Move _databricks_internal check to shim layer
#6796 Qualification tool: Parse expressions in Join execs
#6861 Add check for is_spark_330cdh and update orc test to skip zstd for cdh
#6849 Cuda.deviceSynchronize as a last resort if we cannot spill enough
#6859 Reduce memory usage in aggregate.scala
#6870 Update the db hadoop jars version to 0007 for 10.4
#6867 Temporarily disable the failing tests of parquet writing.
#6855 Add the FileIndexOptions shims for Spark340
#6847 Fix integration builds failing with current directory not found
#6854 Fix setup-java step of blossom-ci [skip ci]
#6852 Fix deprecated Github actions API [skip ci]
#6700 Support zorder for deltalake and improve perf of range partitioning
#6826 Place hive scratch files under pytest $RUN_DIR
#6819 Move the RapidsShuffleThreadedReader from 330~340 shims to 330+ shims
#6810 Dump stack traces for tasks with the semaphore held when OOM goes unhandled
#6815 Update castPartValue function to fix ClassCastException
#6766 Adding timestamp functions into potential problems for qual tool
#6809 Relocate Scala files placed in the java/ directory
#6804 Fix auto merge conflict 6802 [skip ci]
#6751 Support columnar processing for FlatMapCoGroupInPandas
#6783 Revert "Temporarily xfail failing test_iceberg_parquet_read_round_trip test"
#6780 Fix auto merge conflict 6776
#6763 Fix a class casting error in concurrent writer when enabling AQE
#6760 Clean run directory before running tests in run_pyspark_from_build
#6716 Improve tiered project by eliminating eclipsed columns in each tier
#6764 Add supervisor(like systemd stuff) to auto restart Alluxio processes … [skip ci]
#6726 Provision hive scratch dir before test execution
#6730 Fix an unchecked conversion warning
#6756 Temporarily xfail failing test_iceberg_parquet_read_round_trip test
#6743 Add spark-rapids pulls to GitHub project [skip ci]
#6681 Fixes for more efficient bound checks for GpuCast
#6742 Rework for adding event log info for profiler output
#6717 Qualification tool: Parse expressions in Expand, Generate and TakeOrderedAndProject Execs
#6741 Reverse normalizing nan in the GpuSortArray
#6728 Disable maven-compiler-plugin
#6644 Simplify how we transpile negated character classes and add more tests
#6706 Adding new profiler output to map app with event log path
#6704 Removing --help tools tests that trigger System.exit()
#6675 Adding error handling to print help out when at end of command
#6667 Retain all heap dumps per JVM lifecycle
#6583 Update the GpuSingleDirectoryDataWriter and GpuDynamicDirectorySingleDataWriter to split ColumnarBatch when writing to match the maxRecordsPerFile
#6649 Update CUDF_VER to 22.12 for CI
#6613 Update project version to 22.12.0-SNAPSHOT

Release 22.10

Features

#6323 [FEA] AutoTuner Profiling Tool
#6544 [FEA] Update spark2 explain api code for 22.10
#6322 [FEA] Integrate AutoTuner into DataProc Rapids environment
#6401 [FEA] Support cast string to decimal(38,2)
#6170 [FEA] Qualification tool support plugin for running application
#6067 [FEA] Qualification Tool: For Databricks eventlog capture more information in output csv file
#6632 [FEA] Profiling tool: Suggest parameters to tune
#5305 [FEA] Qualification tool: Operator mapping, check if execs/expressions off by default
#5589 [FEA] GpuGlobalLimitExec and GpuCollectLimitExec support offset
#6264 [FEA] Qualification tool print unsupported execs and expressions
#5409 [FEA] Binary Data Write support for Parquet
#6400 [FEA] Windowing with decimal in orderBy.
#6529 [FEA] Update qualification speedup factors for CSP environments
#5096 [FEA] Support GroupBy Array[INT]
#6496 Allow filtering blocks to be done multithreaded in the parquet coalescing reader
#6392 [FEA] Support OptimizedCreateHiveTableAsSelectCommand (Hive CTAS with parquet)
#6395 [FEA] Remove the hasNans config from GpuCollectSet
#5416 [FEA] Support reading binary data types from Parquet as binary (not strings)
#4656 [FEA] Support Group-By on Array[String]
#5942 [FEA] Support multithreaded and coalescing read strategies for Apache Iceberg
#3974 [FEA] Fully implement multiply and divide for decimal128
#6164 [FEA] Add Nan handling in the GpuMin
#6142 [FEA] GpuAverage cannot guarantee proper overflow checks for a precision large than 23
#6144 [FEA] Support FromUTCTimestamp
#5559 [FEA] Add GpuMapConcat support for nested (array, struct, map) types.
#6143 [FEA] Avoid CPU fallback due to intermediate precision overflow when handling decimal
#4061 [FEA] Validate the size/complexity of regular expressions
#6145 [FEA] Avoid CPU fallback due to date_format:Failed to convert Unsupported word: SSS null.
#6300 [FEA] Profiling Tool supports recommendations for tuning
#6267 [FEA] Support ShuffleExchangeExec with BinaryType as input and output

Performance

#6708 [BUG] Regression in NDSv2 of 4% because of spillable broadcast
#5999 [FEA] [improvement] Investigate DynamicPartitionDataConcurrentWriter to avoid full sort when writing partitioned data
#6061 [FEA] PoC shuffle read/decompress performance
#4713 [FEA] Running window optimization for percent rank
#5085 Could we evaluate once the child expressions of GpuExtractChunk32
#6209 revisit locality wait = 0 setting
#5320 [FEA] fix issues so we can remove hasNans config
#6219 [FEA] Do not read the real data when readDataSchema is empty in Avro multi-threaded reading.

Bugs Fixed

#6727 [BUG] On SPARK-3.2.1 : java.lang.ClassCastException
#6748 [BUG] Casting strings CudfException: strings column has no children
#6614 [BUG] test_iceberg_read_parquet_compression_codec CPU and GPU output mismatched in PASCAL GPU
#6723 [BUG] null pointer exception selecting single column from iceberg table
#6693 [BUG] test_cast_string_to_negative_scale_decimal failed in nightly
#6692 [BUG] compile error deprecated method w/ jdk11
#6431 [BUG] Like does not work how we would like it to.
#6659 [BUG] Potential memory leaks in regexp_extract on the GPU
#6515 [BUG] RapidsShuffleThreadedWriterSuite failed to delete itermitent failure
#6621 [BUG] setting multi-threaded writer threads to 0 leads to divide-by-zero exception
#6508 [BUG] delta lake deletes/updates on Databricks can fail when using alluxio
#6637 [BUG] Qualification tool application time calculation can count stages twice if in separate sql queries
#6578 [BUG] Autotuner does not load worker-info from remote storage
#6592 [BUG] Delta Lake Deletes on Databricks broken with PERFILE parquet reader
#6593 [BUG] Avro tests using packages feature needs to enable snapshot repositories
#6539 Delta Lake and AQE on Databricks 10.4 workaround
#3328 [BUG] Segfault when partitioning empty batch
#6572 [BUG] UCX smoke tests can fail with OOM when initializing UCX
#6312 [BUG] Timestamp from GPU ORC reading is different from CPU ORC reading
#6270 [BUG] UPDATE on a Databricks (10.4) DELTA table leads to JVM crash
#6404 [BUG] DMLC XGBoost train FAILED against rapids-4-spark 22.10.0-SNAPSHOT FAILED
#6531 [BUG] window function of window function queries fail on Databricks 10.4
#6559 [BUG]EmptyHashedRelation$ cannot be cast to org.apache.spark.sql.rapids.execution.SerializeConcatHostBuffersDeserializeBatch
#6501 [BUG]cgroup directory permission get reverted on reboot
#6558 [BUG] orc_write_test.py::test_write_ cases failed
#6519 [BUG] Windowing skew caused GPU run OOM
#135 [BUG] mergeSchema on ORC reads does not work
#6302 [BUG] spark.sql.parquet.outputTimestampType is not considered during read/write parquet for nested types containing timestamp
#1059 [BUG] adaptive query executor and delta optimized table writes don't work on databricks
#6416 [BUG] Example Jupyter notebook fails to parse and contains errors
#5657 [BUG] Documented deployment of spark-avro is not tested
#6520 [BUG] NoClassDefFoundError: com/nvidia/spark/rapids/shims/PlanShims in UCX tests
#6397 [BUG] GpuBringBackToHost doExecute needs columnar conversion
#6460 [BUG] test_hash_grpby_sum_full_decimal fails
#6465 [BUG] orc_cast_test fails on CDH
#6478 [BUG] test_cast_float_to_timestamp_side_effect intermittently fails
#6372 [BUG] Decimal average excessively checks for overflow
#6467 [BUG] Fix DOP calculations for xdist
#6428 [BUG] IntervalDivisionSuite has memory leak
#6438 [BUG] GpuSortArray doesn't match the behavior of Spark when handling Nans
#6442 [BUG] java.lang.ClassNotFoundException: org.apache.spark.sql.rapids.execution.SerializeConcatHostBuffersDeserializeBatch
#6417 [BUG] CDH integration tests ClassNotFoundException: com.nvidia.spark.rapids.spark321cdh.RapidsShuffleManager
#6471 [BUG] Encrypted Parquet writes are not falling back if configs are set in configuration
#6433 [BUG] dist module "install" should install reduced pom
#6240 [BUG] shuffle file can not be deleted correctly when use RapidsShuffleManager.
#6446 [BUG] test_casting_from_integer[timestamp] fails on databricks321
#6426 [BUG] GpuShuffledHashJoinExecSuite has leaks
#6447 [BUG] Python UDF triggered java.lang.NullPointerException
#6406 [BUG] integration tests arithmetic_ops_test.test_day_time_interval_multiply_number failing
#6340 [BUG] test_hash_grpby_sum_full_decimal can fail with negative numbers
#6368 [BUG] It's confusing that BASE_SPARK_VERSION in jenkins/databricks/build.sh, but BASE_SPARK_VER in databricks/test.sh
#6351 [BUG] Implement escape characters for spark property encoding in PYSP_TEST env variables
#6284 [BUG] date_format cannot output with subsecond
#6341 [BUG] test_decimal_multiplication_mixed_no_overflow_guarantees fails for some negative values
#6303 [BUG] Coalescing readers don't include filterblock time in scan time metric
#6363 [BUG] missing zip utility on CI
#6073 [SPARK-39806][SQL] Accessing _metadata on partitioned table can crash a query
#6330 [BUG] withPsNote on ArrayMin does not appear in generated docs
#6332 [BUG] array_min does not fall back to CPU when hasNan = true
#6352 [BUG] Reading Binary Type in Iceberg table fallback to CPU
#6347 [BUG] test_delta_metadata_query_fallback failed in spark32X
#6359 [BUG] test_from_json_map failed
#5619 [BUG] Mixing parquet input files with different schemas results in crashes
#6344 [BUG] Iceberg tests fail due to duplication of spark.jarc conf via PYSP_TST and on the command line
#3851 [BUG] ShimLoader.updateSparkClassLoader fails with openjdk Java11
#5714 [BUG] discrepancy in the plugin jar deployment in run_pyspark_from_build.sh depending on TEST_PARALLEL
#6294 [BUG] Incorrect result when casting timestamp to string
#6165 [BUG] AnsiCastOpSuite fail in spark331 shim
#6308 [BUG] Integration tests failing on Spark 3.2 due to BinaryType
#6243 [BUG] AST fuzz test regexp find, replace fail
#6236 [BUG] integration tests corrupt executorEnv names containing underscore
#5706 [BUG] buildall --generate-bloop creates projects that Metals/Bloop does not recognize in VS code

PRs

#6907 [Doc]a hot fix for download links versions[skip ci]
#6803 Updated 22.10 changelog to latest [skip ci]
#6799 Update JNI version to released 22.10.0
#6755 [doc] Add diagnostic tool section to GCP Dataproc getting started page [skip ci]
#6734 Init 22.10 changelog [skip ci]
#6770 Revert "Docker container for ease of deployment to Databricks [skip ci]"
#6754 [Doc] update getting started guide for emr 6.8.0 release[skip ci]
#6772 [Doc]remove group on array in 22.10, target in 22.12[skip ci]
#6767 Avoid any issues with scalar values returned by evalColumnar
#6765 [DOC] Add gcp dataproc gpu limit [skip ci]
#6703 Docker container for ease of deployment to Databricks [skip ci]
#6750 Enabling decimal 38,2 casting
#6729 Fix NullPointerException in iceberg schema parsing code when selecting single column
#6724 Qualification tool: Read SQL function names for parsing expressions
#6695 [Doc] Adding Dataproc quick start steps to use new user tools package [skip ci]
#6719 Document that we test on JDK8 and JDK11, other versions are untested [skip ci]
#6721 Fix a couple of markdown links that are now permanently moved [skip ci]
#6701 Add AutoTuner documentation [skip ci]
#6709 Take semaphore after first stream batch is materialized (broadcast)
#6697 Fix AutoTuner yaml error handling and discovery script rounding
#6705 Suppress warning for jdk11 Finalize method deprecation
#6691 Fix validity checks for large decimal window bounds
#6670 [Doc]Add 22.10 download page[skip ci]
#6690 Update spark2 code for Revert "Add support for arrays in hashaggregate"
#6689 Fix the maxPartitionBytes recommendation by AutoTuner to use the max task input bytes
#6652 Revise AutoTuner to match the BootStrap tool
#6616 String to decimal casting custom kernel
#6679 Revert "Add support for arrays in hashaggregate (#6066)"
#6631 Fixes split estimation in explode/explode_outer
#6604 Make broadcast tables spillable
#6666 Fix resource leaks in regexp_extract_all
#6657 Add Qualification tool support for running application - per sql output
#6662 update spark2 code
#6643 Workflow to add new issues to Github global project [skip ci]
#6648 Update iceberg doc for split size options [docs]
#6640 Avoid failing test on cleanup when filesystem has issues
#6641 Fix case where number of shuffle writer threads is set to 0
#6638 Qualification tool: Print cluster usage tags to csv and log file
#6651 Changing toList to toIterator to improve memory optimization and runt…
#6601 delta lake deletes/updates on Databricks fail when using alluxio
#6642 Qualification tool application time calculation can count stages twice if in separate sql queries
#6606 Print nvidia-smi output when a task fails due to a cuda fatal exception.
#6630 Allow AutoTuner to accept remore path for WorkerInfo
#6627 Move spark331 back to list of snapshot shims
#6617 Fix a Delta Lake Deletes issue
#6612 Tolerate event log folder existence when to create it to avoid raisin…
#6610 Disable 22.10 snapshot builds
#6607 Enable tests that were missed when binary support was extended
#6584 Fix spark2-sql-plugin
#6506 Add alluxio reliability doc
#6609 Enable automerge from 22.10 to 22.12 [skip ci]
#6569 Add dynamic partition concurrent writer to avoid full sort
#6602 Fix version-def script to correctly set list of shims
#6599 Add in support for casting binary to string
#6432 [Doc]Add archived release page[skip ci]
#6594 Add Apache snapshot repository when running Avro tests
#6574 Add shim layer for Cloudera CDS 3.3
#6412 Qualification tool: Print unsupported Execs and expressions
#6590 Parallelize tests using spark packages feature
#6589 Update doc to indicate ORC and Parquet zstd read support [skip ci]
#6437 Use dist/pom file as source of truth for spark versions
#6587 Delta Lake and AQE on Databricks 10.4 workaround
#6573 Update UCX to 1.13.1 in CI and sets UCX_TLS=^posix
#6586 Adds link to spark supporting shuffle classes and fix copyright
#6545 Allow ORC tests to run with wider range of timestamp input
#6511 Multi-threaded shuffle reader for RapidsShuffleManager
#6576 Bump snakeyaml version to 1.32
#6577 Work around multiprocess issues with updating Ivy cache
#6579 Disable UCX smoke test temporarily
#6564 Fix the check of empty batches for partitioning
#6534 Add GpuColumnVectorUtils to access GpuColumnVector
#6575 Fix maxPartitionBytes bounds checking in AutoTuner
#6553 Update handling for projectList based WindowExecs to handle window function of window function
#6562 Handle EmptyRelation in GpuSubqueryBroadcastExec
#6504 [DOC] Add notes for cgroup permission reverted[skip ci]
#6554 Support Decimal ordering column for RANGE window functions
#6550 Allow percent_rank to not need an entire group in memory
#6557 Mitigate non-test failure and remove 21.xx premerge support
#6566 Fix map gen for orc_write_test.py
#6563 Add missing closing ``` for a code block [skip ci]
#6512 Remove the hasNans config and update the doc
#6542 [Doc]Doc update for databricks single node cluster[skip ci]
#6555 Document a safe unshimming algorithm [skip ci]
#6549 Update SnakeYaml version for bug fixes
#6523 ORC reading supports mergeSchema
#6522 Nightly spark-tests script to follow PYSP_TEST pattern [skip ci]
#6548 Fixes for recent cuDF regexp changes
#6541 Add another alluxio path replacement algorithm
#6547 Append new authorized user to blossom-ci whitelist [skip ci]
#6429 Fix up buffer time for multi-file readers
#6473 Fix parquet write when the input column is nested type containing timestamp
#6461 Enabling AQE on
#6436 Switch to gpu string to integer casts
#6538 Updating qual tool speedup factors from latest CSP benchmarks
#6421 Fix notebook and getting started examples [skip ci]
#6505 Include avro test by using '--packages' option [skip ci]
#6525 Fix typo in file name
#6527 Use ShimLoader to access PlanShims
#6466 Use tiered projections for hash aggregates
#6510 Revert "Added in very specific support for from_json to a Map<String,String> (#6211)"
#6319 Support float/double castings for ORC reading
#6498 Allow filtering blocks to be done multithreaded in the Parquet coalescing reader
#6507 Perform columnar-to-row transition in GpuBringBackToHost.doExecute
#6491 [DOC] Change recommend setting of spark.locality.wait to 3s [skip ci]
#6476 Add GPU acceleration for OptimizedCreateHiveTableAsSelect
#6499 Fix non-deterministic overflows in test_hash_grpby_sum_full_decimal
#6490 Fix: orc_cast_test fails on CDH
#6486 Remove the hasNans config from GpuCollectSet
#6484 Fixes excessive ShuffleBlockId object creation due to missing map index bounds
#6492 Fix intermittent failure on test_cast_float_to_timestamp_side_effect
#6483 Fix DOP calculation for xdist
#6479 Remove KnownFloatingPointNormalized from allow_non_gpu
#6482 Fix leak in interval divide
#6451 Normalize nans in GpuSortArray
#6066 Add support for arrays in hashaggregate
#6475 Change GpuKryoRegistrator to load the classes we want to register with the ShimLoader
#6472 Check more places for Parquet encryption configs
#6468 Use non-capture groups in LIKE regexp pattern
#6434 Install reduced pom for dist module
#6462 Increase stability of pytest run with PVC storage
#6454 Support bool/int8/16/32/64 castings for ORC reading
#6422 Iceberg supports coalescing reading for Parquet
#6450 Add new github ID to blossom-ci allow list [skip ci]
#6458 Change some Alluxio log messages to be debug
#6457 Reading delta log Table Checkpoint files should fallback the entire plan
#6439 Fix leaks in GpuShuffledHashJoinExecSuite
#6251 Add Nan handling in the GpuMin
#6449 Remove caching of needles in GpuInSet
#6414 Add support for full 128-bit decimal divide
#6448 Revert patch that caused failing test on databricks 321
#6441 Skip decimal gens that overflow on Spark 3.3.0+
#6273 Support bool/int8/int16/int32/int64 castings for ORC reading.
#6370 Support simple pass-through for FromUTCTimestamp
#6290 Add GpuMapConcat support for nested type keys.
#6405 Support more timestamp format when casting string to timestamp
#6418 Fix tests for DateTimeInterval that were overflowing on CPU
#6410 Fix handling of older array encodings in Parquet
#6398 Fix DecimalGen to generate full range and fix failing test cases
#6396 Make the variable "BASE_SPARK_VERSION" consistent
#6409 Fix test_dpp_from_swizzled_hash_keys on CDH
#6407 Remove empty unreferenced file unshimmed-spark311.txt
#6379 Rebalance time of parallel stages for pre-merge CI
#6358 Support _ in spark conf of integration tests
#6387 Use new custom kernel for large decimal multiply
#6355 Include filterblock time in scan time metric for Coalescing readers
#6393 Add zip&unzip in pre-merge dockerfile
#6374 Remove anthony-chang [skip ci]
#6349 Add Nan handling in GpuArrayMin
#6371 Fix datetime name collision in cast_test
#6361 Binary type support in Iceberg read
#6306 Struct null aware equality comparator <=> support
#6350 Allow writing Binary data in Parquet
#6365 Honor delta_lake marker for pytest
#6271 Add format SSS for date_format function
#6338 Adding AutoTuner to Profiling Tool
#6356 Fix auto merge conflict 6353 [skip ci]
#6342 Avoid passing duplicate conf to spark_init_internal
#6286 Change TimestampGen unit in integration test from millisecond to microsecond
#6335 Add missing subnet option to dataproc cluster example [skip ci]
#6307 Add more information in FileSourceScanExec log when timezone is not UTC
#5981 Run Delta Lake tests with Spark 3.2.x
#5646 Use Spark's Utils.getContextOrSparkClassLoader to load Shims
#6333 Make run_pyspark to report fail and error as default
#6044 [BUG] Fix IT discrepancy which depending on TEST_PARALLEL
#6311 Re-implement cast timestamp to string and add more tests
#6316 Add Nan handling for GpuArrayMax
#6256 [Bug] Add Expr OverflowInTableInsert to fix AnsiCastOpSuite
#6314 Increase robustness of mvn commands in nightly scripts
#6318 [BugFix]Change the RapidsDiskBlockManager in ShuffleBufferCatalog to guarantee the shuffle files can be cleaned successfully
#6006 Estimate and validate regular expression complexities
#6305 Increase robustness of MVN commands in pre-merge scripts
#6309 Add BinaryType to some shimmed expressions
#6062 Nested struct binary comparison operator support
#6298 Add BinaryType support to operations that already support arrays
#6297 Fix merge conflict with branch-22.08
#6241 Read metadata only when read schema is empty in Avro multi-threaded reading
#5989 Add NaN handling in GpuMax
#6203 Add config option to log all query transformations
#6246 Fix merge conflict with 22.08
#6247 regexp: Catch "nothing to repeat" errors nested in groups
#6237 Preserve underscore in executorEnv in integration tests
#6235 Fix merge conflict with branch-22.08
#6110 Iceberg Parquet supports multi-threaded reading.
#6227 Configurable task failures in integration tests
#6194 Make dist jar compression opt-out optional
#6211 Added in very specific support for from_json to a Map<String,String>
#6218 Disable overflow tableInsert tests for 331+
#6210 Fix merge conflict with branch-22.08
#6152 Improve coverage in mvn verify check github workflow
#6156 Fix Bloop project generation in buildall [skip ci]
#5946 GpuGlobalLimitExec and GpuCollectLimitExec support offset
#6162 Remove hard-coded versions from buildall [skip ci]
#6055 Add tests for .count() in the file readers
#6129 Init 22.10.0-SNAPSHOT

Release 22.08

Features

#6081 [FEA] Update spark2 code for 22.08
#5508 [FEA] collect_set on struct[Array]
#5222 [FEA] Support function array_except
#5228 [FEA] Support array_union
#5188 [FEA] Support arrays_overlap
#4932 [FEA] Support ArrayIntersect on at least Arrays of String
#4005 [FEA] Support First() in windowing context with Integer type
#5061 [FEA] Support last in windowing context for Integer type.
#6059 [FEA] Add SQL table to Qualification's app-details view
#5617 [FEA] Qualification tool support parsing expressions (part 1)
#4719 [FEA] GpuStringSplit: Add support for line and string anchors in regular expressions
#5502 [FEA] Qualification tool should use SQL ID of each Application ID like profiling tool
#5524 [FEA] Automatically adjust spark.rapids.sql.format.parquet.multiThreadedRead.numThreads to the same as spark.executor.cores
#4817 [FEA] Support Iceberg batch reads
#5510 [FEA] Support Iceberg for data INSERT, DELETE operations
#5890 [FEA] Mount the alluxio buckets/paths on the fly when the query is being executed
#6018 [FEA] Support Spark 3.2.2
#5417 [FEA] Fully support reading parquet binary as string
#4283 [FEA] Implement regexp_extract_all on GPU for idx > 0
#4353 [FEA] Implement regexp_extract_all on GPU for idx = 0
#5813 [FEA] Set sql.json.read.double.enabled and sql.csv.read.double.enabled to true by default
#4720 [FEA] GpuStringSplit: Add support for limit = 0 and limit =1
#5953 [FEA] Support Rocky Linux release
#5204 [FEA] Support Key vectors for GetMapValue and ElementAt for maps.
#4323 [FEA] Profiling tool add option to filter based on filesystem date
#5846 [FEA] Support null characters in regular expressions
#5904 [FEA] Add support for negated POSIX character classes in regular expressions
#5702 [FEA] Set spark.rapids.sql.explain=NOT_ON_GPU by default
#5867 [FEA] Add shim for Spark 3.3.1
#5628 [FEA] Enable Application detailed view in Qualification UI
#5831 [FEA] Update default speedup factors used for qualification tool
#4519 [FEA] Add regular expression support for Form Feed, Alert, and Escape control characters
#4040 [FEA] Support spark.sql.parquet.binaryAsString=true
#5797 [FEA] Support RoundCeil and RoundFloor when scale is zero
#4468 [FEA] Support repetition quantifiers ? and * with regexp_replace
#5679 [FEA] Support MMyyyy date/timestamp format
#4413 [FEA] Add support for POSIX characters in regular expressions
#4289 [FEA] Regexp: Add support for word and non-word boundaries in regexp pattern
#4517 [FEA] Add support for word boundaries \b and \B in regular expressions

Performance

#6060 [FEA] Add experimental multi-threaded BypassMergeSortShuffleWriter
#5453 [FEA] Support runtime filters for BatchScanExec
#5075 Performance can be very slow when reading just a few columns out of many on parquet
#5624 [FEA] Let CPU handle Delta table's metadata related queries
#4837 [FEA] Optimize JSON reading of floating-point values

Bugs Fixed

#6112 [BUG] UCX ubuntu dockerfile build failed
#6281 [BUG] Reading binary columns from nested types does not work.
#6282 [BUG] Missing CPU fallback for GetMapValue on scalar map, vector key
#6208 [BUG] test_array_intersect failed in databricks 10.4 runtime and Spark 3.3+
#6249 [BUG] test_array_union_before_spark313 failed in UCX job
#6232 [BUG] Query failed with java.lang.NullPointerException when doing GpuSubqueryBroadcastExec
#6230 [BUG] AQE does not respect entirePlanWillNotWork
#6131 [BUG] count() in avro failed when reader_types is coalescing
#6220 [BUG] Host buffer leak occurred when executing count with Avro multi-threaded reader
#6160 [BUG] When Hive table's actual data has varchar, but the DDL is string, then query fails to do varchar to string conversion
#6183 [BUG] Qualification UI uses single precision floating point
#6005 [BUG] When old Hive partition has different schema than new partition& Hive Schema, read old partition fails with "Found no metadata for schema index"
#6158 [BUG] AQE being used on Databricks even when its disabled
#6179 [BUG] Qualfication tool per sql output --num-output-rows option broken
#6157 [BUG] Pandas UDF hang in Databricks
#6167 [BUG] iceberg_test failed in nightly
#6128 [BUG] Can not ansi cast decimal type to long type while fetching decimal column from data table
#6029 [BUG] Query failed if reading a Hive partition table with partition key column is a Boolean data type, and if spark.rapids.alluxio.pathsToReplace is set
#6054 [BUG] Test Parquet nested unsigned int: uint8, uint16, uint32 FAILED in spark 320+
#6086 [BUG] checkValue does not work in RapidsConf
#6127 [BUG] regex_test failed in nightly
#6026 [BUG] Failed to cast value false to BooleanType for partition column k1
#5984 [BUG] DATABRICKS: NullPointerException: format is null in 22.08 (works fine with 22.06)
#6089 [BUG] orc_test is failing on Spark 3.2+
#5892 [BUG] When using Alluxio+Spark RAPIDS, if the S3 bucket is not mounted, then query will return nothing
#6056 [BUG] zstd integration tests failed for orc on Cloudera
#5957 [BUG] Exception calling collect() when partitioning using with arrays with null values using array_union(...)
#6017 [BUG] test_parquet_read_round_trip hanging forever in spark 32x standalone mode
#6035 [BUG] cache tests throws ClassCastException on Databricks
#6032 [BUG] Part of the plan is not columnar class org.apache.spark.sql.execution.ProjectExec failure
#6028 [BUG] regexp_test is failing in nightly tests
#3677 [BUG] PCBS does not fully follow the pattern for public classes
#6022 [BUG] test_iceberg_fallback_not_unsafe_row failed in databricks 10.4 runtime
#109 [BUG] GPU degreees function does not overflow
#5959 [BUG] test_parquet_read_encryption fails
#5493 [BUG] test_parquet_read_merge_schema failed w/ TITAN V
#5521 [BUG] Investigate regexp failures with unicode input
#5629 [BUG] regexp unicode tests require LANG=en_US.UTF-8 to pass
#5448 [BUG] partitioned writes require single batches and sorting, causing gpu OOM in some cases
#6003 [BUG] join_test failed in integration tests
#5979 [BUG] executors shutdown intermittently during integrations test parallel run
#5948 [BUG] GPU ORC reading fails when positional schema is enabled and more columns are required.
#5909 [BUG] Null characters do not work in regular expression character classes
#5956 [BUG] Warnings in build for GpuRegExpUtils with group_index
#4676 [BUG] Research associating MemoryCleaner to Spark's ShutdownHookManager
#5854 [BUG] Memory leaked in some test cases
#5937 [BUG] test_get_map_value_string_col_keys_ansi_fail in databricks321 runtime
#5891 [BUG] GpuShuffleCoalesce op time metric doesn't include concat batch time
#5896 [BUG] Profiling tool on taking a really long time for integration tests
#5939 [BUG] Qualification tool UI. Read Schema column is broken
#5711 [BUG] regexp: Build fails on CI when more characters added to fuzzer but not locally
#5929 [BUG] test_sorted_groupby_first_last failed in nightly tests
#5914 [BUG] test_parquet_compress_read_round_trip tests failed in spark320+
#5859 [BUG] Qualification tools csv order is not in sync
#5648 [BUG] compile-time references to classes potentially unavailable at run time
#5838 [BUG] Qualification ui output goes to wrong folder
#5855 [BUG] MortgageSparkSuite.scala set spark.rapids.sql.explain as true, which is invalid
#5630 [BUG] Qualification UI cannot render long strings
#5732 [BUG] fix estimated speed-up for not-applicable apps in Qualification results
#5788 [BUG] Qualification UI Sanitize template content
#5836 [BUG] string_test.py::test_re_replace_repetition failed IT
#5837 [BUG] test_parquet_read_round_trip_binary_as_string failures on YARN and Dataproc
#5726 [BUG] CastChecks.sparkIntegralSig has BINARY in it twice
#5775 [BUG] TimestampSuite is run on Spark 3.3.0 only
#5678 [BUG] Inconsistency between the time zone in the fallback reason and the actual time zone checked in RapidsMeta.checkTImeZoneId
#5688 [BUG] AnsiCast is merged into Cast in Spark 340, failing the 340 build
#5480 [BUG] Some arithmetic tests are failing on Spark 3.4.0
#5777 [BUG] repeated runs of mvn package without clean lead to missing spark-rapids-jni-version-info.properties in dist jar
#5456 [BUG] Handle regexp_replace inconsistency from https://issues.apache.org/jira/browse/SPARK-39107
#5683 [BUG] test_cast_neg_to_decimal_err failed in recent 22.08 tests
#5525 [BUG] Investigate more edge cases in regexp support
#5744 [BUG] Compile failure with Spark 3.2.2
#5707 [BUG] Fix shim-related bugs

PRs

#6376 Update 22.08 changelog to latest
#6367 Revert "Enable Strings as a supported type for GpuColumnarToRow transitions"
#6354 Update 22.08 changelog to latest [skip ci]
#6348 Update plugin jni version to released 22.08.0
#6234 [Doc] Add 22.08 docs' links [skip ci]
#6288 CPU fallback for Map scalars with key vectors
#6292 Fix parquet binary reads to do the transformation in the plugin
#6257 Fallback to CPU for Parquet reads with _databricks_internal columns
#6274 Use schema instead of row field count during columnar conversion
#6268 Apply BroadcastMode key projections before interpreting key expressions in subqueries
#6250 Fix bug where AQE does not respect entirePlanWillNotWork
#6248 Fix some issues with reading binary from parquet
#6239 Add rocky Dockerfiles and refine docker documentation
#6079 Add support for nested types to collect_set(...) on the GPU
#6215 Update Spark2 Explain API code for 22.08
#6161 Added binary read support for Parquet [Databricks]
#6222 Init 22.08 changelog [skip ci]
#6225 Fix count() in avro failed when reader_types is coalescing
#6216 [Doc] Update 22.08 documentation
#6223 Temporary fix for test_array_intersect failures on Spark 3.3.0
#6221 Release host buffers when Avro read schema is empty
#6132 [DOC]update outofdate mortgage notebooks and update docs for xgboost161 jar[skip ci]
#6188 Allow ORC conversion from VARCHAR to STRING
#6013 Add fixed issues to regex fuzzer
#5958 Add set based operations for arrays: array_intersect, array_union, array_except, and arrays_overlap for running on GPU
#6189 Qualification UI change floating precision [skip ci]
#6063 Fix Parquet schema evolution when missing column is in a nested type
#6159 Workaround for Databricks using AQE even when disabled
#6181 Fix the qualification tool per sql number output rows option
#6166 Update the configs used to choose the Python runner for flat-map Pandas UDF
#6169 Fix IcebergProvider classname in unshim exceptions
#6103 Fix crash when casting decimals to long
#6071 Update test_add_overflow_with_ansi_enabled and test_subtraction_overflow_with_ansi_enabled to check the exception type for Integral case.
#6136 Fix Alluxio inferring partitions for BooleanType with Hive
#6027 Re-enable "transpile complex regex 2" scala test
#6140 Update profile names in unit tests docs [skip ci]
#6141 Fixes threaded shuffle writer test mocks for spark 3.3.0+
#6147 Revert "Temporarily disable Parquet unsigned int test in ParquetScanS…
#6133 [DOC]update getting started guide doc for aws-emr670 release[skip ci]
#6007 Add doc for parsing expressions in qualification tool [skip ci]
#6125 Add SQL table to Qualification's app-details view [skip ci]
#6116 Fix: check validity before setting the default value
#6120 Qualification Tool add test for SQL Description escaping commas for csv
#6106 Qualification tool: Parse expressions in WindowExec
#6040 Enable anchors in regexp string split
#6052 Multi-threaded shuffle writer for RapidsShuffleManager
#5998 Enable Strings as a supported type for GpuColumnarToRow transitions
#6092 Qualification tool output recommendations on a per sql query basis
#6104 Revert to only supporting Apache Iceberg 0.13.x
#6111 Fix missed gnupg2 in ucx example dockerfiles [skip ci]
#6107 Disable snapshot shims build in 22.08
#6016 Automatically adjust spark.rapids.sql.multiThreadedRead.numThreads to the same as spark.executor.cores
#6098 Support Apache Iceberg 0.14.0
#6097 Fix 3.3 shim to include castTo handling AnyTimestampType and minor spacing
#6057 Tag GpuWindow child expressions for GPU execution
#6090 Add missing is_spark_321cdh import in orc_test
#6048 Port whole parsePartitions method from Spark3.3 to Gpu side
#5941 GPU accelerate Apache Iceberg reads
#5925 Add Alluxio auto mount feature
#6004 Check the existence of alluxio path
#6082 Enable auto-merge from branch-22.08 to branch-22.10 [skip ci]
#6058 Disable zstd orc tests in cdh
#6078 Temporarily disable Parquet unsigned int test in ParquetScanSuite
#6049 Fix test hang caused by parquet hadoop test jar log4j file
#6042 Qualification tool: Parse expressions in Aggregates and Sort execs.
#6041 Improve check for UTF-8 in integration tests by testing from the JVM
#5970 Address feedback in "Improve regular expression error messages" PR
#6000 Support nth_value, first and last in window context
#6031 Update spark322shim dependency to released lib
#6033 Refactor: Fix PCBS does not fully follow the pattern for public classes
#6019 Update the interval division to throw same type exceptions as Spark
#6030 Cleans up some of the redundant code in proxy/internal RAPIDS Shuffle Managers
#5988 [FEA] Add a progress bar in Qualification tool when it is running
#6020 Unify test modes in databricks test script
#6025 Skip Iceberg tests on Databricks
#5983 Adding AUTO native parquet support and legacy tests
#6010 Update docs to better explain limitations of Dataset support
#5996 Fix GPU degrees function does not overflow
#5994 Skip Parquet encryption read tests if Parquet version is less than 1.12
#5776 Enable regular expression support based on whether UTF-8 is in the current locale
#6009 Fix issue where spark-tests was producing an unintended error code
#5903 Avoid requiring single batch when using out-of-core sort
#6008 Rename test modes in spark-tests.sh [skip ci]
#5991 Enable zstd integration tests for parquet and orc
#5997 support testing parquet encryption
#5968 Add support for regexp_extract_all on GPU
#5995 Fix a minor potential issue when rebatching for GpuArrowEvalPythonExec
#5960 Set up the framework of type casting for ORC reading
#5987 Document how to check if finalized plan on GPU from user code / REPLs [skip ci]
#5982 Use the new native parquet footer API instead of the old one
#5972 [DOC] add app-details to qualification tools doc [skip ci]
#5976 Enable null in regex character classes
#5974 Remove scaladoc warning
#5912 Fall back to CPU for Delta Lake metadata queries
#5955 Fix fake memory leaks in some test cases
#5915 Make the error message of changing decimal type the same as Spark's
#5971 Append new authorized user to blossom-ci whitelist [skip ci]
#5967 [Doc]In Databricks doc, disable DPP config[skip ci]
#5871 Improve regular expression error messages
#5952 Qualification tool: Parse expressions in ProjectExec
#5961 Don't set spark.sql.ansi.strictIndexOperator to false for array subscript test
#5935 Enable reading double values on GPU when reading CSV and JSON
#5950 Fix GpuShuffleCoalesce op time metric doesn't include concat batch time
#5932 Add string split support for limit = 0 and limit =1
#5951 Fix issue with Profiling tool taking a long time due to finding stage ids that maps to sql nodes
#5954 Add IT dockerfile for rockylinux8 [skip ci]
#5949 Update GpuAdd and GpuSubtract to throw same type exception as Spark
#5878 Fix misleading documentation for approx_percentile and some other functions
#5913 Update gcp cluster init option [skip ci]
#5940 Qualification tool UI. fix Read-Schema column broken [skip ci]
#5938 Fix leaks in the test cases of CachedBatchWriterSuite
#5934 Add underscore to regexp fuzzer
#5936 [BUG] Fix databricks test report location
#5883 Add support for element_at and GetMapValue
#5918 Filter profiling tool based on start time.
#5926 Collect databricks test report
#5924 Changes made to the Audit process for prioritizing the commits [skip-ci]
#5834 Add support for null characters in regular expressions
#5930 Make first/last test for sorted deterministic
#5917 Improve sort removal heuristic for sort aggregate
#5916 Revert "Enable testing zstd for spark releases 3.2.0 and later (#5898)"
#5686 Add GpuMapConcat support for nested-type values
#5905 Add support for negated POSIX character classes \P
#5898 Enable testing parquet with zstd for spark releases 3.2.0 and later
#5900 Optimize some common if/else cases
#5869 Qualification: fix sorting and add unit-tests script
#5819 Modify the default value of spark.rapids.sql.explain as NOT_ON_GPU
#5723 Dynamically load hive and avro using reflection to avoid potential class not found exception
#5886 Avoid serializing plan in GpuCoalesceBatches, GpuHashAggregateExec, and GpuTopN
#5897 GpuBatchScanExec partitions should be marked transient
#5894 [Doc]fix a typo with double "("[skip ci]
#5880 Qualification tool: Parse expressions in FilterExec
#5885 [Doc] Fix alluxio doc link issue[skip ci]
#5879 Avoid duplicate sanitization step when reading JSON floats
#5877 Add Apache Spark 3.3.1-SNAPSHOT Shims
#5783 assertMinValueOverflow should throw same type of exception as Spark
#5875 Qualification ui output goes to wrong folder
#5870 Use a common thread pool across formats for multithreaded reads
#5868 Profiling tool add wholestagecodegen to execs mapping, sql to stage info and job end time
#5873 Correct the value of spark.rapids.sql.explain
#5695 Verify DPP over LIKE ANY/ALL expression
#5856 Update unit test doc
#5866 Fix CsvScanForIntervalSuite leak issues
#5810 Qualification UI - add application details view
#5860 [Doc]Add Spark3.3 support in doc[skip ci]
#5858 Remove SNAPSHOT support from Spark 3.3.0 shim
#5857 Remove user sperlingxx[skip ci]
#5841 Enable regexp empty string short circuit on shim version 3.1.3
#5853 Fix auto merge conflict 5850
#5845 Update Parquet binaryAsString integration to use a static parquet file
#5842 Update default speedup factors for qualification tool
#5829 Add regexp support for Alert, and Escape control characters
#5833 Add test for GpuCast canonicalization with timezone
#5822 Configure log4j version 2.x for test cases
#5830 Enable the spark.sql.parquet.binaryAsString=true configuration option on the GPU
#5805 [Issue 5726] Removing duplicate BINARY keyword
#5828 Update tools module to latest Hadoop version
#5809 Disable Spark 3.4.0 premerge for 22.08 and enable for 22.10
#5767 Fix the time zone check issue
#5814 Fix auto merge conflict 5812 [skip ci]
#5804 Support RoundCeil and RoundFloor when scale is zero
#5696 Support Parquet field IDs
#5749 Add shims for AnsiCast
#5780 Append new authorized user to blossom-ci whitelist [skip ci]
#5350 Halt Spark executor when encountering unrecoverable CUDA errors
#5779 Fix repeated runs mvn package without clean lead to missing spark-rapids spark-rapids-jni-version-info.properties in dist jar
#5800 Fix auto merge conflict 5799
#5794 Fix auto merge conflict 5789
#5740 Handle regexp_replace inconsistency with empty strings and zero-repetition patterns
#5790 Fix auto merge conflict 5789
#5690 Update the error checking of test_cast_neg_to_decimal_err
#5774 Fix merge conflict with branch-22.06
#5768 Support MMyyyy date/timestamp format
#5692 Add support for POSIX predefined character classes
#5762 Fix auto merge conflict 5759
#5754 Fix auto merge conflict 5752
#5450 Handle ?, *, {0,} and {0,n} based repetitions in regexp_replace on the GPU
#5479 Add support for word boundaries \b and \B
#5745 Move RapidsErrorUtils to org.apache.spark.sql.shims package
#5610 Fall back to CPU for unsupported regular expression edge cases with end of line/string anchors and newlines
#5725 Fix auto merge conflict 5724
#5687 Minor: Clean up GpuConcat
#5710 Fix auto merge conflict 5709
#5708 Fix shim-related bugs
#5700 Fix auto merge conflict 5699
#5675 Update the error messages for the failing arithmetic tests.
#5689 Disable 340 for premerge and nightly
#5603 Skip unshim and dedup of external spark-rapids-jni and jucx
#5472 Add shims for Spark 3.4.0
#5647 Init version 22.08.0-SNAPSHOT

Release 22.06

Features

#5451 [FEA] Update Spark2 explain code for 22.06
#5261 [FEA] Create MIG with Cgroups on YARN Dataproc scripts
#5476 [FEA] extend concat on arrays to all nested types.
#5113 [FEA] ANSI mode: Support CAST between types
#5112 [FEA] ANSI mode: allow casting between numeric type and timestamp type
#5323 [FEA] Enable floating point by default
#4518 [FEA] Add support for escaped unicode hex in regular expressions
#5405 [FEA] Support map_concat function
#5547 [FEA] Regexp: Can we transpile \W and \D to Java's definition so we can support on GPU?
#5512 [FEA] Qualification tool, hook up final output and output execs table
#5507 [FEA] Support GpuRaiseError
#5325 [FEA] Support spark.sql.mapKeyDedupPolicy=LAST_WIN for TransformKeys
#3682 [FEA] Use conventional jar layout in dist jar if there is only one input shim
#1556 [FEA] Implement ANSI mode tests for string to timestamp functions
#4425 [FEA] Support line anchor $ and string anchors \z and \Z in regexp_replace
#5176 [FEA] Qualification tool UI
#5111 [FEA] ANSI mode: CAST between ANSI intervals and IntegralType
#4605 [FEA] Add regular expression support for new character classes introduced in Java 8
#5273 [FEA] Support map_filter
#1557 [FEA] Enable ANSI mode for CAST string to date
#5446 [FEA] Remove hasNans check for array_contains
#5445 [FEA] Support reading Int as Byte/Short/Date from parquet
#5449 [FEA] QualificationTool. Add speedup information to AppSummaryInfo
#5322 [FEA] remove hasNans for Pivot
#4800 [FEA] Enable support for more regular expressions with \A and \Z
#5404 [FEA] Add Shim for the Spark version shipped with Cloudera CDH 7.1.7
#5226 [FEA] Support array_repeat
#5229 [FEA] Support arrays_zip
#5119 [FEA] Support ANSI mode for SQL functions/operators
#4532 [FEA] Re-enable support for \Z in regular expressions
#3985 [FEA] UDF-Compiler: Translation of simple predicate UDF should allow predicate pushdown
#5034 [FEA] Implement ExistenceJoin for BroadcastNestedLoopJoin Exec
#4533 [FEA] Re-enable support for $ in regular expressions
#5263 [FEA] Write out operator mapping from plugin to CSV file for use in qualification tool
#5095 [FEA] Support collect_set on struct in reduction context
#4811 [FEA] Support ANSI intervals for Cast and Sample
#2062 [FEA] support collect aggregations
#5060 [FEA] Support Count on Struct of [ Struct of [String, Map(String,String)], Array(String), Map(String,String) ]
#4528 [FEA] Add support for regular expressions containing \s and \S
#4557 [FEA] Add support for regexp_replace with back-references

Performance

#5148 Add the MULTI-THREADED reading support for avro
#5304 [FEA] Optimize remote Avro reading for a PartitionFile
#5257 [FEA][Audit] - [SPARK-34863][SQL] Support complex types for Parquet vectorized reader
#5149 Add the COALESCING reading support for avro

Bugs Fixed

#5769 [BUG] arithmetic ops tests failing on Spark 3.3.0
#5785 [BUG] Tests module build failed in OrcEncryptionSuite for 321cdh
#5765 [BUG] Container decimal overflow when casting float/double to decimal
#5246 Verify Parquet columnar encryption is handled safely
#5770 [BUG] test_buckets failed
#5733 [BUG] Integration test test_orc_write_encryption_fallback fail
#5719 [BUG] test_cast_float_to_timestamp_ansi_for_nan_inf failed in spark330
#5739 [BUG] Spark 3.3 build failure - QueryExecutionErrors package scope changed
#5670 [BUG] Job failed when parsing "java.lang.reflect.InvocationTargetException: org.apache.spark.sql.catalyst.parser.ParseException:"
#4860 [BUG] GPU writing ORC columns statistics
#5717 [BUG] div_by_zero test is failing on Spark 330 on 22.06
#5632 [BUG] udf_cudf tests failed: EOFException DataInputStream.readInt(DataInputStream.java:392)
#5672 [BUG] Read exception occurs when clipped schema is empty
#5694 [BUG] Inconsistent behavior with Spark when reading a non-existent column from Parquet
#5562 [BUG] read ORC file with various file schemas
#5654 [BUG] Transpiler produces regex pattern that cuDF cannot compile
#5655 [BUG] Regular expression pattern [&&1] produces incorrect results on GPU
#4862 [FEA] Add support for regular expressions containing octal digits inside character classes , eg[\0177]
#5615 [BUG] GpuBatchScanExec only reports output row metrics
#4505 [BUG] RegExp parse fails to parse character ranges containing escaped characters
#4865 [BUG] Add support for regular expressions containing hexadecimal digits inside character classes, eg [\x7f]
#5513 [BUG] NoClassDefFoundError with caller classloader off in GpuShuffleCoalesceIterator in local-cluster
#5530 [BUG] regexp: \d, \w inconsistencies with non-latin unicode input
#5594 [BUG] 3.3 test_div_overflow_exception_when_ansi test failures
#5596 [BUG] Shim service provider failure when using jar built with -DallowConventionalDistJar
#5582 [BUG] Nightly CI failed with : 'dist/target/rapids-4-spark_2.12-22.06.0-SNAPSHOT.jar' not exists
#5577 [BUG] test_cast_neg_to_decimal_err failing in databricks
#5557 [BUG] dist jar does not contain reduced pom, creates an unnecessary jar
#5474 [BUG] Spark 3.2.1 arithmetic_ops_test failures
#5497 [BUG] 3 tests in IntervalSuite are faling on 330
#5544 [BUG] GpuCreateMap needs to set hasSideEffects in some cases
#5469 [BUG] NPE during serialization for shuffle in array-aggregation-with-limit query
#5496 [BUG] avg literals bools is failing on 330
#5511 [BUG] orc_test failures on 321cdh
#5439 [BUG] Encrypted Parquet writes are being replaced with a GPU unencrypted write
#5108 [BUG] GpuArrayExists encounters a CudfException on an input partition consisting of just empty lists
#5492 [BUG] com.nvidia.spark.rapids.RegexCharacterClass cannot be cast to com.nvidia.spark.rapids.RegexCharacterClassComponent
#4818 [BUG] ASYNC: the spill store needs to synchronize on spills against the allocating stream
#5481 [BUG] test_parquet_check_schema_compatibility failed in databricks runtimes
#5482 [BUG] test_cast_string_date_invalid_ansi_before_320 failed in databricks runtime
#5457 [BUG] 330 AnsiCastOpSuite Unit tests failed 22 cases
#5098 [BUG] Harden calls to RapidsBuffer.free
#5464 [BUG] Query failure with java.lang.AssertionError when using partitioned Iceberg tables
#4746 [FEA] Add support for regular expressions containing octal digits in range \200 to 377
#5200 [BUG] More detailed logs to show which parquet file and which data type has mismatch.
#4866 [BUG] Add support for regular expressions containing hexadecimal digits greater than 0x7f
#5140 [BUG] NPE on array_max of transformed empty array
#5444 [BUG] build failed on Databricks
#5357 [BUG] Spark 3.3 cache_test test_passing_gpuExpr_as_Expr[failures
#5429 [BUG] test_cache_expand_exec fails on Spark 3.3
#5312 [BUG] The coalesced AVRO file may contain different sync markers if the sync marker varies in the avro files being coalesced.
#5415 [BUG] Regular Expressions: matching the dot . doesn't fully exclude all unicode line terminator characters
#5413 [BUG] Databricks 321 build fails - not found: type OrcShims320untilAllBase
#5286 [BUG] assert failed test_struct_self_join and test_computation_in_grpby_columns
#5351 [BUG] Build fails for Spark 3.3 due to extra arguments to mapKeyNotExistError
#5260 [BUG] map_test failures on Spark 3.3.0
#5189 [BUG] Reading from iceberg table will fail.
#5130 [BUG] string_split does not respect spark.rapids.sql.regexp.enabled config
#5267 [BUG] markdown link check failed issue
#5295 [BUG] Build fails for Spark 3.3 due to extra arguments to mapKeyNotExistError
#5264 [BUG] Delete unused generic type.
#5275 [BUG] rlike cannot run on GPU because invalid or unsupported escape character ']' near index 14
#5278 [BUG] build 311cdh failed: unable to find valid certification path to requested target
#5211 [BUG] csv_test:test_basic_csv_read FAILED
#5244 [BUG] Spark 3.3 integration test failures logic_test.py::test_logical_with_side_effect
#5041 [BUG] Implement hasSideEffects for all expressions that have side-effects
#4980 [BUG] window_function_test FAILED on PASCAL GPU
#5240 [BUG] EGX integration test_collect_list_reductions failures
#5242 [BUG] Executor falls back to cudaMalloc if the pool can't be initialized
#5215 [BUG] Coalescing reading is not working for v2 parquet/orc datasource
#5104 [BUG] Unconditional warning in UDF Plugin "The compiler is disabled by default"
#5099 [BUG] Profiling tool should not sum gettingResultTime
#5182 [BUG] Spark 3.3 integration tests arithmetic_ops_test.py::test_div_overflow_exception_when_ansi failures
#5147 [BUG] object LZ4Compressor is not a member of package ai.rapids.cudf.nvcomp
#4695 [BUG] Segfault with UCX and ASYNC allocator
#5138 [BUG] xgboost job failed if we enable PCBS
#5135 [BUG] GpuRegExExtract is not align with RegExExtract
#5084 [BUG] GpuWriteTaskStatsTracker complains for all writes in local mode
#5123 [BUG] Compile error for Spark330 because of VectorizedColumnReader constructor added a new parameter.
#5133 [BUG] Compile error for Spark330 because of Spark changed the method signature: QueryExecutionErrors.mapKeyNotExistError
#4959 [BUG] Test case in OpcodeSuite failed on Spark 3.3.0

PRs

#5863 Update 22.06 changelog to include new commits [skip ci]
#5861 [Doc]Add Spark3.3 support in doc for 22.06 branch[skip ci]
#5851 Update 22.06 changelog to include new commits [skip ci]
#5848 Update spark330shim to use released lib
#5840 [DOC] Updated RapidsConf to reflect the default value of spark.rapids.sql.improvedFloatOps.enabled [skip ci]
#5816 Update 22.06.0 changelog to latest [skip ci]
#5795 Update FAQ to include local jar deployment via extraClassPath [skip ci]
#5802 Update spark-rapids-jni.version to release 22.06.0
#5798 Fall back to CPU for RoundCeil and RoundFloor expressions
#5791 Remove ORC encryption test from 321cdh
#5766 Fix the overflow of container type when casting floats to decimal
#5786 Fix rounds over decimal in Spark 330+
#5761 Throw an exception when attempting to read columnar encrypted Parquet files on the GPU
#5784 Update the error string for test_cast_neg_to_decimal_err on 330
#5781 Correct the exception string for test_mod_pmod_by_zero on Spark 3.3.0
#5764 Add test for encrypted ORC write
#5760 Enable avrotest in nightly tests [skip ci]
#5746 Init 22.06 changelog [skip ci]
#5716 Disable Avro support when spark-avro classes not loadable by Shim classloader
#5737 Remove the ORC encryption tests
#5753 [DOC] Update regexp compatibility for 22.06 [skip ci]
#5738 Update Spark2 explain code for 22.06
#5731 Throw SparkDateTimeException for InvalidInput while casting in ANSI mode
#5742 Spark-3.3 build fix - Move QueryExecutionErrors to sql package
#5641 [Doc]Update 22.06 documentation[skip ci]
#5701 Update docs for qualification tool to reflect recommendations and UI [skip ci]
#5283 Add documentation for MIG on Dataproc [skip ci]
#5728 Qualification tool: Add test for stage failures
#5681 Branch 22.06 nvcomp notice binary [skip ci]
#5713 Fix GpuCast losing the timezoneId during canonicalization
#5715 Update GPU ORC statistics write support
#5718 Update the error message for div_by_zero test
#5604 ORC encrypted write should fallback to CPU
#5674 Fix reading ORC/PARQUET over empty clipped schema
#5676 Fix ORC reading over different schemas
#5693 Temporarily allow 3.3.1 for 3.3.0 shims.
#5591 Enable regular expressions by default
#5664 Fix edge case where one side of regexp choice ends in duplicate string anchors
#5542 Support arrays of arrays and structs for concat on arrays
#5677 Qualification tool Enable UI by default
#5575 Regexp: Transpile \D, \W to Java's definitions
#5668 Add user as CI owner [skip ci]
#5627 Install locales and generate en_US.UTF-8
#5514 ANSI mode: allow casting between numeric type and timestamp type
#5600 Qualification tool UI cosmetics and CSV output changes
#5658 Fallback to CPU when && found in character class
#5644 Qualification tool: Enable UDF reporting in potential problems
#5645 Add support for octal digits in character classes
#5643 Fix missing GpuBatchScanExec metrics in SQL UI
#5441 Enable optional float confs and update docs mentioning them
#5532 Support hex digits in character classes and escaped characters in character class ranges
#5625 [DOC]update links for 2206 release[skip ci]
#5623 Handle duplicates in negated character classes
#5533 Support GpuMapConcat
#5614 Move HostConcatResultUtil out of unshimmed classes
#5612 Qualification tool: update SQL Df value used and look at jobs in SQL
#5526 Fix whitespace \s and \S tests
#5541 Regexp: Transpile \d, \w to Java's definitions
#5598 Qualification tool: Update RunningQualificationApp tests
#5601 Update test_div_overflow_exception_when_ansi test for Spark-3.3
#5588 Update Databricks build scripts
#5599 Move ShimServiceProvider file re-init/truncate
#5531 Filter rows with null keys when coalescing due to reaching cuDF row limits
#5550 Qualification tool hook up final output based on per exec analysis
#5540 Support RaiseError
#5505 Support spark.sql.mapKeyDedupPolicy=LAST_WIN for TransformKeys
#5583 Disable spark snapshot shims build for pre-merge
#5584 Enable automerge from branch-22.06 to 22.08 [skip ci]
#5581 nightly CI to install and deploy cuda11 classifier dist jar [skip ci]
#5579 Update test_cast_neg_to_decimal_err to work with Databricks 10.4 where exception is different
#5578 Fix unfiltered partitions being used to create GpuBatchScanExec RDD
#5560 Minor: Clean up the tests of concat_list
#5528 Enable build and test with JDK11
#5571 Update array_min and array_max to use new cudf operations
#5558 Fix target file for update from extra-resources in dist module
#5556 Move FsInput creation into AvroFileReader
#5483 Don't distinguish between types of ArithmeticException for Spark 3.2.x
#5539 Fix IntervalSuite cases failure
#5421 Support multi-threaded reading for avro
#5538 Add tests for string to timestamp functions in ANSI mode
#5546 Set hasSideEffects correctly for GpuCreateMap
#5529 Fix failing bool agg test in Spark 3.3
#5500 Fallback parquet reading with merged schema and native footer reader
#5534 MVN_OPT to last, as it is empty in most cases
#5523 Enable forcePositionEvolution for 321cdh
#5501 Build against specified spark-rapids-jni snapshot jar [skip ci]
#5489 Fallback to the CPU if Parquet encryption keys are set
#5527 Fix bug with character class immediately following a string anchor
#5506 Fix ClassCastException in regular expression transpiler
#5519 Address feedback in "string anchors regexp replace" PR
#5520 [DOC] Remove Spark from our naming of Tools [skip ci]
#5491 Enables $, \z, and \Z in REGEXP_REPLACE on the GPU
#5470 Qualification tool support UI code generation
#5353 Supports casting between ANSI interval types and integral types
#5487 Add limited support for captured vars and athrow
#5499 [DOC]update doc for emr6.6[skip ci]
#5485 Add cudaStreamSynchronize when a new device buffer is added to the spill framework
#5477 Add support for \h, \H, \v, \V, and \R character classes
#5490 Qualification tool: Update speedup factor for few operators
#5494 Fix databrick Shim to support Ansi mode when casting from string to date
#5498 Enable 330 unit tests for nightly
#5504 Fix printing of split information when dumping debug data
#5486 Fix regression in AnsiCastOpSuite with Spark 3.3.0
#5436 Support map_filter operator
#5471 Add implicit safeFree for RapidsBuffer
#5465 Fix query planning issue when Iceberg is used with DPP and AQE
#5459 Add test cases for casting string to date in ANSI mode
#5443 Add support for regular expressions containing octal digits greater than \200
#5468 Qualification tool: Add support for join, pandas, aggregate execs
#5473 Remove hasNan check over array_contains
#5434 Check schema compatibility when building parquet readers
#5442 Add support for regular expressions containing hexadecimal digits greater than 0x7f
#5466 [Doc] Change the picture of the query plan to text format. [skip ci]
#5310 Use C++ to parse and filter parquet footers.
#5454 QualificationTool. Add speedup information to AppSummaryInfo
#5455 Moved ShimCurrentBatchIterator so it's visible to db312 and db321
#5354 Plugin should throw same arithmetic exceptions as Spark part1
#5440 Qualification tool support for read and write execs and more, add mapping stage times to sql execs
#5431 [DOC] Update the ubuntu repo key [skip ci]
#5425 Handle readBatch changes for Spark 3.3.0
#5438 Add tests for all-null data for array_max
#5428 Make the sync marker uniform for the Avro coalescing reader
#5432 Test case insensitive reading for Parquet and CSV
#5433 [DOC] Removed mention of 30x from shims.md [skip ci]
#5424 Exclude all unicode line terminator characters from matching dot
#5426 Qualification tool: Parsing Execs to get the ExecInfo #2
#5427 Workaround to fix cuda repo key rotation in ubuntu images [skip ci]
#5419 Append my id to blossom-ci whitelist [skip ci]
#5422 xfail tests for spark 3.3.0 due to changes in readBatch
#5420 Qualification tool: Parsing Execs to get the ExecInfo #1
#5418 Add GpuEqualToNoNans and update GpuPivotFirst to use to handle PivotFirst with NaN support enabled on GPU
#5306 Support coalescing reading for avro
#5410 Update docs for removal of 311cdh
#5414 Add 320+-noncdh to Databricks to fix 321db build
#5349 Enable some repetitions for \A and \Z
#5346 ADD 321cdh shim to rapids and remove 311cdh shim
#5408 [DOC] Add rebase mode notes for databricks doc [skip ci]
#5348 Qualification tool: Skip GPU event logs
#5400 Restore test_computation_in_grpby_columns and test_struct_self_join
#5399 Update New Issue template to recommend a Discussion or Question [skip ci]
#5293 Support array_repeat
#5359 Qualification tool base plan parsing infrastructure
#5360 Revert "skip failing tests for Spark 3.3.0 (#5313)"
#5326 Update GCP doc and scripts [skip ci]
#5352 Fix spark330 build due to mapKeyNotExistError changed
#5317 Support arrays_zip
#5316 Support ANSI mode for ToUnixTimestamp, UnixTimestamp, GetTimestamp, DateAddInterval
#5319 Re-enable support for \Z in regular expressions on the GPU
#5315 Simplify conditional catalyst expressions generated by udf-compiler
#5301 Support existence join type for broadcast nested loop join
#5313 skip failing tests for Spark 3.3.0
#5311 Add information about the discussion board to the README and FAQ [skip ci]
#5308 Remove unused ColumnViewUtil
#5289 Re-enable dollar ($) line anchor in regular expressions in find mode
#5274 Perform explicit UnsafeRow projection in ColumnarToRow transition
#5297 GpuStringSplit now honors thespark.rapids.sql.regexp.enabled configuration option
#5307 Remove compatibility guide reference to issue #4060
#5298 Qualification tool: Operator mapping from plugin to CSV file
#5266 Update Outdated GCP getting started guide[skip ci]
#5300 Fix DIST_JAR PATH in coverage-report [skip ci]
#5290 Add documentation about reporting security issues [skip ci]
#5277 Support multiple datatypes in TypeSig.withPsNote()
#5296 Fix spark330 build due to removal of isElementAt parameter from mapKeyNotExistError
#5291 fix dead links in shims.md [skip ci]
#5276 fix markdown check issue[skip ci]
#5270 Include dependency of common jar in tools jar
#5265 Remove unused generic types
#5288 Temporarily xfail tests to restore premerge builds
#5287 Fix nightly scripts to deploy w/ classifier correctly [skip ci]
#5134 Support division on ANSI interval types
#5279 Add test case for ANSI pmod and ANSI Remainder
#5284 Enable support for escaping the right square bracket
#5280 [BUG] Fix incorrect plugin nightly deployment and release [skip ci]
#5249 Use a bundled spark-rapids-jni dependency instead of external cudf dependency
#5268 [BUG] When ASYNC is enabled GDS needs to handle cudaMalloced bounce buffers
#5230 Update csv float tests to reflect changes in precision in cuDF
#5001 Add fuzzing test for JSON reader
#5155 Support casting between day-time interval and string
#5247 Fix test failure caused by change in Spark 3.3 exception
#5254 Fix the integration test of collect_list_reduction
#5243 Throw again after logging that RMM could not intialize
#5105 Support multiplication on ANSI interval types
#5171 Fix the bug COALESCING reading does not work for v2 parquet/orc datasource
#5157 Update the log warning of UDF compiler
#5213 Support sample on ANSI interval types
#5218 XFAIL tests that are failing due to issue 5211
#5202 Profiling tool: Remove gettingResultTime from stages & jobs aggregation
#5201 Fix merge conflict from branch-22.04
#5195 Refactor Spark33XShims to avoid code duplication
#5185 Fix test failure with Spark 3.3 by looking for less specific error message
#4992 Support Collect-like Reduction Aggregations
#5193 Fix auto merge conflict 5192 [skip ci]
#5020 Support arithmetic operators on ANSI interval types
#5174 Fix auto merge conflict 5173 [skip ci]
#5168 Fix auto merge conflict 5166
#5151 Remove NvcompLZ4CompressionCodec single-buffer APIs
#5132 Add count support for all types
#5141 Upgrade to UCX 1.12.1 for 22.06
#5143 Fix merge conflict with branch-22.04
#5144 Adapt to storage-partitioned join additions in SPARK-37377
#5139 Make mvn-verify check name more descriptive [skip ci]
#5136 Fix GpuRegExExtract about inconsistent to Spark
#5107 Fix GpuFileFormatDataWriter failing to stat file after commit
#5124 Fix ShimVectorizedColumnReader construction for recent Spark 3.3.0 changes
#5047 Change Cast.toString as "cast" instead of "ansi_cast" under ANSI mode
#5089 Enable regular expressions containing \s and \S
#5087 Add support for regexp_replace with back-references
#5110 Appending my id (mattahrens) to the blossom-ci whitelist [skip ci]
#5090 Add nvtx ranges around pre, agg, and post steps in hash aggregate
#5092 Remove single-buffer compression codec APIs
#5093 Fix leak when GDS buffer store closes
#5067 Premerge databricks CI autotrigger [skip ci]
#5083 Remove EMRShimVersion
#5076 Unshim cache serializer and other 311+-all code
#5074 Make ASYNC the default allocator for 22.06
#5073 Add in nvtx ranges for parquet filterBlocks
#5077 Change Scala style continuation indentation to be 2 spaces to match guide [skip ci]
#5070 Fix merge from 22.04 to 22.06
#5046 Init 22.06.0-SNAPSHOT
#5059 Fix merge from 22.04 to 22.06
#5036 Unshim many expressions
#4993 PCBS and Parquet support ANSI year month interval type
#5031 Unshim many SparkShim interfaces
#5027 Fix merge of branch-22.04 to branch-22.06
#5022 Unshim many Pandas execs
#5013 Unshim GpuRowBasedScalaUDF
#5012 Unshim GpuOrcScan and GpuParquetScan
#5010 Unshim GpuSumDefaults
#5007 Remove schema utils, case class copying, file partition, and legacy statistical aggregate shims
#4999 Enable automerge from branch-22.04 to branch-22.06 [skip ci]

Release 22.04

Features

#4734 [FEA] Support approx_percentile in reduction context
#1922 [FEA] Support ORC forced positional evolution
#123 [FEA] add in support for dayfirst formats in the CSV parser
#4863 [FEA] Improve timestamp support in JSON and CSV readers
#4935 [FEA] Support reading Avro: primitive types
#4915 [FEA] Drop support for Spark 3.0.1, 3.0.2, 3.0.3, Databricks 7.3 ML LTS
#4815 [FEA] Support org.apache.spark.sql.catalyst.expressions.ArrayExists
#3245 [FEA] GpuGetMapValue should support all valid value data types and non-complex key types
#4914 [FEA] Support for Databricks 10.4 ML LTS
#4945 [FEA] Support filter and comparisons on ANSI day time interval type
#4004 [FEA] Add support for percent_rank
#1111 [FEA] support spark.sql.legacy.timeParserPolicy when parsing CSV files
#4849 [FEA] Support parsing dates in JSON reader
#4789 [FEA] Add Spark 3.1.4 shim
#4646 [FEA] Make JSON parsing of NaN and Infinity values fully compatible with Spark
#4824 [FEA] Support reading decimals from JSON and CSV
#4814 [FEA] Support element_at with non-literal index
#4816 [FEA] Support org.apache.spark.sql.catalyst.expressions.GetArrayStructFields
#3542 [FEA] Support str_to_map function
#4721 [FEA] Support regular expression delimiters for str_to_map
#4791 Update Spark 3.1.3 to be released
#4712 [FEA] Allow to partition on Decimal 128 when running on the GPU
#4762 [FEA] Improve support for reading JSON integer types
#4696 [FEA] Support casting map to string
#1572 [FEA] Add in decimal support for pmod, remainder and divide
#4763 [FEA] Improve support for reading JSON boolean types
#4003 [FEA] Add regular expression support to GPU implementation of StringSplit
#4626 [FEA] cannot run on GPU because unsupported data types in 'partitionSpec'
#33 [FEA] hypot SQL function
#4515 [FEA] Set RMM async allocator as default

Performance

#3026 [FEA] [Audit]: Set the list of read columns in the task configuration to reduce reading of ORC data
#4895 Add support for structs in GpuScalarSubquery
#4393 [BUG] Columnar to Columnar transfers are very slow
#589 [FEA] Support ExistenceJoin
#4784 [FEA] Improve copying decimal data from CPU columnar data
#4685 [FEA] Avoid regexp cost in string_split for escaped characters
#4777 Remove input upcast in GpuExtractChunk32
#4722 Optimize DECIMAL128 average aggregations
#4645 [FEA] Investigate ASYNC allocator performance with additional queries
#4539 [FEA] semaphore optimization in shuffled hash join
#2441 [FEA] Use AST for filter in join APIs

Bugs Fixed

#5233 [BUG] rapids-tools v22.04.0 release jar reports maven dependency issue : rapids-4-spark-common_2.12:jar:22.04.0 NOT FOUND
#5183 [BUG] UCX EGX integration test array_test.py::test_array_exists failures
#5180 [BUG] create_map failed with java.lang.IllegalStateException: This is not supported yet
#5181 [BUG] Dataproc tests failing when trying to detect for accelerated row conversions
#5154 [BUG] build failed in databricks 10.4 runtime (updated recently)
#5159 [BUG] Approx percentile query fails with UnsupportedOperationException
#5164 [BUG] Databricks 9.1ML failed with "java.lang.NoSuchMethodError: org.apache.spark.sql.execution.metric.SQLMetrics$.createSizeMetric"
#5125 [BUG] GpuCast.hasSideEffects does not check if child expression has side effects
#5091 [BUG] Profiling tool fails process custom task accumulators of type CollectionAccumulator
#5050 [BUG] Release build of v22.04.0 FAILED on "Execution attach-javadoc failed: NullPointerException" with maven option '-P source-javadoc'
#5035 [BUG] Different CSV parsing behavior between 22.04 and 22.02
#5065 [BUG] spark330+ build error due to SPARK-37463
#5019 [BUG] udf compiler failed to translate UDF in spark-shell
#5048 [BUG] OOM for q18 of TPC-DS benchmark testing on Spark2a
#5038 [BUG] When spark.rapids.sql.regexp.enabled is on in 22.04 snapshot jars, Reading a Delta table in Databricks may cause driver error
#5023 [BUG] When+sequence could trigger "Illegal sequence boundaries" error
#5021 [BUG] test_cache_reverse_order failed
#5003 [BUG] Cloudera 3.1.1 tests fail due to ClouderaShimVersion
#4960 [BUG] Spark 3.3 IT cache_test:test_passing_gpuExpr_as_Expr failure
#4913 [BUG] Fall back to the CPU if we see a scale on Ceil or Floor
#4806 [BUG] When running xgboost training, if PCBS is enabled, it fails with java.lang.AssertionError
#4542 [BUG] test_write_round_trip failed Maximum pool size exceeded
#4911 [BUG][Audit] [SPARK-38314] - Fail to read parquet files after writing the hidden file metadata
#4936 [BUG] databricks nightly window_function_test failures
#4931 [BUG] Spark 3.3 IT test cache_test.py::test_passing_gpuExpr_as_Expr fails with IllegalArgumentException
#4710 [BUG] cudaErrorIllegalAddress for q95 (3TB) on GCP with ASYNC allocator
#4918 [BUG] databricks nightly build failed
#4826 [BUG] cache_test failures when testing with 128-bit decimal
#4855 [BUG] Shim tests in sql-plugin module are not running
#4487 [BUG] regexp_find hangs with some patterns
#4486 [BUG] Regular expressions with hex digits not working as expected
#4879 [BUG] [SPARK-38237][SQL] ClusteredDistribution clustering keys break build with wrong arguments
#4883 [BUG] row-based_udf_test.py::test_hive_empty_* fail nightly tests
#4876 [BUG] Nightly build failed on Databricks with "pip: No such file or directory"
#4739 [BUG] Plugin will crash with query > 100 columns on pascal GPU
#4840 [BUG] test_dpp_via_aggregate_subquery_aqe_off failed with table already exists
#4841 [BUG] test_compress_write_round_trip failed on Spark 3.3
#4668 [FEA][Audit] - [SPARK-37750][SQL] ANSI mode: optionally return null result if element not exists in array/map
#3971 [BUG] udf-examples dependencies are incorrect
#4022 [BUG] Ensure shims.v2.ParquetCachedBatchSerializer and similar classes are at most package-private
#4526 [BUG] Short circuit AND/OR in ANSI mode
#4787 [BUG] Dataproc notebook IT test failure - NoSuchMethodError: org.apache.spark.network.util.ByteUnit.toBytes
#4704 [BUG] Update the premerge and nightly tests after moving the UDF example to external repository
#4795 [BUG] Read ORC does not ignoreCorruptFiles
#4802 [BUG] GPU CSV read does not honor ignoreCorruptFiles or ignoreMissingFiles
#4803 [BUG] GPU JSON read does not honor ignoreCorruptFiles or ignoreMissingFiles
#1986 [BUG] CSV reading null inconsistent between spark.rapids.sql.format.csv.enabled=true&false
#126 [BUG] CSV parsing large number values overflow
#4759 [BUG] Profiling tool can miss datasources when they are GPU reads
#4798 [BUG] Integration test builds failing with worker_id not found
#4727 [BUG] Read Parquet does not ignoreCorruptFiles
#4744 [BUG] test_groupby_std_variance_partial_replace_fallback failed
#4761 [BUG] test_simple_partitioned_read failed on Spark 3.3
#2071 [BUG] parsing invalid boolean CSV values return true instead of null
#4749 [BUG] test_write_empty_parquet_round_trip failed
#4730 [BUG] python UDF tests are leaking
#4290 [BUG] Investigate q32 and q67 for decimals potential regression
#4409 [BUG] Possible race condition in regular expression support for octal digits
#4728 [BUG] test_mixed_compress_read orc_test.py failures
#4736 [BUG] buildall --profile=321 fails on missing spark301 rapids-4-spark-sql dependency
#4702 [BUG] cache_test.py failed w/ cache.serializer in spark 3.3.0
#4031 [BUG] Spark 3.3.0 test failure: NoSuchMethodError org.apache.orc.TypeDescription.getAttributeValue
#4664 [BUG] MortgageAdaptiveSparkSuite failed with duplicate buffer exception
#4564 [BUG] map_test ansi failed in spark330
#119 [BUG] LIKE does not work if null chars are in the string
#124 [BUG] CSV/JSON Parsing some float values results in overflow
#4045 [BUG] q93 failed in this week's NDS runs
#4488 [BUG] isCastingStringToNegDecimalScaleSupported seems set wrong for some Spark versions

PRs

#5251 Update 22.04 changelog to latest [skip ci]
#5232 Fix issue in GpuArrayExists where a parent view outlived the child
#5239 Fix tools depending on the common jar
#5205 Update 22.04 changelog to latest [skip ci]
#5190 Fix column->row conversion GPU check:
#5184 Fix CPU fallback for Map lookup
#5191 Update version-def to use released cudfjni 22.04.0 [skip ci]
#5167 Update cudfjni version to released 22.04.0
#5169 Terminate test earlier if pytest ENV issue [skip ci]
#5160 Fix approximate percentile reduction UnsupportedOperationException
#5165 Update Databricks 10.4 for changes to the QueryStageExec and ClusteredDistribution
#4997 Update docs for the 22.04 release[skip ci]
#5146 Support env var INTEGRATION_TEST_VERSION to override shim version
#5103 Init 22.04 changelog [skip ci]
#5122 Disable GPU accelerated row-column transpose for Pascal GPUs:
#5127 GpuCast.hasSideEffects now checks to see if the child expression has side-effects
#5118 On task failure catch some CUDA exceptions and kill executor
#5069 Update for the public release [skip ci]
#5097 Implement hasSideEffects for GpuGetArrayItem, GpuElementAt, GpuGetMapValue, GpuUnaryMinus, and GpuAbs
#5079 Disable spark snapshot shims pre-merge build in 22.04
#5094 Fix profiling tool reading collectionAccumulator
#5078 Disable JSON and CSV floating-point reads by default
#4961 Support approx_percentile in reduction context
#5062 Update Spark 2.x explain API with changes in 22.04
#5066 Add getOrcSchemaString for OrcShims
#5030 Fix regression from 21.12 where udfs defined in repl no longer worked
#5051 Revert "Replace ParquetFileReader.readFooter with open() and getFooter "
#5052 Work around incompatibility between Databricks Delta loads and GpuRegExpExtract
#4972 Add support for ORC forced positional evolution
#5042 Implement hasSideEffects for GpuSequence
#5040 Fix missing imports for 321db shim
#5033 Removed limit from the test
#4938 Improve compatibility when reading timestamps from JSON and CSV sources
#5026 Update RoCE doc URL [skip ci]
#4976 Replace ParquetFileReader.readFooter with open() and getFooter
#4989 Use conf.useCompression config to decide if we should be compressing the cache
#4956 Add avro reader support
#5009 Remove references of shims folder in docs [skip ci]
#5004 Add ClouderaShimVersion to unshimmed files
#4971 Fall back to the CPU for non-zero scale on Ceil or Floor functions
#4996 Fix collect_set on struct type
#4998 Added the id back for struct children to make them unique
#4995 Include 321db shim in distribution build [skip ci]
#4981 Update doc for CSV reading interval
#4973 Implement support for ArrayExists expression
#4988 Remove support for Spark 3.0.x
#4955 Add UDT support to ParquetCachedBatchSerializer (CPU)
#4994 Add databricks 10.4 build in pre-merge
#4990 Remove 30X permerge support for version 22.04 and above [skip ci]
#4958 Add independent mvn verify check [skip ci]
#4933 Set OrcConf.INCLUDE_COLUMNS for ORC reading
#4944 Support for non-string key-types for GetMapValue and element_at()
#4974 Add shim for Databricks 10.4
#4907 Add markdown check action
#4977 Add missing 314 to buildall script
#4927 Support reading ANSI day time interval type from CSV source
#4965 Documentation: add example python api call for ExplainPlan.explainPotentialGpuPlan [skip ci]
#4957 Document agg pushdown on ORC file limitation [skip ci]
#4946 Support predictors on ANSI day time interval type
#4952 Have a fixed GPU memory size for integration tests
#4954 Fix of failing to read parquet files after writing the hidden file metadata in
#4953 Add Decimal 128 as a supported type in partition by for databricks running window
#4941 Use new list reduction API to improve performance
#4926 Support DayTimeIntervalType in ParquetCachedBatchSerializer
#4947 Fallback to ARENA if ASYNC configured and driver < 11.5.0
#4934 Replace MetadataAttribute with FileSourceMetadataAttribute to follow the update in Spark for 3.3.0+
#4942 Fix window rank integration tests on
#4928 Disable regular expressions on GPU by default
#4923 Support GpuScalarSubquery on nested types
#4924 Implement percent_rank() on GPU
#4853 Improve date support in JSON and CSV readers
#4930 Add in support for sorting arrays with structs in sort_array
#4861 Add Apache Spark 3.1.4-SNAPSHOT Shims
#4925 Remove unused Spark322PlusShims
#4921 Add DatabricksShimVersion to unshimmed class list
#4917 Default some configs to protect against cluster settings in integration tests
#4922 Add support for decimal 128 for db and spark 320+
#4919 Case-insensitive PR title check [skip ci]
#4796 Implement ExistenceJoin Iterator using an auxiliary left semijoin
#4857 Transition to v2 shims [Databricks]
#4899 Fixed Decimal 128 bug in ParquetCachedBatchSerializer
#4810 Support ANSI intervals to/from Parquet
#4909 Make ARENA the default allocator for 22.04
#4856 Enable shim tests in sql-plugin module
#4880 Bump hadoop-client dependency to 3.1.4
#4825 Initial support for reading decimal types from JSON and CSV
#4859 Fallback to CPU when Spark pushes down Aggregates (Min/Max/Count) for ORC
#4872 Speed up copying decimal column from parquet buffer to GPU buffer
#4904 Relocate Hive UDF Classes
#4871 Minor changes to print revision differences when building shims
#4882 Disable write/read Parquet when Parquet field IDs are used
#4858 Support non-literal index for GpuElementAt and GpuGetArrayItem
#4875 Support running GetArrayStructFields on GPU
#4885 Enable fuzz testing for Regular Expression repetitions and move remaining edge cases to CPU
#4869 Support for hexadecimal digits in regular expressions on the GPU
#4854 Avoid regexp_cost with stringSplit on the GPU using transpilation
#4888 Clean up leak detection code
#4901 fix a broken link in CONTRIBUTING.md[skip ci]
#4891 update getting started doc because aws-emr 6.5.0 released[skip ci]
#4881 Fix compilation error caused by ClusteredDistribution parameters
#4890 Integration-test tests jar for hive UDF tests
#4878 Set conda/mamba default to Python version to 3.8 [skip ci]
#4874 Fix spark-tests syntax issue [skip ci]
#4850 Also check cuda runtime version when using the ASYNC allocator
#4851 Add worker ID to temporary table names in tests
#4847 Fix test_compress_write_round_trip failure on Spark 3.3
#4848 Profile tool: fix printing of task failed reason
#4636 Support str_to_map
#4835 Trim parquet_write_test to reduce integration test runtime
#4819 Throw exception if casting from double to datetime
#4838 Trim cache tests to improve integration test time
#4839 Optionally return null if element not exists map/array
#4822 Push decimal workarounds to cuDF
#4619 Move the udf-examples module to the external repository spark-rapids-examples
#4844 Update spark313 dep to released one
#4827 Make InternalExclusiveModeGpuDiscoveryPlugin and ExplainPlanImpl as protected class.
#4836 Support WindowExec partitioning by Decimal 128 on the GPU
#4760 Short circuit AND/OR in ANSI mode
#4829 Make bloopInstall version configurable in buildall
#4823 Reduce redundancy of decimal testing
#4715 Patterns such (3?)+ should now fall back to CPU
#4809 Add ignoreCorruptFiles for ORC readers
#4790 Improve JSON and CSV parsing of integer values
#4812 Default integration test configs to allow negative decimal scale
#4805 Avoid output cast by using unsigned type output for GpuExtractChunk32
#4804 Profiling tool can miss datasources when they are GPU reads
#4797 Do not check for metadata during schema comparison
#4785 Support casting Map to String
#4794 Decimal-128 support for mod and pmod
#4799 Fix failure to generate worker_id when xdist is not present
#4742 Add ignoreCorruptFiles feature for Parquet reader
#4792 Ensure GpuM2 merge aggregation does not produce a null mean or m2
#4770 Improve columnarCopy for HostColumnarToGpu
#4776 Improve aggregation performance of average on DECIMAL128 columns
#4786 Add shims to compare ORC TypeDescription
#4780 Improve JSON and CSV support for boolean values
#4778 Decrease chance of random collisions in test temporary paths
#4782 Check in host leak detection code
#4781 Add Spark properties table to profiling tool output
#4714 Add regular expression support to string_split
#4754 Close SpillableBatch to avoid leaks
#4758 Fix merge conflict with branch-22.02 [skip ci]
#4694 Add clarifications and details to integration-tests README [skip ci]
#4740 Enable regular expressions on GPU by default
#4735 Re-enables partial regex support for octal digits on the GPU
#4737 Check for a null compression codec when creating ORC OutStream
#4738 Change resume-from to aggregator in buildall [skip ci]
#4698 Add tests for few json options
#4731 Trim join tests to improve runtime of tests
#4732 Fix failing serializer tests on Spark 3.3.0
#4709 Update centos 8 dockerfile to handle EOL issue [skip ci]
#4724 Debug dump to Parquet support for DECIMAL128 columns
#4688 Optimize DECIMAL128 sum aggregations
#4692 Add FAQ entry to discuss executor task concurrency configuration [skip ci]
#4588 Optimize semaphore acquisition in GpuShuffledHashJoinExec
#4697 Add preliminary test and test framework changes for ExistanceJoin
#4716 GpuStringSplit should return an array on not-null elements
#4611 Support BitLength and OctetLength
#4408 Use the ORC version that corresponds to the Spark version
#4686 Fall back to CPU for queries referencing hidden metadata columns
#4669 Prevent deadlock between RapidsBufferStore and RapidsBufferBase on close
#4707 Fix auto merge conflict 4705 [skip ci]
#4690 Fix map_test ANSI failure in Spark 3.3.0
#4681 Reimplement check for non-regexp strings using RegexParser
#4683 Fix documentation link, clarify documentation [skip ci]
#4677 Make Collect, first and last as deterministic aggregate functions for Spark-3.3
#4682 Enable test for LIKE with embedded null character
#4673 Allow GpuWindowExec to partition on structs
#4637 Improve support for reading CSV and JSON floating-point values
#4629 Remove shims module
#4648 Append new authorized user to blossom-ci safelist
#4623 Fallback to CPU when aggregate push down used for parquet
#4606 Set default RMM pool to ASYNC for cuda 11.2+
#4531 Use libcudf mixed joins for conditional hash semi and anti joins
#4624 Enable integration test results report on Jenkins [skip ci]
#4597 Update plugin version to 22.04.0-SNAPSHOT
#4592 Adds SQL function HYPOT using the GPU
#4504 Implement AST-based regular expression fuzz tests
#4560 Make shims.v2.ParquetCachedBatchSerializer as protected

Release 22.02

Features

#4305 [FEA] write nvidia tool wrappers to allow old YARN versions to work with MIG
#4410 [FEA] ReplicateRows - Support ReplicateRows for decimal 128 type
#4360 [FEA] Add explain api for Spark 2.X
#3541 [FEA] Support max on single-level struct in aggregation context
#4238 [FEA] Add a Spark 3.X Explain only mode to the plugin
#3952 [Audit] [FEA][SPARK-32986][SQL] Add bucketed scan info in query plan of data source v1
#4412 [FEA] Improve support for \A, \Z, and \z in regular expressions
#3979 [FEA] Improvements for CPU(Row) based UDF
#4467 [FEA] Add support for regular expression with repeated digits (\d+, \d*, \d?)
#4439 [FEA] Enable GPU broadcast exchange reuse for DPP when AQE enabled
#3512 [FEA] Support org.apache.spark.sql.catalyst.expressions.Sequence
#3475 [FEA] Spark 3.2.0 reads Parquet unsigned int64(UINT64) as Decimal(20,0) but CUDF does not support it
#4091 [FEA] regexp_replace: Improve support for ^ and $
#4104 [FEA] Support org.apache.spark.sql.catalyst.expressions.ReplicateRows
#4027 [FEA] Support SubqueryBroadcast on GPU to enable exchange reuse during DPP
#4284 [FEA] Support idx = 0 in GpuRegExpExtract
#4002 [FEA] Implement regexp_extract on GPU
#3221 [FEA] Support GpuFirst and GpuLast on nested types under reduction aggregations
#3944 [FEA] Full support for sum with overflow on Decimal 128
#4028 [FEA] support GpuCast from non-nested ArrayType to StringType
#3250 [FEA] Make CreateMap duplicate key handling compatible with Spark and enable CreateMap by default
#4170 [FEA] Make regular expression behavior with $ and \r consistent with CPU
#4001 [FEA] Add regexp support to regexp_replace
#3962 [FEA] Support null characters in regular expressions in RLIKE
#3797 [FEA] Make RLike support consistent with Apache Spark

Performance

#4392 [FEA] could the parquet scan code avoid acquiring the semaphore for an empty batch?
#679 [FEA] move some deserialization code out of the scope of the gpu-semaphore to increase cpu concurrent
#4350 [FEA] Optimize the all-true and all-false cases in GPU If and CaseWhen
#4309 [FEA] Leverage cudf conditional nested loop join to implement semi/anti hash join with condition
#4395 [FEA] acquire the semaphore after concatToHost in GpuShuffleCoalesceIterator
#4134 [FEA] Allow EliminateJoinToEmptyRelation in GpuBroadcastExchangeExec
#4189 [FEA] understand why between is so expensive

Bugs Fixed

#4316 [BUG] Exception: Unable to find py4j, your SPARK_HOME may not be configured correctly intermittently
#4725 [DOC] Broken links in guide doc
#4675 [BUG] Jenkins integration build timed out at 10 hours
#4665 [BUG] Spark321Shims.getParquetFilters failed with NoSuchMethodError
#4635 [BUG] nvidia-smi wrapper script ignores ENABLE_NON_MIG_GPUS=1 on a heterogeneous multi-GPU machine
#4500 [BUG] Build failures against Spark 3.2.1 rc1 and make 3.2.1 non snapshot
#4631 [BUG] Release build with mvn option -P source-javadoc FAILED
#4625 [BUG] NDS query 5 fails with AdaptiveSparkPlanExec assertion
#4632 [BUG] Build failing for Spark 3.3.0 due to deprecated method warnings
#4599 [BUG] test_group_apply_udf and test_group_apply_udf_more_types hangs on Databricks 9.1
#4600 [BUG] crash if we have a decimal128 in a struct in an array
#4581 [BUG] Build error "GpuOverrides.scala:924: wrong number of arguments" on DB9.1.x spark-3.1.2
#4593 [BUG] dup GpuHashJoin.diff case-folding issue
#4559 [BUG] regexp_replace with replacement string containing \ can produce incorrect results
#4503 [BUG] regexp_replace with back references produces incorrect results on GPU
#4567 [BUG] Profile tool hangs in compare mode
#4315 [BUG] test_hash_reduction_decimal_overflow_sum[30] failed OOM in integration tests
#4551 [BUG] protobuf-java version changed to 3.x
#4499 [BUG]GpuSequence blows up when nulls exist in any of the inputs (start, stop, step)
#4454 [BUG] Shade warnings when building the tools artifact
#4541 [BUG] Column vector leak in conditionals_test.py
#4514 [BUG] test_hash_reduction_pivot_without_nans failed
#4521 [BUG] Inconsistencies in handling of newline characters and string and line anchors
#4548 [BUG] ai.rapids.cudf.CudaException: an illegal instruction was encountered in databricks 9.1
#4475 [BUG] \D and \W match newline in Spark but not in cuDF
#1866 [BUG] GpuFileFormatWriter does not close the data writer
#4524 [BUG] RegExp transpiler fails to detect some choice expressions that cuDF cannot compile
#3226 [BUG]OOM happened when do cube operations
#2504 [BUG] OOM when running NDS queries with UCX and GDS
#4273 [BUG] Rounding past the size that can be stored in a type produces incorrect results
#4060 [BUG] test_hash_groupby_approx_percentile_long_repeated_keys failed intermittently
#4039 [BUG] Spark 3.3.0 IT Array test failures
#3849 [BUG] In ANSI mode we can fail in cases Spark would not due to conditionals
#4445 [BUG] mvn clean prints an error message on a clean dir
#4421 [BUG] the driver is trying to load CUDA with latest 22.02
#4455 [BUG] join_test.py::test_struct_self_join[IGNORE_ORDER({'local': True})] failed in spark330
#4442 [BUG] mvn build FAILED with option -P noSnapshotsWithDatabricks
#4281 [BUG] q9 regression between 21.10 and 21.12
#4280 [BUG] q88 regression between 21.10 and 21.12
#4422 [BUG] Host column vectors are being leaked during tests
#4446 [BUG] GpuCast crashes when casting from Array with unsupportable child type
#4432 [BUG] nightly build 3.3.0 failed: HashClusteredDistribution is not a member of org.apache.spark.sql.catalyst.plans.physical
#4443 [BUG] SPARK-37705 breaks parquet filters from Spark 3.3.0 and Spark 3.2.2 onwards
#4378 [BUG] udf_test udf_cudf_test failed require_minimum_pandas_version check in spark 320+
#4423 [BUG] Build is failing due to FileScanRDD changes in Spark 3.3.0-SNAPSHOT
#4401 [BUG]array_test.py::test_array_contains failures
#4403 [BUG] NDS query 72 logs codegen fallback exception and produces incorrect results
#4386 [BUG] conditionals_test.py FAILED with side_effects_cast[Integer/Long] on Databricks 9.1 Runtime
#3934 [BUG] Dependencies of published integration tests jar are missing
#4341 [BUG] GpuCast.scala:nnn warning: discarding unmoored doc comment
#4356 [BUG] nightly spark303 deploy pulling spark301 aggregator
#4347 [BUG] Dist jar pom lists aggregator jar as dependency
#4176 [BUG] ParseDateTimeSuite UT failed
#4292 [BUG] no meaningful message is surfaced to maven when binary-dedupe fails
#4351 [BUG] Tests FAILED On SPARK-3.2.0, com.nvidia.spark.rapids.SerializedTableColumn cannot be cast to com.nvidia.spark.rapids.GpuColumnVector
#4346 [BUG] q73 decimal was twice as slow in weekly results
#4334 [BUG] GpuColumnarToRowExec will always be tagged False for exportColumnarRdd after Spark311
#4339 The parameter dataType is not necessary in resolveColumnVector method.
#4275 [BUG] Row-based Hive UDF will fail if arguments contain a foldable expression.
#4229 [BUG] regexp_replace [^a] has different behavior between CPU and GPU for multiline strings
#4294 [BUG] parquet_write_test.py::test_ts_write_fails_datetime_exception failed in spark 3.1.1 and 3.1.2
#4205 [BUG] Get different results when casting from timestamp to string
#4277 [BUG] cudf_udf nightly cudf import rmm failed
#4246 [BUG] Regression in CastOpSuite due to cuDF change in parsing NaN
#4243 [BUG] test_regexp_replace_null_pattern_fallback[ALLOW_NON_GPU(ProjectExec,RegExpReplace)] failed in databricks
#4244 [BUG] Cast from string to float using hand-picked values failed
#4227 [BUG] RAPIDS Shuffle Manager doesn't fallback given encryption settings
#3374 [BUG] minor deprecation warnings in a 3.2 shim build
#3613 [BUG] release312db profile pulls in 311until320-apache
#4213 [BUG] unused method with a misleading outdated comment in ShimLoader
#3609 [BUG] GpuShuffleExchangeExec in v2 shims has inconsistent packaging
#4127 [BUG] CUDF 22.02 nightly test failure

PRs

#4773 Update 22.02 changelog to latest [skip ci]
#4771 revert cudf api links from legacy to stable[skip ci]
#4767 Update 22.02 changelog to latest [skip ci]
#4750 Updated doc for decimal support
#4757 Update qualification tool to remove DECIMAL 128 as potential problem
#4755 Fix databricks doc for limitations.[skip ci]
#4751 Fix broken hyperlinks in documentation [skip ci]
#4706 Update 22.02 changelog to latest [skip ci]
#4700 Update cudfjni version to released 22.02.0
#4701 Decrease nighlty tests upper limitation to 7 [skip ci]
#4639 Update changelog for 22.02 and archive info of some older releases [skip ci]
#4572 Add download page for 22.02 [skip ci]
#4672 Revert "Disable 311cdh build due to missing dependency (#4659)"
#4662 Update the deploy script [skip ci]
#4657 Upmerge spark2 directory to the latest 22.02 changes
#4659 Disable 311cdh build by default because of a missing dependency
#4508 Fix Spark 3.2.1 build failures and make it non-snapshot
#4652 Remove non-deterministic test order in nightly [skip ci]
#4643 Add profile release301 when mvn help:evaluate
#4630 Fix the incomplete capture of SubqueryBroadcast
#4633 Suppress newTaskTempFile method warnings for Spark 3.3.0 build
#4618 [DB31x] Pick the correct Python runner for flatmap-group Pandas UDF
#4622 Fallback to CPU when encoding is not supported for JSON reader
#4470 Add in HashPartitioning support for decimal 128
#4535 Revert "Disable orc write by default because of https://issues.apache.org/jira/browse/ORC-1075 (#4471)"
#4583 Avoid unapply on PromotePrecision
#4573 Correct version from 21.12 to 22.02[skip ci]
#4575 Correct and update links in UDF doc[skip ci]
#4501 Switch and/or to use new cudf binops to improve performance
#4594 Resolve case-folding issue [skip ci]
#4585 Spark2 module upmerge, deploy script, and updates for Jenkins
#4589 Increase premerge databricks IDLE_TIMEOUT to 4 hours [skip ci]
#4485 Add json reader support
#4556 regexp_replace with back-references should fall back to CPU
#4569 Fix infinite loop with Profiling tool compare mode and app with no sql ids
#4529 Add support for Spark 2.x Explain Api
#4577 Revert "Fix CVE-2021-22569 (#4545)"
#4520 GpuSequence refactor
#4570 A few quick fixes to try to reduce max memory usage in the tests
#4477 Use libcudf mixed joins for conditional hash joins
#4566 remove scala-library from combined tools jar
#4552 Fix resource leak in GpuCaseWhen
#4553 Reenable test_hash_reduction_pivot_without_nans
#4530 Fix correctness issues in regexp and add \r and \n to fuzz tests
#4549 Fix typos in integration tests README [skip ci]
#4545 Fix CVE-2021-22569
#4543 Enable auto-merge from branch-22.02 to branch-22.04 [skip ci]
#4540 Remove user kuhushukla
#4434 Support max on single-level struct in aggregation context
#4534 Temporarily disable integration test - test_hash_reduction_pivot_without_nans
#4322 Add an explain only mode to the plugin
#4497 Make better use of pinned memory pool
#4512 remove hadoop version requirement[skip ci]
#4527 Fall back to CPU for regular expressions containing \D or \W
#4525 Properly close data writer in GpuFileFormatWriter
#4502 Removed the redundant test for element_at and fixed the failing one
#4523 Add more integration tests for decimal 128
#3762 Call the right method to convert table from row major <=> col major
#4482 Simplified the construction of zero scalar in GpuUnaryMinus
#4510 Update copyright in NOTICE [skip ci]
#4484 Update GpuFileFormatWriter to stay in sync with recent Spark changes, but still not support writing Hive bucketed table on GPU.
#4492 Fall back to CPU for regular expressions containing hex digits
#4495 Enable approx_percentile by default
#4420 Fix up incorrect results of rounding past the max digits of data type
#4483 Update test case of reading nested unsigned parquet file
#4490 Remove warning about RMM default allocator
#4461 [Audit] Add bucketed scan info in query plan of data source v1
#4489 Add arrays of decimal128 to join tests
#4476 Don't acquire the semaphore for empty input while scanning
#4424 Improve support for regular expression string anchors \A, \Z, and \z
#4491 Skip the test for spark versions 3.1.1, 3.1.2 and 3.2.0 only
#4459 Use merge sort for struct types in non-key columns
#4494 Append new authorized user to blossom-ci whitelist [skip ci]
#4400 Enable approx percentile tests
#4471 Disable orc write by default because of https://issues.apache.org/jira/browse/ORC-1075
#4462 Rename DECIMAL_128_FULL and rework usage of TypeSig.gpuNumeric
#4479 Change signoff check image to slim-buster [skip ci]
#4464 Throw SparkArrayIndexOutOfBoundsException for Spark 3.3.0+
#4469 Support repetition of \d and \D in regexp functions
#4472 Modify docs for 22.02 to address issue-4319[skip ci]
#4440 Enable GPU broadcast exchange reuse for DPP when AQE enabled
#4376 Add sequence support
#4460 Abstract the text based PartitionReader
#4383 Fix correctness issue with CASE WHEN with expressions that have side-effects
#4465 Refactor for shims 320+
#4463 Avoid replacing a hash join if build side is unsupported by the join type
#4456 Fix build issues: 1 clean non-exists target dirs; 2 remove duplicated plugin
#4416 Unshim join execs
#4172 Support String to Decimal 128
#4458 Exclude some metadata operators when checking GPU replacement
#4451 Some metrics improvements and timeline reporting
#4435 Disable add profile src execution by default to make the build log clean
#4436 Print error log to stderr output
#4155 Add partial support for line begin and end anchors in regexp_replace
#4428 Exhaustively iterate ColumnarToRow iterator to avoid leaks
#4430 update pca example link in ml-integration.md[skip ci]
#4452 Limit parallelism of nightly tests [skip ci]
#4449 Add recursive type checking and fallback tests for casting array with unsupported element types to string
#4437 Change logInfo to logWarning
#4447 Fix 330 build error and add 322 shims layer
#4417 Fix an Intellij debug issue
#4431 Add DateType support for AST expressions
#4433 Import the right pandas from conda [skip ci]
#4419 Import the right pandas from conda
#4427 Update getFileScanRDD shim for recent changes in Spark 3.3.0
#4397 Ignore cufile.log
#4388 Add support for ReplicateRows
#4399 Update docs for Profiling and Qualification tool to change wording
#4407 Fix GpuSubqueryBroadcast on multi-fields relation
#4396 GpuShuffleCoalesceIterator acquire semaphore after host concat
#4361 Accommodate altered semantics of cudf::lists::contains()
#4394 Use correct column name in GpuIf test
#4385 Add missing GpuSubqueryBroadcast replacement rule for spark31x
#4387 Fix auto merge conflict 4384[skip ci]
#4374 Fix the IT module depends on the tests module
#4365 Not publishing integration_tests jar to Maven Central [skip ci]
#4358 Update GpuIf to support expressions with side effects
#4382 Remove unused scallop dependency from integration_tests
#4364 Replace Scala document with Scala comment for inner functions
#4373 Add pytest tags for nightly test parallel run [skip ci]
#4150 Support GpuSubqueryBroadcast for DPP
#4372 Move casting to string tests from array_test.py and struct_test.py to cast_test.py
#4371 Fix typo in skipTestsFor330 calculation [skip ci]
#4355 Dedicated deploy-file with reduced pom in nightly build [skip ci]
#4352 Revert "Ignore failing string to timestamp tests temporarily (#4197)"
#4359 Audit - SPARK-37268 - Remove unused variable in GpuFileScanRDD [Databricks]
#4327 Print meaningful message when calling scripts in maven
#4354 Fix regression in AQE optimizations
#4343 Fix issue with binding to hash agg columns with computation
#4285 Add support for regexp_extract on the GPU
#4349 Fix PYTHONPATH in pre-merge
#4269 The option for the nightly script not deploying jars [skip ci]
#4335 Fix the issue of exporting Column RDD
#4336 Split expensive pytest files in cases level [skip ci]
#4328 Change the explanation of why the operator will not work on GPU
#4338 Use scala Int.box instead of Integer constructors
#4340 Remove the unnecessary parameter dataType in resolveColumnVector method
#4256 Allow returning an EmptyHashedRelation when a broadcast result is empty
#4333 Add tests about writing empty table to ORC/PAQUET
#4337 Support GpuFirst and GpuLast on nested types under reduction aggregations
#4331 Fix parquet options builder calls
#4310 Fix typo in shim class name
#4326 Fix 4315 decrease concurrentGpuTasks to avoid sum test OOM
#4266 Check revisions for all shim jars while build all
#4282 Use data type to create an inspector for a foldable GPU expression.
#3144 Optimize AQE with Spark 3.2+ to avoid redundant transitions
#4317 [BUG] Update nightly test script to dynamically set mem_fraction [skip ci]
#4206 Porting GpuRowToColumnar converters to InternalColumnarRDDConverter
#4272 Full support for SUM overflow detection on decimal
#4255 Make regexp pattern [^a] consistent with Spark for multiline strings
#4306 Revert commonizing the int96ParquetRebase* functions
#4299 Fix auto merge conflict 4298 [skip ci]
#4159 Optimize sample perf
#4235 Commonize v2 shim
#4274 Add tests for timestamps that overflowed before.
#4271 Skip test_regexp_replace_null_pattern_fallback on Spark 3.1.1 and later
#4278 Use mamba for cudf conda install [skip ci]
#4270 Document exponent differences when casting floating point to string [skip ci]
#4268 Fix merge conflict with branch-21.12
#4093 Add tests for regexp() and regexp_like()
#4259 fix regression in cast from string to float that caused signed NaN to be considered valid
#4241 fix bug in parsing regex character classes that start with ^ and contain an unescaped ]
#4224 Support row-based Hive UDFs
#4221 GpuCast from ArrayType to StringType
#4007 Implement duplicate key handling for GpuCreateMap
#4251 Skip test_regexp_replace_null_pattern_fallback on Databricks
#4247 Disable failing CastOpSuite test
#4239 Make EOL anchor behavior match CPU for strings ending with newline
#4153 Regexp: Only transpile once per expression rather than once per batch
#4230 Change to build tools module with all the versions by default
#4223 Fixes a minor deprecation warning
#4215 Rebalance testing load
#4214 Fix pre_merge ci_2 [skip ci]
#4212 Remove an unused method with its outdated comment
#4211 Update test_floor_ceil_overflow to be more lenient on exception type
#4203 Move all the GpuShuffleExchangeExec shim v2 classes to org.apache.spark
#4193 Rename 311until320-apache to 311until320-noncdh
#4197 Ignore failing string to timestamp tests temporarily
#4160 Fix merge issues for branch 22.02
#4081 Convert String to DecimalType without casting to FloatType
#4132 Fix auto merge conflict 4131 [skip ci]
#4099 [REVIEW] Init version 22.02.0
#4113 Fix pre-merge CI 2 conditions [skip ci]

Older Releases

Changelog of older releases can be found at docs/archives