
[FEA] Support pre-split in GPU project exec #11916

Open
firestarman opened this issue Jan 6, 2025 · 2 comments

Labels: feature request

Comments

firestarman (Collaborator) commented Jan 6, 2025

Is your feature request related to a problem? Please describe.
We hit a CPU OOM caused by a quite large batch (~5.4 GB) with more than 250 columns.

24/12/22 09:34:16 ERROR Utils: Aborting task
com.nvidia.spark.rapids.jni.CpuRetryOOM: CPU OutOfMemory: Could not split the current attempt: {GpuSCB size:5475284352, handle:buffer handle TempSpillBufferId(109845,temp_local_fae8ba55-603b-4ddf-b906-079e0717c3cb) at 9223372036854774808, rows:2541469, types:List(LongType, LongType, ..., StringType, IntegerType, IntegerType), refCount:1}
(types list truncated here; the full message and call stack are under Additional context below)

After checking the query event log, we found a big projection after a symmetric join. This projection builds about 266 columns from a batch with only about 50 columns, so the projected batch (~5 GB) grew to about 5 times the size of the input batch (~1 GB).

Describe the solution you'd like
Add pre-split support to the GPU project exec, similar to what we have done in the GPU aggregate exec to avoid producing large batches after some aggregations.
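
For illustration, a rough sketch of the idea (the helpers estimateOutputRowSize and splitByRows are hypothetical, not the plugin's actual API): estimate the projected size from the output schema and the input row count, and split the input by rows before projecting whenever it would exceed the target batch size.

import org.apache.spark.sql.types.DataType
import org.apache.spark.sql.vectorized.ColumnarBatch

def preSplitIfNeeded(
    input: ColumnarBatch,
    outputTypes: Seq[DataType],
    targetBatchSize: Long): Seq[ColumnarBatch] = {
  // Rough per-row estimate from the output types; variable-width and
  // nested types need a better heuristic (see the comments further down).
  val estimatedRowSize = outputTypes.map(estimateOutputRowSize).sum
  val estimatedTotal = estimatedRowSize * input.numRows()
  if (estimatedTotal <= targetBatchSize) {
    Seq(input)
  } else {
    // Split evenly by rows so each piece projects to roughly the target size.
    val numSplits = math.ceil(estimatedTotal.toDouble / targetBatchSize).toInt
    splitByRows(input, numSplits) // hypothetical helper splitting evenly by rows
  }
}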

Additional context
The exception call stack:

24/12/22 09:34:16 ERROR Utils: Aborting task
com.nvidia.spark.rapids.jni.CpuRetryOOM: CPU OutOfMemory: Could not split the current attempt: {GpuSCB size:5475284352, handle:buffer handle TempSpillBufferId(109845,temp_local_fae8ba55-603b-4ddf-b906-079e0717c3cb) at 9223372036854774808, rows:2541469, types:List(LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, LongType, StringType, IntegerType, IntegerType), refCount:1}
	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$AutoCloseableAttemptSpliterator.split(RmmRapidsRetryIterator.scala:494)
	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryIterator.next(RmmRapidsRetryIterator.scala:624)
	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$RmmRapidsRetryAutoCloseableIterator.next(RmmRapidsRetryIterator.scala:553)
	at com.nvidia.spark.rapids.GpuOutOfCoreSortIterator.$anonfun$splitOneSortedBatch$1(GpuSortExec.scala:463)
	at com.nvidia.spark.rapids.GpuOutOfCoreSortIterator.$anonfun$splitOneSortedBatch$1$adapted(GpuSortExec.scala:455)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
	at com.nvidia.spark.rapids.GpuOutOfCoreSortIterator.splitOneSortedBatch(GpuSortExec.scala:455)
	at com.nvidia.spark.rapids.GpuOutOfCoreSortIterator.firstPassReadBatches(GpuSortExec.scala:474)
	at com.nvidia.spark.rapids.GpuOutOfCoreSortIterator.$anonfun$next$4(GpuSortExec.scala:622)
	at com.nvidia.spark.rapids.Arm$.closeOnExcept(Arm.scala:98)
	at com.nvidia.spark.rapids.GpuOutOfCoreSortIterator.next(GpuSortExec.scala:618)
	at com.nvidia.spark.rapids.GpuOutOfCoreSortIterator.next(GpuSortExec.scala:284)
	at org.apache.spark.sql.rapids.GpuFileFormatDataWriter.writeWithIterator(GpuFileFormatDataWriter.scala:179)
	at org.apache.spark.sql.rapids.GpuFileFormatWriter$.$anonfun$executeTask$1(GpuFileFormatWriter.scala:335)
	at org.apache.spark.util.Utils$.tryWithSafeFinallyAndFailureCallbacks(Utils.scala:1652)
	at org.apache.spark.sql.rapids.GpuFileFormatWriter$.executeTask(GpuFileFormatWriter.scala:342)
	at org.apache.spark.sql.rapids.GpuFileFormatWriter$.$anonfun$write$14(GpuFileFormatWriter.scala:260)
	at org.apache.spark.scheduler.ResultTask.runTask(ResultTask.scala:90)
	at org.apache.spark.scheduler.Task.run(Task.scala:134)
	at org.apache.spark.executor.Executor$TaskRunner.$anonfun$run$3(Executor.scala:538)
	at org.apache.spark.util.Utils$.tryWithSafeFinally(Utils.scala:1618)
	at org.apache.spark.executor.Executor$TaskRunner.run(Executor.scala:541)
	at java.util.concurrent.ThreadPoolExecutor.runWorker(ThreadPoolExecutor.java:1149)
	at java.util.concurrent.ThreadPoolExecutor$Worker.run(ThreadPoolExecutor.java:624)
	at java.lang.Thread.run(Thread.java:750)
Caused by: com.nvidia.spark.rapids.jni.CpuRetryOOM: Could not complete allocation after 1000 retries
	at com.nvidia.spark.rapids.HostAlloc.alloc(HostAlloc.scala:246)
	at com.nvidia.spark.rapids.HostAlloc$.alloc(HostAlloc.scala:278)
	at com.nvidia.spark.rapids.RapidsDiskStore$RapidsDiskBuffer.getMemoryBuffer(RapidsDiskStore.scala:169)
	at com.nvidia.spark.rapids.RapidsBufferStore$RapidsBufferBase.materializeMemoryBuffer(RapidsBufferStore.scala:438)
	at com.nvidia.spark.rapids.RapidsBufferStore$RapidsBufferBase.getDeviceMemoryBuffer(RapidsBufferStore.scala:512)
	at com.nvidia.spark.rapids.RapidsBufferStore$RapidsBufferBase.getColumnarBatch(RapidsBufferStore.scala:452)
	at com.nvidia.spark.rapids.SpillableColumnarBatchImpl.$anonfun$getColumnarBatch$1(SpillableColumnarBatch.scala:127)
	at com.nvidia.spark.rapids.SpillableColumnarBatchImpl.$anonfun$withRapidsBuffer$1(SpillableColumnarBatch.scala:110)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
	at com.nvidia.spark.rapids.SpillableColumnarBatchImpl.withRapidsBuffer(SpillableColumnarBatch.scala:109)
	at com.nvidia.spark.rapids.SpillableColumnarBatchImpl.getColumnarBatch(SpillableColumnarBatch.scala:125)
	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$.$anonfun$splitSpillableInHalfByRows$2(RmmRapidsRetryIterator.scala:714)
	at com.nvidia.spark.rapids.Arm$.withResource(Arm.scala:30)
	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$.$anonfun$splitSpillableInHalfByRows$1(RmmRapidsRetryIterator.scala:708)
	at com.nvidia.spark.rapids.RmmRapidsRetryIterator$AutoCloseableAttemptSpliterator.split(RmmRapidsRetryIterator.scala:478)
	... 24 more
@firestarman added the ? - Needs Triage and feature request labels Jan 6, 2025
@firestarman changed the title from [FEA] OOM due a big projection to [FEA] OOM due to a big projection Jan 6, 2025
@firestarman changed the title from [FEA] OOM due to a big projection to [FEA] Support pre-split in GPU project exec Jan 6, 2025
revans2 (Collaborator) commented Jan 6, 2025

For a project we just need to be careful with window operations, and also with the performance impact this can have.

For some window operations we need to have the entire window in a single batch to be able to process the data. That is achieved with the childrenCoalesceGoal. We can also tell other nodes in the plan what our output batching is like with the outputBatching method. For example:

override def childrenCoalesceGoal: Seq[CoalesceGoal] = Seq(outputBatching)

override def outputBatching: CoalesceGoal = if (gpuPartitionSpec.isEmpty) {
  RequireSingleBatch
} else {
  BatchedByKey(gpuPartitionOrdering)(cpuPartitionOrdering)
}

A ProjectExec can be inserted before the window and after the sort in some cases.

GpuProjectExec(pre.toList, childPlans.head.convertIfNeeded())

We mainly need to make sure that if we split the input batch into smaller batches, we do not mark Project as preserving the batching.

override def outputPartitioning: Partitioning = child.outputPartitioning

If we update the code to do a pre-project split, we could also update it to split on retry as well.

boundProjectList.projectAndCloseWithRetrySingleBatch(sb)
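
A sketch of what split-on-retry could look like with the existing retry framework (assuming withRetry and splitSpillableInHalfByRows from RmmRapidsRetryIterator, and that boundProjectList exposes a plain project method; exact signatures may differ across plugin versions):

import com.nvidia.spark.rapids.Arm.withResource
import com.nvidia.spark.rapids.RmmRapidsRetryIterator.{splitSpillableInHalfByRows, withRetry}

// sb is a SpillableColumnarBatch; on a retry OOM the framework splits it
// in half by rows and re-runs the projection on the smaller pieces.
val projected: Iterator[ColumnarBatch] =
  withRetry(sb, splitSpillableInHalfByRows) { attempt =>
    withResource(attempt.getColumnarBatch()) { cb =>
      boundProjectList.project(cb)
    }
  }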

When I first put in the pre-split code for hash aggregate I also implemented it for project, but I saw a large performance regression, so I reverted that part of the code. Please make sure that we measure for performance regressions here, especially around window operations that require all of the data for a window to be in a single batch.

@mattahrens removed the ? - Needs Triage label Jan 7, 2025
firestarman (Collaborator, Author) commented Jan 14, 2025

After enabling the pre-split for Project, we hit another case where some expressions of complex type in the project list had their sizes wrongly estimated, which produced a ~3 GB batch and led to OOM when performing the project.
A toy query to reproduce this issue follows.

scala> spark.range(100).selectExpr("cast(id as long)").createOrReplaceTempView("tt")

scala> sql("select map('sid', id, '1', id), struct('s1', id), array(1, id) from tt").collect
25/01/13 06:31:13 WARN GpuOverrides: 
*Exec <ProjectExec> will run on GPU
  *Expression <Alias> map(sid, id#0L, 1, id#0L) AS map(sid, id, 1, id)#4 will run on GPU
    *Expression <CreateMap> map(sid, id#0L, 1, id#0L) will run on GPU
  *Expression <Alias> struct(col1, s1, id, id#0L) AS struct(s1, id)#5 will run on GPU
    *Expression <CreateNamedStruct> struct(col1, s1, id, id#0L) will run on GPU
  *Expression <Alias> array(1, id#0L) AS array(1, id)#6 will run on GPU
    *Expression <CreateArray> array(1, id#0L) will run on GPU
  *Exec <RangeExec> will run on GPU

===>estimated size: 1204, actual size: 3316, splitUntilSize 1.395081216E9, numSplits 1

We need to rework the estimation code to cover at least these cases.
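
For example, the estimate could recurse over the output DataType so nested and variable-width types are at least accounted for; a rough sketch (the entries-per-row and average string length below are placeholder assumptions, not measured values):

import org.apache.spark.sql.types._

val assumedEntriesPerRow = 4.0   // assumption: average map/array entries per row
val assumedAvgStringBytes = 16.0 // assumption: average string payload per row

def estimatedBytesPerRow(dt: DataType): Double = dt match {
  case StructType(fields) =>
    fields.map(f => estimatedBytesPerRow(f.dataType)).sum
  case ArrayType(elem, _) =>
    4 + assumedEntriesPerRow * estimatedBytesPerRow(elem) // offsets + elements
  case MapType(k, v, _) =>
    4 + assumedEntriesPerRow * (estimatedBytesPerRow(k) + estimatedBytesPerRow(v))
  case StringType =>
    4 + assumedAvgStringBytes // offsets plus assumed payload
  case t =>
    t.defaultSize.toDouble // fixed-width primitives
}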
