Improve memory usage in scio-extra's sortValues API #5338

clairemcginty · 2024-04-16T19:22:14Z

A pipeline on Scio 0.14.3 migrated from applying Beam's SortValues transform directly:

data
  .map { element => KV.of(..., KV.of(..., ...)) }
  .applyTransform(GroupByKey.create())
  .setCoder(...)
  .applyTransform(
    SortValues.create(
      BufferedExternalSorter
        .options()
        .withMemoryMB(2047)
        .withExternalSorterType(ExternalSorter.Options.SorterType.NATIVE)
    )
  )

to the scio-extra API:

data
  .map { element => (..., (..., ...)) }
  .groupByKey
  .sortValues(2047)

and immediately ran out of memory on Dataflow:

Caused by: java.lang.OutOfMemoryError: GC overhead limit exceeded
	org.apache.beam.runners.dataflow.worker.ActiveMessageMetadata.create(ActiveMessageMetadata.java:30)
	org.apache.beam.runners.dataflow.worker.DataflowExecutionContext$DataflowExecutionStateTracker.enterState(DataflowExecutionContext.java:320)
	org.apache.beam.runners.dataflow.worker.GroupingShuffleReader$GroupingShuffleReaderIterator$ValuesIterator.hasNext(GroupingShuffleReader.java:429)
	org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.PeekingReiterator.computeNext(PeekingReiterator.java:98)
	org.apache.beam.runners.dataflow.worker.repackaged.org.apache.beam.runners.core.PeekingReiterator.hasNext(PeekingReiterator.java:52)
	org.apache.beam.runners.dataflow.worker.util.BatchGroupAlsoByWindowViaIteratorsFn$WindowReiterator.skipToValidElement(BatchGroupAlsoByWindowViaIteratorsFn.java:232)
	org.apache.beam.runners.dataflow.worker.util.BatchGroupAlsoByWindowViaIteratorsFn$WindowReiterator.hasNext(BatchGroupAlsoByWindowViaIteratorsFn.java:208)
	scala.collection.convert.Wrappers$JIteratorWrapper.hasNext(Wrappers.scala:45)
	scala.collection.Iterator.foreach(Iterator.scala:943)
	scala.collection.Iterator.foreach$(Iterator.scala:943)
	scala.collection.AbstractIterator.foreach(Iterator.scala:1431)
	scala.collection.IterableLike.foreach(IterableLike.scala:74)
	scala.collection.IterableLike.foreach$(IterableLike.scala:73)
	scala.collection.AbstractIterable.foreach(Iterable.scala:56)
	scala.collection.TraversableLike.map(TraversableLike.scala:286)
	scala.collection.TraversableLike.map$(TraversableLike.scala:279)
	scala.collection.AbstractTraversable.map(Traversable.scala:108)
	com.spotify.scio.extra.sorter.syntax.SorterOps$$anonfun$sortValues$1$$anonfun$apply$1.apply(SCollectionSyntax.scala:68)
	com.spotify.scio.extra.sorter.syntax.SorterOps$$anonfun$sortValues$1$$anonfun$apply$1.apply(SCollectionSyntax.scala:68)
	com.spotify.scio.util.Functions$$anon$8.processElement(Functions.scala:278)
	com.spotify.scio.util.Functions$$anon$8$DoFnInvoker.invokeProcessElement(Unknown Source)

The SorterOps line in question is where we convert the Scala iterable to a Java one: https://github.com/spotify/scio/blob/v0.14.3/scio-extra/src/main/scala/com/spotify/scio/extra/sorter/syntax/SCollectionSyntax.scala#L68

Can we make Scio's sortValues more memory-efficient, ideally by not materializing the output of GroupByKey, which may be a lazy iterable? One option is a new API, .groupByKeyAndSortValues, that applies GBK + SortValues directly, without translating back and forth between Java and Scala iterables.
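To illustrate the kind of change this is asking about, here is a minimal sketch of a lazy Java-iterable wrapper that applies a mapping function only as elements are pulled, instead of building a full collection the way a strict .map does. It is plain Scala with no Beam or Scio dependency. The names LazyIterableSketch and lazyMap are hypothetical and are not part of Scio, and this is not necessarily how a fix would be implemented.

```scala
import java.lang.{Iterable => JIterable}
import java.util.{Arrays => JArrays, Iterator => JIterator}

object LazyIterableSketch {

  // Hypothetical helper (not a Scio API): wraps a Java Iterable so that `f`
  // is applied one element at a time during iteration. Nothing is buffered,
  // so memory use does not grow with the number of values.
  def lazyMap[A, B](underlying: JIterable[A])(f: A => B): JIterable[B] =
    new JIterable[B] {
      override def iterator(): JIterator[B] = {
        val it = underlying.iterator()
        new JIterator[B] {
          override def hasNext(): Boolean = it.hasNext()
          override def next(): B = f(it.next())
        }
      }
    }

  def main(args: Array[String]): Unit = {
    var calls = 0
    val source: JIterable[String] = JArrays.asList("a", "bb", "ccc")

    // Wrapping alone does not touch any element.
    val mapped = lazyMap(source) { s => calls += 1; s.length }
    assert(calls == 0)

    // Each call to next() maps exactly one element.
    val it = mapped.iterator()
    val first = it.next()
    assert(first == 1)
    assert(calls == 1)
  }
}
```

A wrapper like this only helps if whatever consumes the iterable walks it in a single pass. Whether it could be dropped into sortValues as-is is an open question for the implementation.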


clairemcginty added the enhancement New feature or request label Apr 16, 2024
