feat: add ExpandRel support to core and spark #295

andrew-coleman · 2024-09-20T09:10:01Z

This PR implements the Expand relation in the core and spark modules.
It is required in the spark module to support optimised queries that contain distinct aggregations, and increases the number of successful test cases in the TPCDS suite.

andrew-coleman · 2024-09-20T09:32:51Z

osv-scanner failed due to a vulnerability in protobuf-java, so I've updated it to the latest version, 3.25.5.
https://osv.dev/vulnerability/GHSA-735f-pc8j-v9w8

vbarua

Left some questions and some comments.

vbarua · 2024-09-20T15:48:58Z

core/src/main/java/io/substrait/relation/Rel.java

@@ -21,6 +22,8 @@ public interface Rel {

  List<Rel> getInputs();

+  Optional<RelCommon.Hint> getHint();


We shouldn't include references to protobuf classes in this layer. To keep this layer independent of the protobufs we should introduce a Hint class and map it to and from protobuf.
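
A minimal sketch of the separation suggested above, not the code that was eventually merged. `WireHint` is a hypothetical stand-in for the generated `RelCommon.Hint` message, and `PlainHint` and `HintConverter` are illustrative names; only the idea (a plain Java type in the relation layer, with conversion kept at the boundary) comes from the comment.

```java
import java.util.ArrayList;
import java.util.Collections;
import java.util.List;
import java.util.Optional;

// Hypothetical stand-in for the generated protobuf message (RelCommon.Hint).
// The real class is produced by protoc and is not reproduced here.
final class WireHint {
  final List<String> outputNames;
  final String alias; // null means "not set" in this stand-in

  WireHint(List<String> outputNames, String alias) {
    this.outputNames = Collections.unmodifiableList(new ArrayList<>(outputNames));
    this.alias = alias;
  }
}

// Plain Java hint type that a Rel interface could expose without
// referencing any protobuf classes. The name is an assumption.
final class PlainHint {
  private final List<String> outputNames;
  private final Optional<String> alias;

  PlainHint(List<String> outputNames, Optional<String> alias) {
    this.outputNames = Collections.unmodifiableList(new ArrayList<>(outputNames));
    this.alias = alias;
  }

  List<String> getOutputNames() {
    return outputNames;
  }

  Optional<String> getAlias() {
    return alias;
  }
}

// Conversion lives at the boundary, so only this class knows about both types.
final class HintConverter {
  private HintConverter() {}

  static WireHint toWire(PlainHint hint) {
    return new WireHint(hint.getOutputNames(), hint.getAlias().orElse(null));
  }

  static PlainHint fromWire(WireHint wire) {
    return new PlainHint(wire.outputNames, Optional.ofNullable(wire.alias));
  }
}
```

With this shape, `Rel.getHint()` could return `Optional<PlainHint>`, and only the converters would need to import protobuf types.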

core/src/test/java/io/substrait/type/proto/ExpandRelRoundtripTest.java

vbarua · 2024-09-20T16:03:38Z

spark/src/main/scala/io/substrait/spark/logical/ToSubstraitRel.scala

@@ -72,7 +71,8 @@ class ToSubstraitRel extends AbstractLogicalPlanVisitor with Logging {
    val substraitExps = expression.aggregateFunction.children.map(toExpression(output))
    val invocation =
      SparkExtension.toAggregateFunction.apply(expression, substraitExps)
-    relation.Aggregate.Measure.builder.function(invocation).build()
+    val filter = expression.filter map toExpression(output)
+    relation.Aggregate.Measure.builder.function(invocation).preMeasureFilter(Optional.ofNullable(filter.orNull)).build()


I assume including the filter is an opportunistic bug fix?

I don't think it needs to be its own PR, but in the future, if you could make it a separate commit with a fix(spark) style message, it would be easier for me to notice and call out such changes in the release log.

Sure, I've split this into a separate commit.

core/src/main/java/io/substrait/dsl/SubstraitBuilder.java

vbarua · 2024-09-20T16:26:39Z

spark/src/main/scala/io/substrait/spark/expression/FunctionMappings.scala

@@ -67,6 +67,7 @@ class FunctionMappings {
    s[Count]("count"),
    s[Min]("min"),
    s[Max]("max"),
+    s[First]("any_value"),


While you can treat any_value as first, I don't think you can treat first as any_value. Is this mapping bi-directional?

Sure, my first proposal was to add the first function to functions_aggregate_general.yaml, but this suggestion was given as an alternative.
substrait-io/substrait#697 (comment)

Ah, if Spark itself handles First as any_value this makes sense. Though truth be told if I'm being pedantic this sounds like Spark doesn't actually support First correctly. Not a Substrait problem though.

core/src/main/java/io/substrait/relation/Expand.java

Signed-off-by: Andrew Coleman <[email protected]>

vbarua

Left some minor comments, can do a full pass tomorrow. I want to get this in this week.

vbarua · 2024-09-26T00:51:28Z

core/src/main/java/io/substrait/hint/Hint.java

+
+  public RelCommon.Hint toProto() {
+    var builder = RelCommon.Hint.newBuilder().addAllOutputNames(getOutputNames());
+    builder.setAlias(getAlias().orElse(""));


I think it's preferable to leave this unset instead of setting it to "", otherwise we can't distinguish between an unset alias and the empty string.
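
A small sketch of the "leave it unset" approach. `AliasBuilderSketch` is a hypothetical stand-in for a generated protobuf builder (its `aliasWasSet()` method exists only for illustration); whether the real message can report field presence depends on how the field is declared in the .proto file, which is not assumed here.

```java
import java.util.Optional;

// Hypothetical builder standing in for a generated protobuf builder.
final class AliasBuilderSketch {
  private String alias; // stays null until setAlias is called

  AliasBuilderSketch setAlias(String value) {
    this.alias = value;
    return this;
  }

  // Illustration only: reports whether setAlias was ever called.
  boolean aliasWasSet() {
    return alias != null;
  }
}

final class AliasMapping {
  private AliasMapping() {}

  // Only touch the builder when an alias is actually present, instead of
  // writing "" for the absent case.
  static AliasBuilderSketch apply(AliasBuilderSketch builder, Optional<String> alias) {
    alias.ifPresent(builder::setAlias);
    return builder;
  }
}
```

An empty `Optional` leaves the builder untouched, while `Optional.of("")` still records an explicit empty alias, so the two cases stay distinguishable on the Java side.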

core/src/main/java/io/substrait/hint/Hint.java

vbarua · 2024-09-26T00:57:04Z

core/src/main/java/io/substrait/relation/Expand.java

+    return TypeCreator.of(initial.nullable())
+        .struct(
+            Stream.concat(
+                initial.fields().stream(), getFields().stream().map(ExpandField::getType)));


From https://github.com/substrait-io/substrait/blob/1f3354d9f0f8b4425e98623f34d1e4578e2142bd/proto/substrait/algebra.proto#L420-L425

// Duplicates records by emitting one or more rows per input row. The number of rows emitted per
// input row is the same for all input rows.
//
// In addition to a field being emitted per input field an extra int64 field is emitted which
// contains a zero-indexed ordinal corresponding to the duplicate definition.

It sounds like the output record type consists of:

All input fields

A single extra int64 field

From what I'm seeing, you're outputting all input fields + one field per expansion field. Is that correct? I'm not super familiar with this relation.

Yes, I think you're correct
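
The reading of the protobuf comment discussed above can be sketched as follows. This is an illustration only: type names are plain strings rather than Substrait `Type` objects, and `ExpandRecordTypeSketch` is a hypothetical helper, not part of the PR.

```java
import java.util.ArrayList;
import java.util.List;

final class ExpandRecordTypeSketch {
  private ExpandRecordTypeSketch() {}

  // Output record type under the protobuf comment's reading:
  // every input field, followed by one extra int64 ordinal field.
  static List<String> outputTypes(List<String> inputTypes) {
    List<String> out = new ArrayList<>(inputTypes);
    out.add("i64"); // zero-indexed ordinal of the duplicate definition
    return out;
  }
}
```

For an input of `[string, i32]` this yields `[string, i32, i64]`, regardless of how many expansion fields the relation declares.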

core/src/main/java/io/substrait/relation/ProtoRelConverter.java

core/src/main/java/io/substrait/relation/RelProtoConverter.java

Signed-off-by: Andrew Coleman <[email protected]>

vbarua

Left some minor comments, but nothing blocking. I will merge this before Friday at latest (I've set a reminder for myself).

vbarua · 2024-09-26T19:59:35Z

core/src/main/java/io/substrait/relation/Expand.java

+  public Type.Struct deriveRecordType() {
+    Type.Struct initial = getInput().getRecordType();
+    return TypeCreator.of(initial.nullable())
+        .struct(Stream.concat(initial.fields().stream(), Stream.of(TypeCreator.REQUIRED.I64)));


I just noticed that the protobuf comment

In addition to a field being emitted per input field an extra int64 field is emitted

disagrees with the written spec:

The expand fields followed by an i32 column describing the index of the duplicate that the row is derived from.

We should reconcile this eventually. I'm fine with picking one or the other for now, but we may have to change it in the future.

Made an issue for this: substrait-io/substrait#714

vbarua · 2024-09-26T20:03:45Z

core/src/main/java/io/substrait/relation/ProtoRelConverter.java

+    builder
+        .commonExtension(optionalAdvancedExtension(rel.getCommon()))
+        .remap(optionalRelmap(rel.getCommon()))
+        .hint(optionalHint(rel.getCommon()));


Made an issue to track handling of Hint information for all relations: #297

vbarua · 2024-09-26T20:10:01Z

spark/src/main/scala/io/substrait/spark/logical/ToLogicalPlan.scala

@@ -82,11 +81,15 @@ class ToLogicalPlan(spark: SparkSession) extends DefaultRelVisitor[LogicalPlan]
        )
        throw new IllegalArgumentException(msg)
      })
+
+    val filter = Option(measure.getPreMeasureFilter.orElse(null))
+      .map(_.accept(expressionConverter))


Note to myself to include this fix in the release notes.

vbarua · 2024-09-26T20:11:45Z

spark/src/main/scala/io/substrait/spark/logical/ToLogicalPlan.scala

@@ -193,6 +196,27 @@ class ToLogicalPlan(spark: SparkSession) extends DefaultRelVisitor[LogicalPlan]
    }
  }

+  override def visit(expand: relation.Expand): LogicalPlan = {
+    val child = expand.getInput.accept(this)
+    val names = expand.getHint.get().getOutputNames.asScala


What happens if this is not set? Can you generate names for this on the fly?

I'm okay with leaving this as is for now, but generally when consuming a plan hints should be entirely optional for producers to set.
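
One way to make the hint optional for consumers, sketched in Java under assumptions: `OutputNamesSketch` and the `col0`, `col1`, ... naming scheme are hypothetical and were not part of the PR.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Optional;

final class OutputNamesSketch {
  private OutputNamesSketch() {}

  // Use the hinted names when the producer supplied them; otherwise
  // generate positional names so that consumption never fails.
  static List<String> resolve(Optional<List<String>> hintedNames, int fieldCount) {
    return hintedNames.orElseGet(() -> generate(fieldCount));
  }

  private static List<String> generate(int fieldCount) {
    List<String> names = new ArrayList<>();
    for (int i = 0; i < fieldCount; i++) {
      names.add("col" + i); // hypothetical naming scheme
    }
    return names;
  }
}
```

The same pattern would replace the unconditional `getHint.get()` call with a lookup that falls back to generated names when no hint is present.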

feat(pojo): initial support for Hint messages
feat(pojo): builder support for ExpandRel
feat(spark): add mapping for any_value function
feat(spark): add support for consuming NullLiteral expressions
feat(spark): handle filter field on Measure

andrew-coleman force-pushed the expand_relation branch from 4095757 to 6ad28ae on September 20, 2024 09:29

vbarua requested changes Sep 20, 2024

fix(spark): set the aggregate filter expression

79f3779

Signed-off-by: Andrew Coleman <[email protected]>

andrew-coleman force-pushed the expand_relation branch from 6ad28ae to 18bd43b on September 24, 2024 10:59

andrew-coleman requested a review from vbarua September 24, 2024 11:56

vbarua reviewed Sep 26, 2024

feat: add ExpandRel support to core and spark

35fde68

Signed-off-by: Andrew Coleman <[email protected]>

andrew-coleman force-pushed the expand_relation branch from 18bd43b to 35fde68 on September 26, 2024 13:54

andrew-coleman requested a review from vbarua September 26, 2024 20:00

vbarua mentioned this pull request Sep 26, 2024

handle hints for all relation types #297

Open

vbarua approved these changes Sep 26, 2024

vbarua merged commit 32fea18 into substrait-io:main Sep 27, 2024
12 checks passed

andrew-coleman deleted the expand_relation branch September 30, 2024 08:06
