-
Notifications
You must be signed in to change notification settings - Fork 158
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: add ALL/DISTINCT modifier for all set operation types #708
Conversation
…stinct/all SQL modifiers
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The protobuf changes look good to me, but we should updated the documentation for SetRel as well
substrait/site/docs/relations/logical_relations.md
Lines 248 to 278 in bc4d6fb
## Set Operation | |
The set operation encompasses several set-level operations that support combining datasets, possibly excluding records based on various types of record level matching. | |
| Signature | Value | | |
| -------------------- | ------------------------------------------------------------ | | |
| Inputs | 2 or more | | |
| Outputs | 1 | | |
| Property Maintenance | Maintains distribution if all inputs have the same ordinal distribution. Orderedness is not maintained. | | |
| Direct Output Order | The field order of the inputs. All inputs must have identical field *types*, but field nullabilities may vary. | | |
### Set Properties | |
| Property | Description | Required | | |
| ------------------ | --------------------------------- | --------------------- | | |
| Primary Input | The primary input of the dataset. | Required | | |
| Secondary Inputs | One or more relational inputs. | At least one required | | |
| Set Operation Type | From list below. | Required | | |
### Set Operation Types | |
The set operation type determines both the records that are emitted and the type of the output record. | |
| Property | Description | Output Nullability | |
| ----------------------- | ------------------------------------------------------------------------------------------------------------- | ----------------------------- | | |
| Minus (Primary) | Returns all records from the primary input excluding any matching records from secondary inputs. | The same as the primary input. | |
| Minus (Multiset) | Returns all records from the primary input excluding any records that are included in *all* secondary inputs. | The same as the primary input. | |
| Intersection (Primary) | Returns all records from the primary input that match at least one record from *any* secondary inputs. | If a field is nullable in the primary input and in any of the secondary inputs, it is nullable in the output. | |
| Intersection (Multiset) | Returns all records from the primary input that match at least one record from *all* secondary inputs. | If a field is required in any of the inputs, it is required in the output. | |
| Union Distinct | Returns all the records from each set, removing any rows that are duplicated (within or across sets). | If a field is nullable in any of the inputs, it is nullable in the output. | |
| Union All | Returns all records from each set, allowing duplicates. | If a field is nullable in any of the inputs, it is nullable in the output. | |
A few thoughts:
|
Spark Connect puts the distinctness on the function instead of the relation: |
We already have distinct on the function ( substrait/proto/substrait/algebra.proto Line 1566 in bc4d6fb
Also, it doesn't help with set ops since they aren't functions.
This would be good to document as it would have been a surprise to me (though I just double checked and that does match the postgres definition). The current description (as I interpret it) is NOT deduplicating (at least, it wouldn't deduplicate the primary). We should be make sure Substrait is clear enough to not require knowledge of the SQL spec.
What does
If I issue the query
Why does (and indeed, the query |
Agree on keeping existing things working and less refactoring. It will be a pretty small change then
@westonpace You might be thinking of Cartesian product! I'm pretty sure for
we would get:
because although 1 has two combinations the minimum times it appears in both tables is once so only one 1 in the output. For distinct, duplicates are just removed after:
|
Thanks, I agree that matches my testing. Let's make sure to include that in the description when the spec part is updated. I appreciate the explanation! |
Like for @westonpace, this is was also a surprise for me based on my reading of the Substrait spec. For example for
I assumed all records here included dupes (i.e ALL not DISTINCT). If we want to treat @kadinrabo TIL about the INTERSECT ALL behaviour. Per the spec as you indicated:
we should probably make sure the spec docs match the EXCEPT ALL and INTERSECT ALL behaviour described here.
@jacques-n that sounds reasonable. I was hoping to avoid a future were we end up with something like: enum SetOp {
SET_OP_UNSPECIFIED = 0;
SET_OP_MINUS_PRIMARY = 1;
SET_OP_MINUS_PRIMARY_ALL = ???;
SET_OP_MINUS_MULTISET = 2;
SET_OP_MINUS_MULTISET_ALL = ???;
SET_OP_INTERSECTION_PRIMARY = 3;
SET_OP_INTERSECTION_PRIMARY_ALL = ???;
SET_OP_INTERSECTION_MULTISET = 4;
SET_OP_INTERSECTION_MULTISET_ALL = ???;
SET_OP_UNION_DISTINCT = 5;
SET_OP_UNION_ALL = 6;
} with two enum values per set operation, but that may not be a huge deal and also means we can defer defining the meaning of |
proto/substrait/algebra.proto
Outdated
@@ -338,6 +338,8 @@ message SetRel { | |||
SET_OP_INTERSECTION_MULTISET = 4; | |||
SET_OP_UNION_DISTINCT = 5; | |||
SET_OP_UNION_ALL = 6; | |||
SET_OP_MINUS_DISTINCT = 7; | |||
SET_OP_INTERSECTION_DISTINCT = 8; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
to me, the changes here and below are the opposite of what we were saying.
If default is deduplicating, then you should be introducing MINUS ALL, not MINUS DISTINCT, etc.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Defining it how @vbarua suggested makes most sense to me too I was just trying to minimize the effect of my changes because I also initially interpreted Minus (Primary)
to include duplicates according to the current spec.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm 👍 for treating SET_OP_MINUS_PRIMARY
, SET_OP_MINUS_MULTISET
, SET_OP_INTERSECTION_PRIMARY
and SET_OP_INTERSECTION_MULTISET
as deduplicating that makes more sense although it's longer
Yep
Agreed |
@vbarua , if we were starting from scratch I would agree. Since we already have a bunch of people writing this, I'm inclined to stick with the current pattern. Besides, enums are cheap, right? :D |
I added the new enum values at the end to keep the existing ones unchanged for people already referencing them |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
To check something, does it make sense for our versions of INTERSECT ALL and MINUS ALL to follow from the SQL spec?
Intersection All looks at all occurrences of values in both tables, but the output is limited by the smallest number of matches between the two: | ||
``` | ||
{1, 3, 2, 2, 2} INTERSECTION ALL {1, 1, 2, 3, 2} === {1, 3, 2, 2} | ||
{1, 3, 2, 2, 2} INTERSECTION DISTINCT {1, 1, 2, 3, 2} === {1, 3, 2} |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In my opinion this should be inlined into the description, and we should also have updated descriptions for
- Minus All
- Minus Multiset All
- Intersection All
- Intersection Multiset All
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Overall looks really good. I noticed one small bit of weirdness around the INTERSECTION ops. Left a big comment about it, but would also be interested to see if @jacques-n @westonpace @EpsilonPrime have any additional input on it.
| Minus (Primary) | Returns all records from the primary input excluding any matching rows from secondary inputs, removing duplicate rows within or across sets.<br/>Each value is treated as a unique member of the set, so duplicates in the first set don’t affect the result. | MINUS<br/> m: {1, 2, 2, 3, 3, 3, 4}<br/> n1: {1, 2}<br/> n2: {3}<br/>YIELDS<br/>{4} | The same as the primary input. | ||
| Minus (Primary All) | Returns all records from the primary input excluding any matching records from secondary inputs.<br/>For each specific record returned, the output contains max(0, m - sum(n1, n2, …, n)) copies. | MINUS ALL<br/> m: {1, 2, 2, 3, 3, 3, 3}<br/> n1: {1, 2, 3, 4}<br/> n2: {3}<br/>YIELDS<br/>{2, 3, 3} | The same as the primary input. | ||
| Minus (Multiset) | Returns all records from the primary input excluding any records that are included in *all* secondary inputs. | MINUS MULTISET<br/> m: {1, 2, 3, 4}<br/> n1: {1, 2}<br/> n2: {1, 2, 3}<br/>YIELDS<br/>{4} | The same as the primary input. | ||
| Intersection (Primary) | Returns all records from the primary input that are present in every secondary input, removing duplicate rows within or across sets. | INTERSECT<br/> m: {1, 2, 2, 3, 3, 3, 4}<br/> n1: {1, 2, 3, 5}<br/> n2: {2, 3, 6}<br/>YIELDS<br/>{2, 3} | If a field is nullable in the primary input and in any of the secondary inputs, it is nullable in the output. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Returns all records from the primary input that are present in every secondary input.
This used to be any secondary input and not every secondary input, so this changes the definition beyond just making deduplication explicit.
I'm also realising that while
<A> INTERSECT <B>
can map to SET_OP_INTERSECTION_PRIMARY
<A> INTERSECT <B> INTERSECT <C>`
doesn't, because the behaviour of chained intersects like this is consistent with SET_OP_INTERSECTION_MULTISET
. SQL INTERSECT
should generally map to this.
Taking this further,
<A> INTERSECT ALL <B> INTERSECT ALL <C>
would need to be mapped to something like
SET_OP_INTERSECTION_MULTISET_ALL
instead of
SET_OP_INTERSECTION_PRIMARY_ALL
I think it might make sense to define
SET_OP_INTERSECTION_MULTISET_ALL
and not
SET_OP_INTERSECTION_PRIMARY_ALL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that the new definition of Intersection (Primary)
is incorrect.
Why aren't we adding both SET_OP_INTERSECTION_MULTISET_ALL
and SET_OP_INTERSECTION_PRIMARY_ALL
I agree with your logic that, if we are adding only one, MULTISET_ALL
makes more sense.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW: Ah, I think I see now, the duplicate handling behavior of SET_OP_INTERSECTION_PRIMARY_ALL
is ambiguous. I could interpret it as:
min(m, max(n1, n2, ..., n))
or...
min(m, sum(n1, n2, ..., n))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm inclined to not handle SET_OP_INTERSECTION_PRIMARY_ALL
as part of this PR because I think figuring out which of those 2 behaviours is the "correct" one seems like it's own thing.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree but to confirm: the only new additions will be SET_OP_MINUS_PRIMARY_ALL
and SET_OP_INTERSECTION_MULTISET_ALL
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Agree but to confirm: the only new additions will be SET_OP_MINUS_PRIMARY_ALL and SET_OP_INTERSECTION_MULTISET_ALL
Yes, we should leave the other variants for their own PR (if anyone ends up needing/wanting them).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree with @vbarua 's logic. Let's fix the definition of Intersection (Primary)
and change Intersection (Primary All)
into Intersection (Multiset All)
.
| Minus (Primary) | Returns all records from the primary input excluding any matching rows from secondary inputs, removing duplicate rows within or across sets.<br/>Each value is treated as a unique member of the set, so duplicates in the first set don’t affect the result. | MINUS<br/> m: {1, 2, 2, 3, 3, 3, 4}<br/> n1: {1, 2}<br/> n2: {3}<br/>YIELDS<br/>{4} | The same as the primary input. | ||
| Minus (Primary All) | Returns all records from the primary input excluding any matching records from secondary inputs.<br/>For each specific record returned, the output contains max(0, m - sum(n1, n2, …, n)) copies. | MINUS ALL<br/> m: {1, 2, 2, 3, 3, 3, 3}<br/> n1: {1, 2, 3, 4}<br/> n2: {3}<br/>YIELDS<br/>{2, 3, 3} | The same as the primary input. | ||
| Minus (Multiset) | Returns all records from the primary input excluding any records that are included in *all* secondary inputs. | MINUS MULTISET<br/> m: {1, 2, 3, 4}<br/> n1: {1, 2}<br/> n2: {1, 2, 3}<br/>YIELDS<br/>{4} | The same as the primary input. | ||
| Intersection (Primary) | Returns all records from the primary input that are present in every secondary input, removing duplicate rows within or across sets. | INTERSECT<br/> m: {1, 2, 2, 3, 3, 3, 4}<br/> n1: {1, 2, 3, 5}<br/> n2: {2, 3, 6}<br/>YIELDS<br/>{2, 3} | If a field is nullable in the primary input and in any of the secondary inputs, it is nullable in the output. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I agree that the new definition of Intersection (Primary)
is incorrect.
Why aren't we adding both SET_OP_INTERSECTION_MULTISET_ALL
and SET_OP_INTERSECTION_PRIMARY_ALL
I agree with your logic that, if we are adding only one, MULTISET_ALL
makes more sense.
| Minus (Primary) | Returns all records from the primary input excluding any matching rows from secondary inputs, removing duplicate rows within or across sets.<br/>Each value is treated as a unique member of the set, so duplicates in the first set don’t affect the result. | MINUS<br/> m: {1, 2, 2, 3, 3, 3, 4}<br/> n1: {1, 2}<br/> n2: {3}<br/>YIELDS<br/>{4} | The same as the primary input. | ||
| Minus (Primary All) | Returns all records from the primary input excluding any matching records from secondary inputs.<br/>For each specific record returned, the output contains max(0, m - sum(n1, n2, …, n)) copies. | MINUS ALL<br/> m: {1, 2, 2, 3, 3, 3, 3}<br/> n1: {1, 2, 3, 4}<br/> n2: {3}<br/>YIELDS<br/>{2, 3, 3} | The same as the primary input. | ||
| Minus (Multiset) | Returns all records from the primary input excluding any records that are included in *all* secondary inputs. | MINUS MULTISET<br/> m: {1, 2, 3, 4}<br/> n1: {1, 2}<br/> n2: {1, 2, 3}<br/>YIELDS<br/>{4} | The same as the primary input. | ||
| Intersection (Primary) | Returns all records from the primary input that are present in every secondary input, removing duplicate rows within or across sets. | INTERSECT<br/> m: {1, 2, 2, 3, 3, 3, 4}<br/> n1: {1, 2, 3, 5}<br/> n2: {2, 3, 6}<br/>YIELDS<br/>{2, 3} | If a field is nullable in the primary input and in any of the secondary inputs, it is nullable in the output. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
FWIW: Ah, I think I see now, the duplicate handling behavior of SET_OP_INTERSECTION_PRIMARY_ALL
is ambiguous. I could interpret it as:
min(m, max(n1, n2, ..., n))
or...
min(m, sum(n1, n2, ..., n))
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Some minor comments and one nullability change, but otherwise this looks good to me.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Thanks for the quick updated @kadinrabo
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
One small nit but otherwise looks good. Thanks for doing this!
Co-authored-by: Weston Pace <[email protected]>
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Looks good. Thanks for the helpful documentation!
Today the SetRel spec contains
Intersection
andMinus
supporting 1 or more secondary inputs, but it doesn’t consider distinct/all.Union distinct/all were considered common enough operations to be included, but this PR is necessary to support sql intersect/except distinct/all without performing the set op and distinct/all as separate stages.
Eventually this would lead to the deprecation of union distinct/all. My initial inquiry