feat: add ALL/DISTINCT modifier for all set operation types #708

kadinrabo · 2024-09-18T13:26:52Z

Today the SetRel spec contains Intersection and Minus supporting 1 or more secondary inputs, but it doesn’t consider distinct/all.

Union distinct/all were considered common enough operations to be included, but this PR is necessary to support sql intersect/except distinct/all without performing the set op and distinct/all as separate stages.

Eventually this would lead to the deprecation of union distinct/all. My initial inquiry

…stinct/all SQL modifiers

CLAassistant · 2024-09-18T13:26:58Z

All committers have signed the CLA.

proto/substrait/algebra.proto

vbarua

The protobuf changes look good to me, but we should updated the documentation for SetRel as well

substrait/site/docs/relations/logical_relations.md

Lines 248 to 278 in bc4d6fb

    
           ## Set Operation 
        
           The set operation encompasses several set-level operations that support combining datasets, possibly excluding records based on various types of record level matching. 
        
           | Signature            | Value                                                        | 
        
           | -------------------- | ------------------------------------------------------------ | 
        
           | Inputs               | 2 or more                                                    | 
        
           | Outputs              | 1                                                            | 
        
           | Property Maintenance | Maintains distribution if all inputs have the same ordinal distribution. Orderedness is not maintained. | 
        
           | Direct Output Order  | The field order of the inputs. All inputs must have identical field *types*, but field nullabilities may vary. | 
        
           ### Set Properties 
        
           | Property           | Description                       | Required              | 
        
           | ------------------ | --------------------------------- | --------------------- | 
        
           | Primary Input      | The primary input of the dataset. | Required              | 
        
           | Secondary Inputs   | One or more relational inputs.    | At least one required | 
        
           | Set Operation Type | From list below.                  | Required              | 
        
           ### Set Operation Types 
        
           The set operation type determines both the records that are emitted and the type of the output record. 
        
           | Property                | Description                                                                                                   | Output Nullability 
        
           | ----------------------- | ------------------------------------------------------------------------------------------------------------- | ----------------------------- | 
        
           | Minus (Primary)         | Returns all records from the primary input excluding any matching records from secondary inputs.              | The same as the primary input. 
        
           | Minus (Multiset)        | Returns all records from the primary input excluding any records that are included in *all* secondary inputs. | The same as the primary input. 
        
           | Intersection (Primary)  | Returns all records from the primary input that match at least one record from *any* secondary inputs.        | If a field is nullable in the primary input and in any of the secondary inputs, it is nullable in the output. 
        
           | Intersection (Multiset) | Returns all records from the primary input that match at least one record from *all* secondary inputs.        | If a field is required in any of the inputs, it is required in the output. 
        
           | Union Distinct          | Returns all the records from each set, removing any rows that are duplicated (within or across sets).         | If a field is nullable in any of the inputs, it is nullable in the output. 
        
           | Union All               | Returns all records from each set, allowing duplicates.                                                       | If a field is nullable in any of the inputs, it is nullable in the output. |

jacques-n · 2024-09-19T01:21:17Z

A few thoughts:

I'm not a fan of refactoring where it isn't necessary (e.g. creating something that needs to impls to handle compatibility cases). In this case I would prefer to keep the existing enums and just add more. That means everything in existence already would work without changes (and no compat handling code).
Because MINUS/EXCEPT and INTERSECT in SQL are deduplicating by convention (unless specifically marked all), I believe the intention for the existing ones should be the same.
The spec update should drive the protobuf update. (What @vbarua said about lets make sure the spec is clear.)
I do wonder if we should define a Distinct relation, as well. Doesn't need to be addressed here but the Aggregate with all columns and no measures seems excessive.

EpsilonPrime · 2024-09-19T01:28:23Z

Spark Connect puts the distinctness on the function instead of the relation:

https://github.com/apache/spark/blob/branch-3.5/connector/connect/common/src/main/protobuf/spark/connect/expressions.proto#L236

westonpace · 2024-09-19T01:57:19Z

Spark Connect puts the distinctness on the function instead of the relation:

We already have distinct on the function (

substrait/proto/substrait/algebra.proto

Line 1566 in bc4d6fb

AggregationInvocation invocation = 6;

). I think @jacques-n point was more that a "distinct with no functions" is a common enough use case to warrant a relation.

Also, it doesn't help with set ops since they aren't functions.

Because MINUS/EXCEPT and INTERSECT in SQL are deduplicating by convention (unless specifically marked all), I believe the intention for the existing ones should be the same.

This would be good to document as it would have been a surprise to me (though I just double checked and that does match the postgres definition). The current description (as I interpret it) is NOT deduplicating (at least, it wouldn't deduplicate the primary). We should be make sure Substrait is clear enough to not require knowledge of the SQL spec.

Union distinct/all were considered common enough operations to be included, but this PR is necessary to support sql intersect/except distinct/all without performing the set op and distinct/all as separate stages.

What does INTERSECT ALL do in the presence of duplicates? It seems ill defined in postgres. Given the table:

a	b
3	1
3	1
2	2
2	3
2	2

If I issue the query SELECT a FROM table INTERSECT ALL SELECT b from table I get:

RESULT
3
2
2

Why does 2 show up twice and 3 only shows up once? If I were to consider all combinations I think I would have two combinations of (3, 3) and six combinations of (2, 2).

(and indeed, the query SELECT t1.a, t2.b FROM table as t1 CROSS JOIN table as t2 WHERE t1.a = t2.b has 8 results)

kadinrabo · 2024-09-19T16:24:32Z

Agree on keeping existing things working and less refactoring. It will be a pretty small change then

SET_OP_MINUS_DISTINCT = 7;
SET_OP_INTERSECTION_DISTINCT = 8;

Why does 2 show up twice and 3 only shows up once? If I were to consider all combinations I think I would have two combinations of (3, 3) and six combinations of (2, 2).

@westonpace You might be thinking of Cartesian product! I'm pretty sure for INTERSECT ALL, the number of times a value appears in the final output is determined by the column with the fewer occurrences of that value in both tables. It looks at all occurrences of the values in both tables, but the output is limited by the smallest number of matches between the two. For example for this:

a	b
1	1
3	1
2	2
2	3
2	2

we would get:

RESULT
1
3
2
2

because although 1 has two combinations the minimum times it appears in both tables is once so only one 1 in the output. For distinct, duplicates are just removed after:

RESULT
1
3
2

westonpace · 2024-09-19T16:46:54Z

It looks at all occurrences of the values in both tables, but the output is limited by the smallest number of matches between the two

Thanks, I agree that matches my testing. Let's make sure to include that in the description when the spec part is updated. I appreciate the explanation!

vbarua · 2024-09-19T16:58:44Z

Because MINUS/EXCEPT and INTERSECT in SQL are deduplicating by convention (unless specifically marked all), I believe the intention for the existing ones should be the same.

Like for @westonpace, this is was also a surprise for me based on my reading of the Substrait spec. For example for Minus (Primary)

Returns all records from the primary input excluding any matching records from secondary inputs.

I assumed all records here included dupes (i.e ALL not DISTINCT).

If we want to treat SET_OP_MINUS_PRIMARY, SET_OP_MINUS_MULTISET, SET_OP_INTERSECTION_PRIMARY and SET_OP_INTERSECTION_MULTISET as deduplicating, that sounds reasonable to me but we should make it more explicit in the spec because as written I don't think that's how they behave.

@kadinrabo TIL about the INTERSECT ALL behaviour. Per the spec as you indicated:
https://www.contrib.andrew.cmu.edu/~shadow/sql/sql1992.txt

            b) If a set operator is specified, then the result of applying
              the set operator is a table containing the following rows:

              i) Let R be a row that is a duplicate of some row in T1 or of
                 some row in T2 or both. Let m be the number of duplicates
                 of R in T1 and let n be the number of duplicates of R in
                 T2, where m � 0 and n � 0.

            ...            

            iii) If ALL is specified, then

                 Case:

                 1) If UNION is specified, then the number of duplicates of
                   R that T contains is (m + n).

                 2) If EXCEPT is specified, then the number of duplicates of
                   R that T contains is the maximum of (m - n) and 0.

                 3) If INTERSECT is specified, then the number of duplicates
                   of R that T contains is the minimum of m and n.

we should probably make sure the spec docs match the EXCEPT ALL and INTERSECT ALL behaviour described here.

In this case I would prefer to keep the existing enums and just add more.

@jacques-n that sounds reasonable. I was hoping to avoid a future were we end up with something like:

  enum SetOp {
    SET_OP_UNSPECIFIED = 0;
    SET_OP_MINUS_PRIMARY = 1;
    SET_OP_MINUS_PRIMARY_ALL = ???;
    SET_OP_MINUS_MULTISET = 2;
    SET_OP_MINUS_MULTISET_ALL = ???;
    SET_OP_INTERSECTION_PRIMARY = 3;
    SET_OP_INTERSECTION_PRIMARY_ALL = ???;
    SET_OP_INTERSECTION_MULTISET = 4;
    SET_OP_INTERSECTION_MULTISET_ALL = ???;
    SET_OP_UNION_DISTINCT = 5;
    SET_OP_UNION_ALL = 6;
  }

with two enum values per set operation, but that may not be a huge deal and also means we can defer defining the meaning of SET_OP_MINUS_PRIMARY_ALL and SET_OP_INTERSECTION_MULTISET_ALL to a future time.

jacques-n · 2024-09-19T18:27:23Z

proto/substrait/algebra.proto

@@ -338,6 +338,8 @@ message SetRel {
    SET_OP_INTERSECTION_MULTISET = 4;
    SET_OP_UNION_DISTINCT = 5;
    SET_OP_UNION_ALL = 6;
+    SET_OP_MINUS_DISTINCT = 7;
+    SET_OP_INTERSECTION_DISTINCT = 8;


to me, the changes here and below are the opposite of what we were saying.

If default is deduplicating, then you should be introducing MINUS ALL, not MINUS DISTINCT, etc.

Defining it how @vbarua suggested makes most sense to me too I was just trying to minimize the effect of my changes because I also initially interpreted Minus (Primary) to include duplicates according to the current spec.

I'm 👍 for treating SET_OP_MINUS_PRIMARY, SET_OP_MINUS_MULTISET, SET_OP_INTERSECTION_PRIMARY and SET_OP_INTERSECTION_MULTISET as deduplicating that makes more sense although it's longer

jacques-n · 2024-09-19T18:28:35Z

@jacques-n point was more that a "distinct with no functions" is a common enough use case to warrant a relation

Yep

We should be make sure Substrait is clear enough to not require knowledge of the SQL spec.

Agreed

jacques-n · 2024-09-19T18:43:56Z

I was hoping to avoid a future were we end up with something like... [long list ensues]

@vbarua , if we were starting from scratch I would agree. Since we already have a bunch of people writing this, I'm inclined to stick with the current pattern. Besides, enums are cheap, right? :D

kadinrabo · 2024-09-20T14:50:45Z

I added the new enum values at the end to keep the existing ones unchanged for people already referencing them

vbarua

To check something, does it make sense for our versions of INTERSECT ALL and MINUS ALL to follow from the SQL spec?

cc: @jacques-n @westonpace @EpsilonPrime

vbarua · 2024-09-23T14:57:09Z

site/docs/relations/logical_relations.md

+Intersection All looks at all occurrences of values in both tables, but the output is limited by the smallest number of matches between the two:
+```
+{1, 3, 2, 2, 2} INTERSECTION ALL {1, 1, 2, 3, 2} === {1, 3, 2, 2}
+{1, 3, 2, 2, 2} INTERSECTION DISTINCT {1, 1, 2, 3, 2} === {1, 3, 2}


In my opinion this should be inlined into the description, and we should also have updated descriptions for

Minus All

Minus Multiset All

Intersection All

Intersection Multiset All

vbarua

Overall looks really good. I noticed one small bit of weirdness around the INTERSECTION ops. Left a big comment about it, but would also be interested to see if @jacques-n @westonpace @EpsilonPrime have any additional input on it.

site/docs/relations/logical_relations.md

vbarua · 2024-09-26T18:39:09Z

site/docs/relations/logical_relations.md

+| Minus (Primary)            | Returns all records from the primary input excluding any matching rows from secondary inputs, removing duplicate rows within or across sets.<br/>Each value is treated as a unique member of the set, so duplicates in the first set don’t affect the result. | MINUS<br/>&nbsp;&nbsp;m:  {1, 2, 2, 3, 3, 3, 4}<br/>&nbsp;&nbsp;n1: {1, 2}<br/>&nbsp;&nbsp;n2: {3}<br/>YIELDS<br/>{4}                                            | The same as the primary input.                                                                                                                                                                     
+| Minus (Primary All)        | Returns all records from the primary input excluding any matching records from secondary inputs.<br/>For each specific record returned, the output contains max(0, m - sum(n1, n2, …, n)) copies.                                                             | MINUS ALL<br/>&nbsp;&nbsp;m:  {1, 2, 2, 3, 3, 3, 3}<br/>&nbsp;&nbsp;n1: {1, 2, 3, 4}<br/>&nbsp;&nbsp;n2: {3}<br/>YIELDS<br/>{2, 3, 3}                            | The same as the primary input.
+| Minus (Multiset)           | Returns all records from the primary input excluding any records that are included in *all* secondary inputs.                                                                                                                                                 | MINUS MULTISET<br/>&nbsp;&nbsp;m:  {1, 2, 3, 4}<br/>&nbsp;&nbsp;n1: {1, 2}<br/>&nbsp;&nbsp;n2: {1, 2, 3}<br/>YIELDS<br/>{4}                                      | The same as the primary input.
+| Intersection (Primary)     | Returns all records from the primary input that are present in every secondary input, removing duplicate rows within or across sets.                                                                                                                          | INTERSECT<br/>&nbsp;&nbsp;m:  {1, 2, 2, 3, 3, 3, 4}<br/>&nbsp;&nbsp;n1: {1, 2, 3, 5}<br/>&nbsp;&nbsp;n2: {2, 3, 6}<br/>YIELDS<br/>{2, 3}                         | If a field is nullable in the primary input and in any of the secondary inputs, it is nullable in the output. 


Returns all records from the primary input that are present in every secondary input.

This used to be any secondary input and not every secondary input, so this changes the definition beyond just making deduplication explicit.

I'm also realising that while

<A> INTERSECT <B>

can map to SET_OP_INTERSECTION_PRIMARY

<A> INTERSECT <B> INTERSECT <C>`

doesn't, because the behaviour of chained intersects like this is consistent with SET_OP_INTERSECTION_MULTISET. SQL INTERSECT should generally map to this.

Taking this further,

<A> INTERSECT ALL <B> INTERSECT ALL <C>

would need to be mapped to something like

SET_OP_INTERSECTION_MULTISET_ALL

instead of

SET_OP_INTERSECTION_PRIMARY_ALL

I think it might make sense to define

SET_OP_INTERSECTION_MULTISET_ALL

and not

SET_OP_INTERSECTION_PRIMARY_ALL

I agree that the new definition of Intersection (Primary) is incorrect.

Why aren't we adding both SET_OP_INTERSECTION_MULTISET_ALL and SET_OP_INTERSECTION_PRIMARY_ALL

I agree with your logic that, if we are adding only one, MULTISET_ALL makes more sense.

FWIW: Ah, I think I see now, the duplicate handling behavior of SET_OP_INTERSECTION_PRIMARY_ALL is ambiguous. I could interpret it as:

min(m, max(n1, n2, ..., n))

or...

min(m, sum(n1, n2, ..., n))

I'm inclined to not handle SET_OP_INTERSECTION_PRIMARY_ALL as part of this PR because I think figuring out which of those 2 behaviours is the "correct" one seems like it's own thing.

Agree but to confirm: the only new additions will be SET_OP_MINUS_PRIMARY_ALL and SET_OP_INTERSECTION_MULTISET_ALL

Agree but to confirm: the only new additions will be SET_OP_MINUS_PRIMARY_ALL and SET_OP_INTERSECTION_MULTISET_ALL

Yes, we should leave the other variants for their own PR (if anyone ends up needing/wanting them).

site/docs/relations/logical_relations.md

westonpace

I agree with @vbarua 's logic. Let's fix the definition of Intersection (Primary) and change Intersection (Primary All) into Intersection (Multiset All).

site/docs/relations/logical_relations.md

westonpace · 2024-09-27T16:20:23Z

site/docs/relations/logical_relations.md

+| Minus (Primary)            | Returns all records from the primary input excluding any matching rows from secondary inputs, removing duplicate rows within or across sets.<br/>Each value is treated as a unique member of the set, so duplicates in the first set don’t affect the result. | MINUS<br/>&nbsp;&nbsp;m:  {1, 2, 2, 3, 3, 3, 4}<br/>&nbsp;&nbsp;n1: {1, 2}<br/>&nbsp;&nbsp;n2: {3}<br/>YIELDS<br/>{4}                                            | The same as the primary input.                                                                                                                                                                     
+| Minus (Primary All)        | Returns all records from the primary input excluding any matching records from secondary inputs.<br/>For each specific record returned, the output contains max(0, m - sum(n1, n2, …, n)) copies.                                                             | MINUS ALL<br/>&nbsp;&nbsp;m:  {1, 2, 2, 3, 3, 3, 3}<br/>&nbsp;&nbsp;n1: {1, 2, 3, 4}<br/>&nbsp;&nbsp;n2: {3}<br/>YIELDS<br/>{2, 3, 3}                            | The same as the primary input.
+| Minus (Multiset)           | Returns all records from the primary input excluding any records that are included in *all* secondary inputs.                                                                                                                                                 | MINUS MULTISET<br/>&nbsp;&nbsp;m:  {1, 2, 3, 4}<br/>&nbsp;&nbsp;n1: {1, 2}<br/>&nbsp;&nbsp;n2: {1, 2, 3}<br/>YIELDS<br/>{4}                                      | The same as the primary input.
+| Intersection (Primary)     | Returns all records from the primary input that are present in every secondary input, removing duplicate rows within or across sets.                                                                                                                          | INTERSECT<br/>&nbsp;&nbsp;m:  {1, 2, 2, 3, 3, 3, 4}<br/>&nbsp;&nbsp;n1: {1, 2, 3, 5}<br/>&nbsp;&nbsp;n2: {2, 3, 6}<br/>YIELDS<br/>{2, 3}                         | If a field is nullable in the primary input and in any of the secondary inputs, it is nullable in the output. 


I agree that the new definition of Intersection (Primary) is incorrect.

Why aren't we adding both SET_OP_INTERSECTION_MULTISET_ALL and SET_OP_INTERSECTION_PRIMARY_ALL

I agree with your logic that, if we are adding only one, MULTISET_ALL makes more sense.

site/docs/relations/logical_relations.md

westonpace · 2024-09-27T16:22:14Z

site/docs/relations/logical_relations.md

+| Minus (Primary)            | Returns all records from the primary input excluding any matching rows from secondary inputs, removing duplicate rows within or across sets.<br/>Each value is treated as a unique member of the set, so duplicates in the first set don’t affect the result. | MINUS<br/>&nbsp;&nbsp;m:  {1, 2, 2, 3, 3, 3, 4}<br/>&nbsp;&nbsp;n1: {1, 2}<br/>&nbsp;&nbsp;n2: {3}<br/>YIELDS<br/>{4}                                            | The same as the primary input.                                                                                                                                                                     
+| Minus (Primary All)        | Returns all records from the primary input excluding any matching records from secondary inputs.<br/>For each specific record returned, the output contains max(0, m - sum(n1, n2, …, n)) copies.                                                             | MINUS ALL<br/>&nbsp;&nbsp;m:  {1, 2, 2, 3, 3, 3, 3}<br/>&nbsp;&nbsp;n1: {1, 2, 3, 4}<br/>&nbsp;&nbsp;n2: {3}<br/>YIELDS<br/>{2, 3, 3}                            | The same as the primary input.
+| Minus (Multiset)           | Returns all records from the primary input excluding any records that are included in *all* secondary inputs.                                                                                                                                                 | MINUS MULTISET<br/>&nbsp;&nbsp;m:  {1, 2, 3, 4}<br/>&nbsp;&nbsp;n1: {1, 2}<br/>&nbsp;&nbsp;n2: {1, 2, 3}<br/>YIELDS<br/>{4}                                      | The same as the primary input.
+| Intersection (Primary)     | Returns all records from the primary input that are present in every secondary input, removing duplicate rows within or across sets.                                                                                                                          | INTERSECT<br/>&nbsp;&nbsp;m:  {1, 2, 2, 3, 3, 3, 4}<br/>&nbsp;&nbsp;n1: {1, 2, 3, 5}<br/>&nbsp;&nbsp;n2: {2, 3, 6}<br/>YIELDS<br/>{2, 3}                         | If a field is nullable in the primary input and in any of the secondary inputs, it is nullable in the output. 


FWIW: Ah, I think I see now, the duplicate handling behavior of SET_OP_INTERSECTION_PRIMARY_ALL is ambiguous. I could interpret it as:

min(m, max(n1, n2, ..., n))

or...

min(m, sum(n1, n2, ..., n))

site/docs/relations/logical_relations.md

vbarua

Some minor comments and one nullability change, but otherwise this looks good to me.

vbarua

Thanks for the quick updated @kadinrabo

westonpace

One small nit but otherwise looks good. Thanks for doing this!

site/docs/relations/logical_relations.md

Co-authored-by: Weston Pace <[email protected]>

jacques-n

Looks good. Thanks for the helpful documentation!

feat: add a SetMode option to the SetRel spec to allow encoding of di…

cabae65

…stinct/all SQL modifiers

kadinrabo requested review from jacques-n, cpcloud, westonpace, EpsilonPrime and vbarua as code owners September 18, 2024 13:26

vbarua reviewed Sep 18, 2024

View reviewed changes

proto/substrait/algebra.proto Show resolved Hide resolved

kadinrabo added 2 commits September 18, 2024 13:55

deprecated flag

45013dc

buf lint failures

fd8c0c0

vbarua requested changes Sep 18, 2024

View reviewed changes

vbarua changed the title ~~feat: add a SetMode option to the SetRel spec to allow encoding of distinct/all SQL modifiers~~ feat: add ALL/DISTINCT modifier for all set operation types Sep 18, 2024

comments

d745059

jacques-n reviewed Sep 19, 2024

View reviewed changes

separate all/distinct enum values

638aa00

vbarua reviewed Sep 23, 2024

View reviewed changes

minus/intersect all according to sql spec

305b339

vbarua reviewed Sep 26, 2024

View reviewed changes

westonpace reviewed Sep 27, 2024

View reviewed changes

remove SET_OP_INTERSECTION_PRIMARY_ALL

79106bf

vbarua reviewed Sep 27, 2024

View reviewed changes

site/docs/relations/logical_relations.md Outdated Show resolved Hide resolved

vbarua reviewed Sep 27, 2024

View reviewed changes

site/docs/relations/logical_relations.md Outdated Show resolved Hide resolved

vbarua reviewed Sep 27, 2024

View reviewed changes

site/docs/relations/logical_relations.md Outdated Show resolved Hide resolved

vbarua reviewed Sep 27, 2024

View reviewed changes

site/docs/relations/logical_relations.md Outdated Show resolved Hide resolved

vbarua previously approved these changes Sep 27, 2024

View reviewed changes

typos

22c48f4

kadinrabo dismissed vbarua’s stale review via 22c48f4 September 27, 2024 18:56

vbarua requested review from jacques-n and westonpace September 27, 2024 19:04

vbarua previously approved these changes Sep 27, 2024

View reviewed changes

westonpace previously approved these changes Sep 27, 2024

View reviewed changes

site/docs/relations/logical_relations.md Outdated Show resolved Hide resolved

example typo

63f5e16

Co-authored-by: Weston Pace <[email protected]>

kadinrabo dismissed stale reviews from westonpace and vbarua via 63f5e16 September 27, 2024 20:17

vbarua approved these changes Sep 27, 2024

View reviewed changes

vbarua requested a review from westonpace September 27, 2024 21:15

westonpace approved these changes Sep 27, 2024

View reviewed changes

jacques-n approved these changes Sep 27, 2024

View reviewed changes

vbarua merged commit f796521 into substrait-io:main Sep 27, 2024
13 checks passed

kadinrabo mentioned this pull request Oct 2, 2024

chore: update to substrait v0.57.0 substrait-io/substrait-java#300

Merged

vbarua mentioned this pull request Oct 9, 2024

feat(substrait): add intersect support to consumer apache/datafusion#12830

Merged

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

feat: add ALL/DISTINCT modifier for all set operation types #708

feat: add ALL/DISTINCT modifier for all set operation types #708

kadinrabo commented Sep 18, 2024 •

edited

Loading

CLAassistant commented Sep 18, 2024 •

edited

Loading

vbarua left a comment

jacques-n commented Sep 19, 2024

EpsilonPrime commented Sep 19, 2024

westonpace commented Sep 19, 2024 •

edited

Loading

kadinrabo commented Sep 19, 2024

westonpace commented Sep 19, 2024

vbarua commented Sep 19, 2024

jacques-n Sep 19, 2024

kadinrabo Sep 20, 2024

kadinrabo Sep 20, 2024

jacques-n commented Sep 19, 2024

jacques-n commented Sep 19, 2024

kadinrabo commented Sep 20, 2024

vbarua left a comment

vbarua Sep 23, 2024

vbarua left a comment

vbarua Sep 26, 2024

westonpace Sep 27, 2024

westonpace Sep 27, 2024

vbarua Sep 27, 2024 •

edited

Loading

kadinrabo Sep 27, 2024

vbarua Sep 27, 2024

westonpace left a comment

westonpace Sep 27, 2024

westonpace Sep 27, 2024

vbarua left a comment

vbarua left a comment

westonpace left a comment

jacques-n left a comment

	## Set Operation

	The set operation encompasses several set-level operations that support combining datasets, possibly excluding records based on various types of record level matching.

	\| Signature \| Value \|
	\| -------------------- \| ------------------------------------------------------------ \|
	\| Inputs \| 2 or more \|
	\| Outputs \| 1 \|
	\| Property Maintenance \| Maintains distribution if all inputs have the same ordinal distribution. Orderedness is not maintained. \|
	\| Direct Output Order \| The field order of the inputs. All inputs must have identical field types, but field nullabilities may vary. \|

	### Set Properties

	\| Property \| Description \| Required \|
	\| ------------------ \| --------------------------------- \| --------------------- \|
	\| Primary Input \| The primary input of the dataset. \| Required \|
	\| Secondary Inputs \| One or more relational inputs. \| At least one required \|
	\| Set Operation Type \| From list below. \| Required \|

	### Set Operation Types

	The set operation type determines both the records that are emitted and the type of the output record.

	\| Property \| Description \| Output Nullability
	\| ----------------------- \| ------------------------------------------------------------------------------------------------------------- \| ----------------------------- \|
	\| Minus (Primary) \| Returns all records from the primary input excluding any matching records from secondary inputs. \| The same as the primary input.
	\| Minus (Multiset) \| Returns all records from the primary input excluding any records that are included in all secondary inputs. \| The same as the primary input.
	\| Intersection (Primary) \| Returns all records from the primary input that match at least one record from any secondary inputs. \| If a field is nullable in the primary input and in any of the secondary inputs, it is nullable in the output.
	\| Intersection (Multiset) \| Returns all records from the primary input that match at least one record from all secondary inputs. \| If a field is required in any of the inputs, it is required in the output.
	\| Union Distinct \| Returns all the records from each set, removing any rows that are duplicated (within or across sets). \| If a field is nullable in any of the inputs, it is nullable in the output.
	\| Union All \| Returns all records from each set, allowing duplicates. \| If a field is nullable in any of the inputs, it is nullable in the output. \|

feat: add ALL/DISTINCT modifier for all set operation types #708

feat: add ALL/DISTINCT modifier for all set operation types #708

Conversation

kadinrabo commented Sep 18, 2024 • edited Loading

CLAassistant commented Sep 18, 2024 • edited Loading

vbarua left a comment

Choose a reason for hiding this comment

jacques-n commented Sep 19, 2024

EpsilonPrime commented Sep 19, 2024

westonpace commented Sep 19, 2024 • edited Loading

kadinrabo commented Sep 19, 2024

westonpace commented Sep 19, 2024

vbarua commented Sep 19, 2024

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

jacques-n commented Sep 19, 2024

jacques-n commented Sep 19, 2024

kadinrabo commented Sep 20, 2024

vbarua left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vbarua left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vbarua Sep 27, 2024 • edited Loading

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

westonpace left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

vbarua left a comment

Choose a reason for hiding this comment

vbarua left a comment

Choose a reason for hiding this comment

westonpace left a comment

Choose a reason for hiding this comment

jacques-n left a comment

Choose a reason for hiding this comment

kadinrabo commented Sep 18, 2024 •

edited

Loading

CLAassistant commented Sep 18, 2024 •

edited

Loading

westonpace commented Sep 19, 2024 •

edited

Loading

vbarua Sep 27, 2024 •

edited

Loading