-
Notifications
You must be signed in to change notification settings - Fork 158
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
feat: enhance VirtualTable to have expression as value #711
feat: enhance VirtualTable to have expression as value #711
Conversation
ACTION NEEDED Substrait follows the Conventional Commits The PR title and description are used as the merge commit message. Please update your PR title and description to match the specification. |
14eb28c
to
a0fc6ab
Compare
Patch makes sense. We should update the description message in this PR to also explain the deprecation. |
proto/substrait/algebra.proto
Outdated
@@ -91,6 +92,11 @@ message ReadRel { | |||
repeated Expression.Literal.Struct values = 1; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Rather than deprecate VirtualTable
and create VirtualExpressionTable
, would we consider deprecating the values
field and adding a new expressions
field?
edit: I'm not a fan of Protobuf quirks becoming load-bearing, but this seems to indicate a field can be replaced with a oneof
while maintaining binary compatibility. 😅
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
edit: consider this retracted
Expanding on this, I'm almost inclined to have both? Like
message VirtualTable {
repeated Expression.Literal.Struct values = 1;
repeated Expression expression = 2;
}
which emits first the values
and then expressions
. Makes the common case really easy to produce (literals only).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'm not a fan of this. It creates this confusing situation where you have something like
literal1, literal2, expression1, literal3
and it gets stored as
values: [literal1, literal2]
expression: [expresion1, literal3]
> Rather than deprecate VirtualTable and create VirtualExpressionTable, would we consider deprecating the values field and adding a new expressions field?
I don't see how that is better (or clearer for someone new to the format).
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
It looks like we have a protobuf check set up so if we can't use the oneof it will let us know. I do know that it is less permissive than I'm used to:
https://github.com/substrait-io/substrait/blob/main/.github/workflows/pr.yml#L35
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
protobuf oneof can't have repeated fields link
A oneof cannot be repeated.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We can't add oneof as mentioned above so both fields can contain values
I too share concern with @jacques-n having both fields where initial literals are in Expression_Literal and rest in Expression (assuming there is one expression) doesn't seem intuitive .
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't have a strong opinion on "VirtualTable with expressions field" vs. "VirtualExpressionTable as new relation" but if I had to pick I would go with @acruise 's suggestion (just add a new field, not a new relation).
A virtual expression table is a table whose contents are embedded in the plan itself. The table data | ||
is encoded as records consisting of expression values. | ||
|
||
| Property | Description | Required | | ||
| -------- | ----------- | -------- | | ||
| Data | Required | Required | |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think an important property of a virtual (expression) table is that it doesn't contain any field references. This may not be obvious to new readers to Substrait. In other words, they may look at a virtual table and think "this is weird, it looks like a projection? Maybe this is some kind of select relation?"
Maybe something like...
A virtual expression table is a table whose contents are embedded in the plan itself. Each column is an expression that can be resolved without referencing any input data. These expressions will not contain field references but may contain function calls, literals, and other expression nodes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Adding to this, I assume that these expression shouldn't contain other relations (i.e. subqueries)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should have prohibitions against either (field references and subqueries).
That being said, a valid field reference in this context is impossible because there are zero fields to reference. So having field[47] here is just as illegal as having a readrel that has two columns and trying to do field[47].
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
As a fun aside, in Snowflake the following is legal.
select * from (values (
(select 1 where (
select true)
),
'Hello'||'World')
)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The problem with nested queries having a field reference is that the resulting structure is no longer scalar. For systems without virtual table support I have had to write tables to disk and then register named tables. For those systems I would have to reject field references.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think we should have prohibitions against either (field references and subqueries).
That's fine. I don't know that we need to go so far as prohibit things. My goal is just to make it as obvious as possible what the message is used for. I'd be find with wording like...
A virtual expression table is a table whose contents are embedded in the plan itself.
Each column is an expression that can be resolved without referencing any input data.
For example, a literal, a function call involving literals, or any other expression that does
not require input.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I'd be find with wording like
Fixed doc to indicate virtual expression table are not dependent on any input
I believe that only works for optional/required fields. I believe repeated can be in a oneof because of how they are serialized. |
This enables expressing Virtual table having values as expression Also marked VirtualTable as deprecated
a0fc6ab
to
d6f4a9a
Compare
…dent on any input
d6f4a9a
to
b2e3d6b
Compare
|
I feel like the main outstanding question is whether we create new table type versus adding field to existing and deprecating old fields. @westonpace, @vbarua and @EpsilonPrime, thoughts? It's subjective but I'm definitely a fan of keeping the two table types separate so nobody can make a mistake and set both fields. |
My opinion hasn't changed. I'll approve either approach. If I need to take a side I still prefer one relation. We have, several times now, deprecated fields and not worried too much about users setting both values. I think confusion about "why are there two relations that seem to do the same thing" is going to be more confusing than "why are there two fields, one which is marked deprecated, and one that isn't". |
I'm fine with @westonpace's preferred change. @anshuldata can you update to single message with two sets of repeated fields? |
…teral field as deprecated
Fixed as suggested. Basically to add another field in VirtualTable instead of another Relation type |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given the constraints of proto compatibility, this seems like as good as it's going to get. Thanks for working on this @anshuldata.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Added some minor comments as well which I don't consider blocking.
@anshuldata Hi! Trying to adapt the changes in datafusion. It's not clear how to handle multiple rows in values, eq |
@akoshchiy upd: my bad, never mind. didn't notice the pluses. I'm guessing it should be |
I think this is a really good question actually. with the Wouldn't it make have made more sense in the first place for the new field to have been |
Yes, agree. I missed this in my original change. I will raise a PR to fix it |
select * from (values (1+2, 'Hello'||'World'))