doc: why nullable of list item is set to true #11626
Notes:
Please recommend changes to copy or critique this approach.
tbh I would be putting this to trait object instead of copying through methods.
Do you mean adding the documentation to https://docs.rs/datafusion/latest/datafusion/logical_expr/trait.AggregateUDFImpl.html#method.state_fields?
yep, it might be a better place imho. Especially if it relates to all function implementations
This does not belong in trait object because this affects only a few aggregate functions, not all of them. For some aggregate functions the intermediate accumulator state often has:
Field::new_list(
format_state_name(args.name, "distinct_array_agg"),
Field::new("item", args.input_type.clone(), true), // [1] should always be true
true, // [2] or false
)
At first glance it looked like nullable of the list item should be configurable. There were also multiple PRs in this direction before we realized it was all unnecessary. To get rid of the comment duplication, I'll instead move them to a markdown doc within
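To make the two flags marked [1] and [2] above concrete, here is a minimal sketch in plain Rust. `ItemField`, `ListField`, `state_list_field` and `fits` are hypothetical stand-ins invented for illustration; they are not the arrow-rs or DataFusion API. The sketch assumes the point of a nullable item is that the state field accepts values with or without nulls, so the accumulator needs no special handling:

```rust
// Simplified, hypothetical stand-ins for Arrow's `Field`; not the real arrow-rs API.
#[derive(Debug, Clone, PartialEq)]
struct ItemField {
    name: String,
    nullable: bool,
}

#[derive(Debug, Clone, PartialEq)]
struct ListField {
    name: String,
    item: ItemField,
    nullable: bool, // nullability of the list container itself
}

/// Build a state field shaped like the snippet above: the item is always
/// nullable ([1]), while the list container's nullability ([2]) is chosen
/// per aggregate function.
fn state_list_field(name: &str, list_nullable: bool) -> ListField {
    ListField {
        name: name.to_string(),
        item: ItemField { name: "item".to_string(), nullable: true },
        nullable: list_nullable,
    }
}

/// Check whether a state value can be stored under `field`.
/// `None` at the outer level is a null list; `None` inside is a null item.
fn fits(field: &ListField, value: &Option<Vec<Option<i64>>>) -> bool {
    match value {
        None => field.nullable,
        Some(items) => field.item.nullable || items.iter().all(|v| v.is_some()),
    }
}
```

With the item always nullable, a list such as `Some(vec![Some(1), None])` fits the field regardless of the input column's nullability; only the container flag decides whether a null list is accepted.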
Looks like an improvement to me -- thank you @jcsherin and @jayzhan211 and @comphead
I also marked this PR ready for review as it looks good to me
@@ -203,6 +203,7 @@ impl AggregateUDFImpl for BitwiseOperation {
args.name,
format!("{} distinct", self.name()).as_str(),
),
// See COMMENTS.md to understand why nullable is set to true
👍
## Computing Intermediate State

By setting `nullable` to be always `true` like this we ensure that the
Is another rationale that the intermediate results need to be able to represent "saw no rows" (e.g that partition had no values)?
For the `nth_value` accumulator, when no rows are present in the partition, no values are added to the intermediate state.
I haven't checked the other aggregates though, so I don't know for certain whether this is always the case. I'll verify and make a follow-on PR if any differences exist. I think we've only looked deeper into `nth_value` and `array_agg` (by @jayzhan211) at the moment.
Makes sense -- I vaguely remember that the null was needed in one of the aggregators to distinguish between
- only empty lists had been seen: `[]`
- no lists at all had been seen: `NULL`
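The distinction above can be sketched with `Option<Vec<_>>` in plain Rust. This is an illustrative model only: the `State` alias and the `merge` function are hypothetical and do not come from DataFusion, and the merge rule is an assumption about how such states could be combined.

```rust
/// Hypothetical per-partition state: `None` means no lists were seen at all,
/// `Some(vec![])` means only empty lists were seen.
type State = Option<Vec<i64>>;

/// Combine two partition states while keeping the two cases distinct.
fn merge(a: State, b: State) -> State {
    match (a, b) {
        // Neither partition saw a list: the result stays NULL.
        (None, None) => None,
        // Only one side saw lists: keep it, even if it is empty.
        (Some(x), None) | (None, Some(x)) => Some(x),
        // Both sides saw lists: concatenate their items.
        (Some(mut x), Some(y)) => {
            x.extend(y);
            Some(x)
        }
    }
}
```

Here `merge(Some(vec![]), None)` yields `Some(vec![])` rather than `None`, which is only expressible when the list container itself is allowed to be null.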
@alamb Thanks for the pointer. I'll keep this in mind while making a pass through the aggregates next time.
I made a minor copy change to disambiguate that the "Computing Intermediate State" section is talking about the nullability of the list item rather than the nullability of the list container.
Sorry for the confusion. I was not clear earlier.
I pushed a minor copy edit and CI failed. Looking at the error logs, it looks to me like it is not related to this change.
The clippy errors in CI are being tracked here - #11651.
I merged up from main to get the fix for the clippy errors
@alamb Thank you.
In `array_agg` the list is nullable, so changed the example to `nth_value` where the list is not nullable to be correct.
Thanks @jcsherin
Thanks again - we can iterate on the docs in follow on PRs if there is more to do
Thanks for the review feedback - @alamb, @comphead and for prior discussions @jayzhan211.
Which issue does this PR close?
Closes #11625.
Rationale for this change
When working on issues related to #8708 there have been multiple PRs which dealt with the nullability of list item in accumulator state. This doc patch makes the reasoning of existing code explicit.
What changes are included in this PR?
Only doc comments are added. There are no code changes.
Aggregate functions which use data type of first argument:
- ArrayAgg
- NthValueAgg
- Count
Aggregate functions which use data type of returned value:
- BitwiseOperation
- Sum
Are these changes tested?
Are there any user-facing changes?