Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Support GroupsAccumulator accumulator for User Defined Functions #8793

Closed
alamb opened this issue Jan 8, 2024 · 3 comments · Fixed by #8892
Closed

Support GroupsAccumulator accumulator for User Defined Functions #8793

alamb opened this issue Jan 8, 2024 · 3 comments · Fixed by #8892
Labels
enhancement New feature or request

Comments

@alamb
Copy link
Contributor

alamb commented Jan 8, 2024

Is your feature request related to a problem or challenge?

Many of DataFusion's built in accumlators use the GroupsAccumulator interface for a significant performance improvement

See the blog post for more details

However, it is not currently possible to use [GroupsAccumulator] with user defined Aggregate functions , which is less than ideal for 2 reasons:

  1. User's can't take advantage of the faster interface
  2. We can't port built in Aggregate functions to use the same API [Epic] Unify AggregateFunction Interface (remove built in list of AggregateFunction s), improve the system #8708

Describe the solution you'd like

I would like to extend the AggregateUDFImpl trait added by @guojidan in #8733 with a new method that returns a GroupsAccumulator

  1. Add new methods to AggregateUDFImpl
  2. Wire in the the call into the impl for AggregateExpr https://github.com/apache/arrow-datafusion/blob/dd58c5a7b0069e3bade997abad9ec6bed8c8a1c3/datafusion/physical-plan/src/udaf.rs#L75
  3. Update advanced_udaf.rs example to create a GroupsAccumulator

Adding new methods to AggregateUDFImpl:

Following the model of PartitionEvaluator and groups_accumulator_supported

pub trait AggregateUDFImpl {
...
    /// If the aggregate expression has a specialized
    /// [`GroupsAccumulator`] implementation. If this returns true,
    /// `[Self::create_groups_accumulator`] will be called.
    fn groups_accumulator_supported(&self) -> bool {
        false
    }

   /// Return a specialized [`GroupsAccumulator`] that manages state
    /// for all groups.
    ///
    /// For maximum performance, a [`GroupsAccumulator`] should be
    /// implemented in addition to [`Accumulator`].
    fn create_groups_accumulator(&self) -> Result<Box<dyn GroupsAccumulator>> {
        not_impl_err!("GroupsAccumulator hasn't been implemented for {self:?} yet")
    }

}

Describe alternatives you've considered

No response

Additional context

No response

@guojidan
Copy link
Contributor

guojidan commented Jan 9, 2024

I will deal this issue after #8733 finished and merged

@guojidan
Copy link
Contributor

GroupsAccumulator is in datafusion-physical-expr package, datafusion-physical-expr package depend datafusion-expr package, so if we use GroupsAccumulator in datafusion-expr package will cause reference loops 😢

@alamb
Copy link
Contributor Author

alamb commented Jan 12, 2024

GroupsAccumulator is in datafusion-physical-expr package, datafusion-physical-expr package depend datafusion-expr package, so if we use GroupsAccumulator in datafusion-expr package will cause reference loops 😢

Is it possible to move GroupsAccumulator into datafusion_expr (in the same place as Accumulator)?

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
enhancement New feature or request
Projects
None yet
Development

Successfully merging a pull request may close this issue.

2 participants