Is your feature request related to a problem or challenge?
Today, statistics for filter predicates are based on interval arithmetic invoked by PhysicalExpr::evaluate_bounds(). This works fine for numerical data. However, many expressions and data types are not supported by interval arithmetic, so proper selectivity estimation is not available for such expressions.
I noticed there were lots of discussions regarding statistics in the project lately. Work by folks from Synnada and others is currently in progress. If you feel like this issue is already addressed please let me know. I'd like to offer help with open tasks then.
Describe the solution you'd like
Add support for some missing pieces in interval arithmetic, e.g., temporal data.
Add PhysicalExpr::evaluate_statistics() to calculate expression-level statistics. This was already proposed by others.
I think this should return a new statistics struct on expression level which could look like this:
    pub struct ExpressionStatistics {
        /// Number of null values
        pub null_count: Precision<usize>,
        /// Number of output rows (cardinality)
        pub num_rows: Precision<ScalarValue>,
        /// Total number of input rows
        pub total_rows: Precision<ScalarValue>,
        /// Number of distinct values
        pub distinct_count: Precision<usize>,
    }
With evaluate_statistics() we add support for filter expressions such as string comparisons, InList, LikeExpr, or binary operators like IS_DISTINCT_FROM, IS_NOT_DISTINCT_FROM. It may be an iterative approach where we start with a few expression types and take it from there.
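To make the InList case more concrete, here is a rough sketch of how an expression-level estimate could be derived from column statistics. Everything in it is an assumption for illustration: `ColumnStats` and `estimate_in_list_rows` are hypothetical names, not DataFusion APIs, plain integers stand in for `Precision<..>`, and the heuristic (rows spread evenly over distinct values, every literal in the list occurring in the column) is a textbook-style simplification rather than anything the project has agreed on.

```rust
/// Hypothetical, simplified statistics for a single input column.
/// Plain integers stand in for `Precision<..>` to keep the sketch self-contained.
struct ColumnStats {
    total_rows: u64,
    null_count: u64,
    distinct_count: u64,
}

/// Estimates how many rows satisfy `col IN (<list>)`.
///
/// Assumptions (not taken from DataFusion): the list holds `list_len`
/// distinct non-null literals, each of them occurs in the column, and
/// non-null rows are spread evenly over the distinct values.
fn estimate_in_list_rows(stats: &ColumnStats, list_len: u64) -> u64 {
    if stats.distinct_count == 0 {
        return 0;
    }
    // NULL never matches an IN list, so only non-null rows can qualify.
    let non_null = stats.total_rows.saturating_sub(stats.null_count);
    // A list cannot match more distinct values than the column contains.
    let matched = list_len.min(stats.distinct_count);
    let estimate = non_null as f64 * matched as f64 / stats.distinct_count as f64;
    estimate.round() as u64
}
```

The result would populate `num_rows` in the proposed struct, with `total_rows` carried over from the input.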
Selectivity calculation is trivial: num_rows/total_rows.
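As a minimal sketch of that ratio, assuming a simplified stand-in for the proposed struct (`SimpleExprStats` and `selectivity` are hypothetical names, and `Option<u64>` replaces `Precision<ScalarValue>`), the only subtleties are unknown counts and an empty input:

```rust
/// Hypothetical, simplified stand-in for the proposed `ExpressionStatistics`.
/// `None` models an unknown (absent) statistic.
#[derive(Debug, Clone, Copy)]
struct SimpleExprStats {
    num_rows: Option<u64>,
    total_rows: Option<u64>,
}

/// Returns num_rows / total_rows, or `None` when either count is unknown
/// or the input is empty (the ratio is undefined in that case).
fn selectivity(stats: &SimpleExprStats) -> Option<f64> {
    match (stats.num_rows, stats.total_rows) {
        (Some(n), Some(t)) if t > 0 => {
            // Clamp to 1.0 in case inexact estimates give n > t.
            Some((n as f64 / t as f64).min(1.0))
        }
        _ => None,
    }
}
```

For example, 250 output rows out of 1,000 input rows gives `Some(0.25)`, while a missing `total_rows` gives `None`, so the caller can fall back to a default selectivity.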
We can utilise evaluate_bounds() for supported expressions. For example, from 2*A > B we get its target boundaries and calculate the selectivity as is done in analysis::calculate_selectivity().
    fn calculate_selectivity(
        target_boundaries: &[ExprBoundaries],
        initial_boundaries: &[ExprBoundaries],
    ) -> f64 {
        // Since the intervals are assumed uniform and the values
        // are not correlated, we need to multiply the selectivities
        // of multiple columns to get the overall selectivity.
        initial_boundaries
            .iter()
            .zip(target_boundaries.iter())
            .fold(1.0, |acc, (initial, target)| {
                acc * cardinality_ratio(&initial.interval, &target.interval)
            })
    }
This naive approach assumes uniformly distributed data. Heuristics, like various distribution types, could be added to ExpressionStatistics too. For the sake of simplicity I will not address this here.
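To illustrate what the uniformity and independence assumptions mean numerically, here is a self-contained toy version of the same product-of-ratios idea. `ClosedInterval`, `toy_cardinality_ratio`, and `toy_selectivity` are hypothetical stand-ins written for this sketch; they do not use DataFusion's `Interval` or `ExprBoundaries`, and they only handle closed integer ranges.

```rust
/// Hypothetical closed integer interval [lower, upper].
#[derive(Debug, Clone, Copy)]
struct ClosedInterval {
    lower: i64,
    upper: i64,
}

impl ClosedInterval {
    /// Number of integers in the interval; 0 when the interval is empty.
    fn cardinality(&self) -> u128 {
        if self.upper < self.lower {
            0
        } else {
            // Widen to i128 so that the full i64 range does not overflow.
            (self.upper as i128 - self.lower as i128 + 1) as u128
        }
    }
}

/// Fraction of the initial interval that survives in the target interval,
/// assuming values are uniformly distributed over the initial interval.
fn toy_cardinality_ratio(initial: &ClosedInterval, target: &ClosedInterval) -> f64 {
    let initial_card = initial.cardinality();
    if initial_card == 0 {
        return 0.0;
    }
    target.cardinality() as f64 / initial_card as f64
}

/// Multiplies the per-column ratios, which is only valid if the columns
/// are uncorrelated.
fn toy_selectivity(target: &[ClosedInterval], initial: &[ClosedInterval]) -> f64 {
    initial
        .iter()
        .zip(target.iter())
        .fold(1.0, |acc, (i, t)| acc * toy_cardinality_ratio(i, t))
}
```

If column A is narrowed from [1, 100] to [1, 50] and column B from [1, 20] to [11, 20], each ratio is 0.5 and the combined estimate is 0.25. Skewed or correlated columns would break this estimate, which is where distribution-aware statistics would come in.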
Happy to receive some feedback 🙂
Describe alternatives you've considered
No response
Additional context
Short disclaimer: I work for Coralogix like some other datafusion contributors.
Hi @ch-sc. I'll try to address your suggestions:
Add support for some missing stuff in interval arithmetics, i.e., temporal data.
I highly recommend completing support for all common and applicable data types in interval arithmetic. This would resolve many optimization challenges. You, or anyone interested, can work further on adding this support, and I will certainly assist as much as I can.
Add PhysicalExpr::evaluate_statistics() to calculate expression level statistic. This was already proposed by others.
With our new tools for statistics, implementing such an API will be straightforward since we will handle all types of expression evaluations (that's the hardest part). I believe we can have this ready by next week. Of course, we welcome any improvements and additions to it.
In short, the evaluation mechanism you're looking for is under way, and we can continue the discussion once it is ready.
cc: @thinkharderdev