Convert nth_value builtIn function to User Defined Window Function #13201
This is so exciting. FYI @jonathanc-n and @Omega359
I personally think it would be fine to leave Perhaps you can leave a stub in like
enum BuiltInWindowFunction {
    // Never created, will be removed in a follow on PR
    Stub
}
Then we can focus this PR on making sure that
Thanks @alamb, I will continue with what you've said.
8cf82f0 to fda6a6f
Wanted to update here. I think I'm almost finished but probably encountered a side effect. This query fails in the slt file:
I hope to fix this and make this ready tomorrow.
In the built-in (older) version the output field is defined like:
fn field(&self) -> Result<Field> {
    let nullable = true;
    Ok(Field::new(&self.name, self.data_type.clone(), nullable))
}
In the current code, the data type of the field is hard-coded as
fn field(&self, field_args: WindowUDFFieldArgs) -> Result<Field> {
    let nullable = true;
    Ok(Field::new(field_args.name(), DataType::UInt64, nullable))
}
To fix this use
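The difference between the two `field` implementations can be sketched in isolation. The `DataType` and `Field` types below are hypothetical stand-ins, not arrow's real types, and `field_from_input` assumes the output type should follow the first input expression (the later EXPLAIN output in this thread reports `data_type: Float64` for a `Float64` input, which is consistent with that assumption).

```rust
// Hypothetical stand-ins for arrow's `DataType` and `Field`; not the real types.
#[derive(Debug, Clone, PartialEq)]
pub enum DataType {
    UInt64,
    Float64,
}

#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub name: String,
    pub data_type: DataType,
    pub nullable: bool,
}

/// The problematic shape: the output type ignores the input entirely.
pub fn field_hard_coded(name: &str) -> Field {
    Field {
        name: name.to_string(),
        data_type: DataType::UInt64,
        nullable: true,
    }
}

/// Assumed shape of the fix: take the output type from the first input
/// expression, and report an error if there is no input at all.
pub fn field_from_input(name: &str, input_types: &[DataType]) -> Result<Field, String> {
    let data_type = input_types
        .first()
        .cloned()
        .ok_or_else(|| format!("{} expects at least one argument", name))?;
    Ok(Field {
        name: name.to_string(),
        data_type,
        nullable: true,
    })
}
```

With input types `[Float64, UInt64]`, `field_from_input` reports `Float64`, while `field_hard_coded` reports `UInt64` regardless of the input.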
0706334 to fddbc58
Thanks, @jcsherin, that was the fix. I've fixed that issue but encountered another one. I return an error from the partition evaluator, but I think it is not honored.
But it should not succeed since:
TL;DR
For invalid input expressions, built-in window functions fail early when converting the logical plan to a physical plan. But user-defined window functions will complete planning, and fail only during physical execution. Validation of input expressions in user-defined window functions runs only during physical execution. In this case, is it not better for udwf to fail early when converting to a physical plan? A possible solution is to update
datafusion/datafusion/physical-plan/src/windows/mod.rs Lines 158 to 164 in b61b2fc
Edge Case: Empty Table
DataFusion CLI v42.2.0
> CREATE TABLE t1(v1 BIGINT);
0 row(s) fetched.
Elapsed 0.020 seconds.
There are currently no rows in
datafusion/datafusion/physical-plan/src/windows/window_agg_exec.rs Lines 319 to 321 in 89e96b4
The
> SELECT NTH_VALUE('+Inf'::Double, v1) OVER (PARTITION BY v1) FROM t1;
+-------------------------------------------------------------------------------------------------------------+
| nth_value(Utf8("+Inf"),t1.v1) PARTITION BY [t1.v1] ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING |
+-------------------------------------------------------------------------------------------------------------+
+-------------------------------------------------------------------------------------------------------------+
0 row(s) fetched.
Elapsed 0.018 seconds.
After we insert a few values into
> insert into t1 values (123), (456);
+-------+
| count |
+-------+
| 2 |
+-------+
1 row(s) fetched.
Elapsed 0.007 seconds.
> SELECT NTH_VALUE('+Inf'::Double, v1) OVER (PARTITION BY v1) FROM t1;
This feature is not implemented: There is only support Literal types for field at idx: 1 in Window Function
Planning divergence between built-in & user-defined window functions
In
> EXPLAIN SELECT NTH_VALUE('+Inf'::Double, v1) OVER (PARTITION BY v1) FROM t1;
+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type | plan |
+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan | Projection: NTH_VALUE(Utf8("+Inf"),t1.v1) PARTITION BY [t1.v1] ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING |
| | WindowAggr: windowExpr=[[NTH_VALUE(Float64(inf), t1.v1) PARTITION BY [t1.v1] ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING AS NTH_VALUE(Utf8("+Inf"),t1.v1) PARTITION BY [t1.v1] ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING]] |
| | TableScan: t1 projection=[v1] |
+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row(s) fetched.
Elapsed 0.009 seconds.
But this is not the case for user-defined window functions. In this branch we instead see that a complete plan is built and the failure happens only when the query executes:
> EXPLAIN SELECT NTH_VALUE('+Inf'::Double, v1) OVER (PARTITION BY v1) FROM t1;
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type | plan |
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan | Projection: nth_value(Utf8("+Inf"),t1.v1) PARTITION BY [t1.v1] ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING |
| | WindowAggr: windowExpr=[[nth_value(Float64(inf), t1.v1) PARTITION BY [t1.v1] ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING AS nth_value(Utf8("+Inf"),t1.v1) PARTITION BY [t1.v1] ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING]] |
| | TableScan: t1 projection=[v1] |
| physical_plan | ProjectionExec: expr=[nth_value(Utf8("+Inf"),t1.v1) PARTITION BY [t1.v1] ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING@1 as nth_value(Utf8("+Inf"),t1.v1) PARTITION BY [t1.v1] ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING] |
| | WindowAggExec: wdw=[nth_value(Utf8("+Inf"),t1.v1) PARTITION BY [t1.v1] ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING: Ok(Field { name: "nth_value(Utf8(\"+Inf\"),t1.v1) PARTITION BY [t1.v1] ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING", data_type: Float64, nullable: true, dict_id: 0, dict_is_ordered: false, metadata: {} }), frame: WindowFrame { units: Rows, start_bound: Preceding(UInt64(NULL)), end_bound: Following(UInt64(NULL)), is_causal: false }] |
| | SortExec: expr=[v1@0 ASC NULLS LAST], preserve_partitioning=[false] |
| | MemoryExec: partitions=1, partition_sizes=[0] |
| | |
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 row(s) fetched.
Elapsed 0.019 seconds.
@jcsherin thanks for the very detailed explanation. In this case, I think it would be better to update WindowUDFImpl in a follow-up PR as an enhancement, right? I can skip this test case in the scope of this PR. Please correct me if I'm wrong.
Sure, we can improve the API in another PR. Here is a workaround that fixes the failing test:
// In datafusion/physical-plan/src/windows/mod.rs
fn create_udwf_window_expr(
    fun: &Arc<WindowUDF>,
    args: &[Arc<dyn PhysicalExpr>],
    input_schema: &Schema,
    name: String,
    ignore_nulls: bool,
) -> Result<Arc<dyn BuiltInWindowFunctionExpr>> {
    // need to get the types into an owned vec for some reason
    let input_types: Vec<_> = args
        .iter()
        .map(|arg| arg.data_type(input_schema))
        .collect::<Result<_>>()?;

    let udwf_expr = Arc::new(WindowUDFExpr {
        fun: Arc::clone(fun),
        args: args.to_vec(),
        input_types,
        name,
        is_reversed: false,
        ignore_nulls,
    });

    // Early validation of input expressions
    //
    // We create a partition evaluator because in the user-defined window
    // implementation this is where code for parsing input expressions
    // exists. The benefits are:
    // - If any of the input expressions are invalid we catch them early
    //   in the planning phase, rather than during execution.
    // - Maintains compatibility with built-in (now removed) window
    //   functions validation behavior.
    // - Predictable and reliable error handling.
    //
    // See discussion here:
    // https://github.com/apache/datafusion/pull/13201#issuecomment-2454209975
    let _ = udwf_expr.create_evaluator()?;

    Ok(udwf_expr)
}
I verified that this works in your branch.
DataFusion CLI v42.2.0
> CREATE TABLE t1(v1 BIGINT);
0 row(s) fetched.
Elapsed 0.019 seconds.
> EXPLAIN SELECT NTH_VALUE('+Inf'::Double, v1) OVER (PARTITION BY v1) FROM t1;
+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type | plan |
+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan | Projection: nth_value(Utf8("+Inf"),t1.v1) PARTITION BY [t1.v1] ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING |
| | WindowAggr: windowExpr=[[nth_value(Float64(inf), t1.v1) PARTITION BY [t1.v1] ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING AS nth_value(Utf8("+Inf"),t1.v1) PARTITION BY [t1.v1] ROWS BETWEEN UNBOUNDED PRECEDING AND UNBOUNDED FOLLOWING]] |
| | TableScan: t1 projection=[v1] |
+--------------+----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
1 row(s) fetched.
Elapsed 0.018 seconds.
> SELECT NTH_VALUE('+Inf'::Double, v1) OVER (PARTITION BY v1) FROM t1;
This feature is not implemented: There is only support Literal types for field at idx: 1 in Window Function
This workaround may not be ideal, but at least we do not have to skip this test. Also please feel free to update the code/comments as you see fit.
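The empty-table edge case and the early-validation workaround discussed above can be modelled without any DataFusion types. This is only a sketch under stated assumptions: `Arg`, `NthValueEvaluator`, `create_evaluator`, `execute` and `plan` are hypothetical names invented for illustration, and the model returns one value per partition rather than one per row.

```rust
// Hypothetical stand-ins; none of these names are DataFusion APIs.
#[derive(Debug, Clone, PartialEq)]
pub enum Arg {
    Literal(i64),
    Column(String),
}

#[derive(Debug)]
pub struct NthValueEvaluator {
    n: i64,
}

/// Plays the role of evaluator construction: the only place where the
/// argument at index 1 is checked to be a literal.
pub fn create_evaluator(args: &[Arg]) -> Result<NthValueEvaluator, String> {
    match args.get(1) {
        Some(Arg::Literal(n)) => Ok(NthValueEvaluator { n: *n }),
        Some(Arg::Column(name)) => {
            Err(format!("expected a literal at idx 1, got column {}", name))
        }
        None => Err("missing argument at idx 1".to_string()),
    }
}

/// Lazy validation: an evaluator is built per partition during execution,
/// so with zero partitions the invalid argument is never noticed.
/// Simplification: yields one value per partition and treats n < 1 as "no value".
pub fn execute(args: &[Arg], partitions: &[Vec<i64>]) -> Result<Vec<Option<i64>>, String> {
    let mut out = Vec::with_capacity(partitions.len());
    for partition in partitions {
        let evaluator = create_evaluator(args)?;
        let value = if evaluator.n >= 1 {
            partition.get((evaluator.n - 1) as usize).copied()
        } else {
            None
        };
        out.push(value);
    }
    Ok(out)
}

/// Eager validation: build (and discard) one evaluator while planning, so
/// invalid arguments are rejected whether or not the input has any rows.
pub fn plan(args: &[Arg]) -> Result<(), String> {
    let _ = create_evaluator(args)?;
    Ok(())
}
```

In this model, arguments `[Literal(1), Column("v1")]` pass through `execute` without error when there are no partitions, fail in `execute` once a partition exists, and are rejected by `plan` in both situations, which is the behaviour the workaround aims for.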
42bb690 to aa18c1d
/// Create an expression to represent the `nth_value` window function
///
pub fn nth_value(arg: datafusion_expr::Expr, n: Option<i64>) -> datafusion_expr::Expr {
The type of n is i64, not Option<i64>.
See the rust docs: https://docs.rs/datafusion/latest/datafusion/logical_expr/window_function/fn.nth_value.html
Also add a roundtrip logical plan test for this API here:
cume_dist(),
Fixed and added test
@buraksenn Tremendous effort 🙌. These changes look good to me.
Co-authored-by: Sherin Jacob <[email protected]>
@buraksenn and @berkaysynnada Thanks! @alamb This PR is ready.
Awesome -- thank you so much. I will review this PR hopefully later today
Thank you so much @buraksenn , @jcsherin -- it is just so beautiful to see this PR now after all the work. It is basically perfect from my perspective 🏆
@@ -70,6 +70,7 @@ tokio = { workspace = true }
[dev-dependencies]
criterion = { version = "0.5", features = ["async_futures"] }
datafusion-functions-aggregate = { workspace = true }
datafusion-functions-window = { workspace = true }
Some day I hope we can remove these dependencies (so we can make testing physical-plan faster), but that is not a part of this PR.
// We create a partition evaluator because in the user-defined window
// implementation this is where code for parsing input expressions
// exist. The benefits are:
// - If any of the input expressions are invalid we catch them early
💯 for these comments that explain the rationale
I also took the liberty of merging up from main to make sure we haven't hit any logical conflicts with this PR
I don't think there is any reason to wait around for this PR -- people know it is coming, so let's get this in 🚀
Which issue does this PR close?
Closes #12649
Rationale for this change
Context: #8709
What changes are included in this PR?
Are these changes tested?
Are there any user-facing changes?
no