Do not sort rows in `FirstValueAccumulator` #14402

blaginin · 2025-02-01T15:26:56Z

Which issue does this PR close?

Rationale for this change

Right now, when merging / updating batches in first_value / last_value, we sometimes sort the input just to find a single max / min row. In some cases we request only the top-1 value, but even that produces extra index buffers...

What changes are included in this PR?

Instead of sorting the arrays, find the min / max row directly.
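To illustrate the idea outside of Arrow, here is a minimal std-only sketch. It is not the code from this PR: `compare_rows`, `first_row_by_sort`, and `first_row_direct` are hypothetical names, and plain `Vec<i64>` columns stand in for the real sort columns. It contrasts building a full index buffer with a single-pass scan.

```rust
use std::cmp::Ordering;

/// Compare rows `a` and `b` column by column, all columns ascending.
/// `columns` is a hypothetical stand-in for the ORDER BY columns.
fn compare_rows(columns: &[Vec<i64>], a: usize, b: usize) -> Ordering {
    columns
        .iter()
        .map(|col| col[a].cmp(&col[b]))
        .find(|ord| *ord != Ordering::Equal)
        .unwrap_or(Ordering::Equal)
}

/// Sort-based approach: allocates and sorts an index buffer for all rows,
/// then reads only its first element.
fn first_row_by_sort(columns: &[Vec<i64>], num_rows: usize) -> Option<usize> {
    let mut indices: Vec<usize> = (0..num_rows).collect();
    indices.sort_by(|&a, &b| compare_rows(columns, a, b));
    indices.first().copied()
}

/// Direct approach: a single pass over the row indices with no index buffer.
fn first_row_direct(columns: &[Vec<i64>], num_rows: usize) -> Option<usize> {
    (0..num_rows).min_by(|&a, &b| compare_rows(columns, a, b))
}
```

Both functions pick the same row here, because `sort_by` is stable and `min_by` returns the first of several equal minima; the direct version simply skips the O(n log n) sort and the extra allocation.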

Are these changes tested?

Yes, by aggregate.slt. I also added new benchmarks to illustrate the speed improvement.

Are there any user-facing changes?

No

blaginin · 2025-02-01T15:27:51Z

group                                                main                                    new_lexcmp
first_last_ignore_nulls                              2.02      4.3±0.23ms        ? ?/sec     1.00      2.1±0.08ms        ? ?/sec
first_last_many_columns                              1.12      2.3±0.12ms        ? ?/sec     1.00      2.0±0.08ms        ? ?/sec
first_last_one_column                                1.20  1907.2±49.72µs        ? ?/sec     1.00  1584.4±47.60µs        ? ?/sec

blaginin · 2025-02-01T15:31:10Z

datafusion/functions-aggregate/src/first_last.rs

-            let indices = lexsort_to_indices(&sort_cols, None)?;
-            take_arrays(&filtered_states, &indices, None)?
-        };
+        let comparator = LexicographicalComparator::try_new(&sort_columns)?;


`lexsort_to_indices` has some additional optimizations for the single-column case, so I also wrote a version that reuses `lexsort_to_indices`, but the speedup there is actually 20% smaller than with this approach.

blaginin · 2025-02-01T15:32:39Z

datafusion/sqllogictest/test_files/group_by.slt

@@ -3003,7 +3003,7 @@ SELECT FIRST_VALUE(amount ORDER BY ts ASC) AS fv1,
  LAST_VALUE(amount ORDER BY ts ASC) AS fv2
  FROM sales_global
 ----
-30 100
+30 80


IMO, it actually makes more sense because that line has the same ORDER BY value but appears last in the table.
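The tie-breaking described above can be reproduced with the standard iterator adapters, which is a sketch of the behavior and not necessarily how the accumulator implements it. The `(ts, amount)` rows below are made-up data, not the contents of `sales_global`, and `first_and_last_amount` is a hypothetical helper. `min_by_key` returns the first of several equal minima, while `max_by_key` returns the last of several equal maxima.

```rust
/// Return (first amount, last amount) ordered by `ts` ascending.
/// Each row is a hypothetical `(ts, amount)` pair.
fn first_and_last_amount(rows: &[(i64, i64)]) -> Option<(i64, i64)> {
    // On ties, `min_by_key` keeps the earliest row in the input.
    let first = (0..rows.len()).min_by_key(|&i| rows[i].0)?;
    // On ties, `max_by_key` keeps the latest row in the input.
    let last = (0..rows.len()).max_by_key(|&i| rows[i].0)?;
    Some((rows[first].1, rows[last].1))
}
```

With rows `[(1, 30), (2, 50), (3, 100), (3, 80)]`, the two rows sharing `ts = 3` tie for the maximum, and the one that appears later in the input (amount 80) is chosen as the last value.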

blaginin · 2025-02-01T15:33:25Z

fyi @jayzhan211

jayzhan211 · 2025-02-02T07:13:21Z

datafusion/functions-aggregate/src/first_last.rs

-        }
+        let comparator = LexicographicalComparator::try_new(&sort_columns)?;
+        let best = (0..value.len())
+            .filter(|&index| !(self.ignore_nulls && value.is_null(index)))


If we make 'ignore nulls' a const generic, it might be faster, since we can skip the null-filtering logic when we don't ignore nulls.
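A rough sketch of what that suggestion could look like, assuming simplified inputs: `Option<i64>` values stand in for a nullable array, a plain `&[i64]` stands in for the ordering column, and `best_index` / `best_index_dyn` are hypothetical names. Whether this is measurably faster is untested here.

```rust
/// Find the index of the row with the smallest ordering key.
/// With `IGNORE_NULLS = false` the filter predicate is a compile-time
/// constant `true`, so the compiler is free to drop the null check.
fn best_index<const IGNORE_NULLS: bool>(
    values: &[Option<i64>],
    ordering: &[i64],
) -> Option<usize> {
    (0..values.len())
        .filter(|&i| !(IGNORE_NULLS && values[i].is_none()))
        .min_by(|&a, &b| ordering[a].cmp(&ordering[b]))
}

/// Dispatch once on the runtime flag, outside the per-row loop.
fn best_index_dyn(
    ignore_nulls: bool,
    values: &[Option<i64>],
    ordering: &[i64],
) -> Option<usize> {
    if ignore_nulls {
        best_index::<true>(values, ordering)
    } else {
        best_index::<false>(values, ordering)
    }
}
```

The runtime branch is taken once per batch, and each monomorphized copy of `best_index` carries only the logic it needs.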

Switch to LexicographicalComparator

00fdac7

github-actions bot added the core, sqllogictest, and functions labels on Feb 1, 2025

jayzhan211 requested a review from korowa February 2, 2025 07:07
