Support vectorized append and compare for multi group by (apache#12996)
* simple support vectorized append.

* fix tests.

* some logs.

* add `append_n` in `MaybeNullBufferBuilder`.

* impl basic append_batch

* fix equal to.

* define `GroupIndexContext`.

* define the structs useful in vectorizing.

* re-define some structs for vectorized operations.

* impl some vectorized logics.

* impl checking hashmap stage.

* fix compile.

* tmp

* define and impl `vectorized_compare`.

* fix compile.

* impl `vectorized_equal_to`.

* impl `vectorized_append`.

* finish the basic vectorized ops logic.

* impl `take_n`.

* fix `renaming clear` and `groups fill`.

* fix infinite loop due to rehashing.

* fix vectorized append.

* add counter.

* use extend rather than resize.

* remove dbg!.

* remove reserve.

* refactor the code to make it simpler and more performant.

* clear `scalarized_indices` in `intern` to avoid some corner cases.

* fix `scalarized_equal_to`.

* fall back to the fully scalarized `GroupValuesColumn` in streaming aggregation.

* add unit test for `VectorizedGroupValuesColumn`.

* add unit test for emitting first n in `VectorizedGroupValuesColumn`.

* sort out test code for group columns and add vectorized tests for primitives.

* add vectorized test for byte builder.

* add vectorized test for byte view builder.

* add tests for the all-nulls and no-nulls branches in the vectorized path.

* fix clippy.

* fix fmt.

* fix compile in rust 1.79.

* improve comments.

* fix doc.

* add more comments to explain the really complex vectorized intern process.

* add comments to explain why we still need the original `GroupValuesColumn`.

* remove some stale comments.

* fix clippy.

* add comments for `vectorized_equal_to` and `vectorized_append`.

* fix clippy.

* use zip to simplify code.

* use izip to simplify code.

* Update datafusion/physical-plan/src/aggregates/group_values/group_column.rs

Co-authored-by: Jay Zhan <[email protected]>

* first_n attempt

Signed-off-by: jayzhan211 <[email protected]>

* add test

Signed-off-by: jayzhan211 <[email protected]>

* improve hashtable modification in emit first n test.

* add `emit_group_index_list_buffer` to avoid allocating a new `Vec` to store the remaining group indices.

* make comments in VectorizedGroupValuesColumn::intern simpler and clearer.

* define `VectorizedOperationBuffers` to hold buffers used in vectorized operations to make code clearer.

* unify `VectorizedGroupValuesColumn` and `GroupValuesColumn`.

* fix fmt.

* fix comments.

* fix clippy.

---------

Signed-off-by: jayzhan211 <[email protected]>
Co-authored-by: Jay Zhan <[email protected]>
Rachelint and jayzhan211 authored Nov 6, 2024
1 parent c3a9847 commit 345117b
Showing 9 changed files with 2,296 additions and 227 deletions.
2 changes: 1 addition & 1 deletion datafusion/common/src/utils/memory.rs
@@ -102,7 +102,7 @@ pub fn estimate_memory_size<T>(num_elements: usize, fixed_size: usize) -> Result

#[cfg(test)]
mod tests {
- use std::collections::HashSet;
+ use std::{collections::HashSet, mem::size_of};

use super::estimate_memory_size;

@@ -19,6 +19,7 @@
//! user defined aggregate functions
use std::hash::{DefaultHasher, Hash, Hasher};
use std::mem::{size_of, size_of_val};
use std::sync::{
atomic::{AtomicBool, Ordering},
Arc,