Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

feat: implement StringColumn using StringViewArray #16610

Merged
merged 89 commits into from
Nov 8, 2024

Conversation

andylokandy
Copy link
Collaborator

@andylokandy andylokandy commented Oct 15, 2024

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

Use StringViewArray to replace StringArray in memory column format

Performance of string view in kernels:

image

image

image

filter benchmark scripts:

create or replace table test_filter (
 ss string not null comment 'short_str',   
 ms string not null comment 'medium_str',  
 ls string not null comment 'long_str',    
 number int64 not null
) Engine = Memory;


insert into test_filter select repeat('a',  number % 12 + 1), repeat('a',  number % 200 + 1),  repeat( substr( rand()::string, 3, 1) ,  500), number from numbers(10000000);


strs=("ss" "ms" "ls")
for str in "${strs[@]}"; do
  for i in 5 50 95; do
    echo "select $str from test_filter where number % 100 <= $i ignore_result;" | bendsql --time
  done
done

compact

cargo bench -- "take_compact_"

compact

cargo bench -- "concat_"

Performance of comparison should be improved in another PR.

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):

This change is Reviewable

depends on arrow-udf/arrow-udf#78

@github-actions github-actions bot added the pr-feature this PR introduces a new feature to the codebase label Oct 15, 2024
@sundy-li
Copy link
Member

need to finish the datatype matches.

#[cfg(feature = "arrow")]
impl From<DataType> for arrow_schema::DataType {

#[cfg(feature = "arrow")]
impl From<arrow_schema::DataType> for DataType {

@sundy-li sundy-li removed the ci-benchmark Benchmark: run all test label Nov 6, 2024
@sundy-li sundy-li added the ci-benchmark Benchmark: run all test label Nov 7, 2024
@sundy-li sundy-li marked this pull request as ready for review November 7, 2024 11:42
@databendlabs databendlabs deleted a comment from github-actions bot Nov 7, 2024
@databendlabs databendlabs deleted a comment from github-actions bot Nov 7, 2024
@sundy-li sundy-li added ci-benchmark-cloud Benchmark: run only cloud tests for tpch/hits and removed ci-benchmark Benchmark: run all test labels Nov 8, 2024
Copy link
Contributor

github-actions bot commented Nov 8, 2024

Docker Image for PR

  • tag: pr-16610-5ded4ab-1731028380

note: this image tag is only available for internal use,
please check the internal doc for more details.

1 similar comment
@sundy-li
Copy link
Member

sundy-li commented Nov 8, 2024

hits: benchmark.databend.com/clickbench/pr/16610/11733802627/hits.html
tpch: benchmark.databend.com/clickbench/pr/16610/11733802627/tpch.html

Rerun the TPCH tests in large warehouse with (CI and bendsql).
The previous result is not stable due to env.

This pr is large enough and did not introduce breaking changes, let's merge it and imporve later.

@sundy-li sundy-li removed the ci-benchmark-cloud Benchmark: run only cloud tests for tpch/hits label Nov 8, 2024
@sundy-li sundy-li added this pull request to the merge queue Nov 8, 2024
github-merge-queue bot pushed a commit that referenced this pull request Nov 8, 2024
* feat: implement StringColumn using StringViewArray

* fix

* convert binaryview between arrow1 and arrow2

* fix

* fix

* fix

* fix

* fix

* fix some issue

* fix view slice bug

* fix view slice bug

* fix

* support native read write

* fix

* fix

* fix tests

* add with_data_type

* add with_data_type

* fix gen_random_uuid commit row

* move record batch to block

* remove unused dep

* fix lint

* fix commit row

* fix commit row

* fix size

* fix size

* add NewBinaryColumnBuilder and NewStringColumnBulder

* fix incorrect serialize_size

* fix incorrect serialize_size

* lint

* lint

* fix tests

* use binary state

* use binary state

* update tests

* update tests

* update tests

* fix native view encoding

* fix

* [ci skip] updata kernel concat for view types

* [ci skip] improve kernels for view types

* [ci skip] only string type use string view type

* [ci skip] only string type use string view type

* fix tests

* [ci skip] fix tests

* [ci skip] fix

* fix

* use NewStringColumnBuilder

* rename NewString -> String

* fmt

* [ci skip] update tests

* optimize take

* add bench

* fix tests

* update

* improve compare

* implement compare using string view prefix

* fix

* fix

* fix

* fix-length

* disable spill

* [ci skip] add put_and_commit

* [ci skip] update

* update test

* lint

* [ci skip] add maybe gc

* fix endiness

* fix endiness

* fix

* update string compare

* update

---------

Co-authored-by: sundy-li <[email protected]>
@BohuTANG BohuTANG removed this pull request from the merge queue due to a manual request Nov 8, 2024
@BohuTANG BohuTANG merged commit f4e599f into databendlabs:main Nov 8, 2024
89 checks passed
@andylokandy andylokandy deleted the dev1 branch November 8, 2024 02:59
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
pr-feature this PR introduces a new feature to the codebase
Projects
None yet
Development

Successfully merging this pull request may close these issues.

3 participants