feat(query): Support use parquet format when spilling #16612

Merged

6 commits merged into databendlabs:main from spill-parquet on Oct 17, 2024

Conversation

@forsaken628 (Collaborator) commented Oct 15, 2024

I hereby agree to the terms of the CLA available at: https://docs.databend.com/dev/policies/cla/

Summary

Support using the Parquet format when spilling. You can switch back to Arrow IPC via `set spilling_file_format = 'arrow'`.

Tests

  • Unit Test
  • Logic Test
  • Benchmark Test
  • No Test - Explain why

Type of change

  • Bug Fix (non-breaking change which fixes an issue)
  • New Feature (non-breaking change which adds functionality)
  • Breaking Change (fix or feature that could cause existing functionality not to work as expected)
  • Documentation Update
  • Refactoring
  • Performance Improvement
  • Other (please describe):


@github-actions bot added the pr-feature label (this PR introduces a new feature to the codebase) Oct 15, 2024
what-the-diff bot commented Oct 15, 2024

PR Summary

  • Enhanced Data Configuration
    SpillerConfig now uses the Parquet format for storing and retrieving spilled data.

  • Revamped Data Spilling Method
    Spilling (overflowing data from main memory to backup storage) now handles multiple blocks of data at once (vectorized spill) instead of one at a time, which should speed up operations involving large amounts of data.

  • Added Serialization Capabilities
    A new serialize.rs module converts data blocks to the Parquet format and back. Parquet handles large volumes of data efficiently, making spill operations faster and less resource-intensive. (A minimal round-trip sketch follows this list.)

  • Improved Data Management
    WindowPartitionBuffer, which manages spilled data partitions, has been significantly reworked for better organization of that data.

  • Flexible Spilling Option
    A new setting (spilling_use_parquet) selects between the Parquet and Arrow IPC formats for spilling, so the format can be chosen to suit the workload.

  • Optimized Data Operations
    The data-handling functions around spilling are more efficient and more clearly defined, which makes the code easier to maintain.
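To make the serialization change concrete, here is a minimal sketch of round-tripping an Arrow record batch through Parquet bytes with the Rust `arrow` and `parquet` crates. The function names and the in-memory buffer are illustrative assumptions for this sketch, not the PR's actual API, which spills to local disk or object storage.

```rust
// Hypothetical sketch: round-trip a RecordBatch through Parquet bytes.
// `spill_to_parquet` / `restore_from_parquet` are illustrative names,
// not the PR's actual API. Requires the `arrow`, `parquet` (with the
// "arrow" feature), and `bytes` crates.
use std::sync::Arc;

use arrow::array::{ArrayRef, Int64Array};
use arrow::datatypes::{DataType, Field, Schema};
use arrow::record_batch::RecordBatch;
use parquet::arrow::arrow_reader::ParquetRecordBatchReaderBuilder;
use parquet::arrow::ArrowWriter;
use parquet::errors::Result;

fn spill_to_parquet(batch: &RecordBatch) -> Result<Vec<u8>> {
    // Encode into an in-memory buffer; a real spiller would write to a
    // local file or to object storage instead.
    let mut buf = Vec::new();
    let mut writer = ArrowWriter::try_new(&mut buf, batch.schema(), None)?;
    writer.write(batch)?;
    writer.close()?; // flushes row groups and the Parquet footer
    Ok(buf)
}

fn restore_from_parquet(data: Vec<u8>) -> Result<Vec<RecordBatch>> {
    let reader =
        ParquetRecordBatchReaderBuilder::try_new(bytes::Bytes::from(data))?.build()?;
    // The reader yields batches lazily; collect them back into memory.
    reader
        .collect::<std::result::Result<Vec<_>, _>>()
        .map_err(Into::into)
}

fn main() -> Result<()> {
    let schema = Arc::new(Schema::new(vec![Field::new(
        "l_orderkey",
        DataType::Int64,
        false,
    )]));
    let column = Arc::new(Int64Array::from(vec![1, 1, 2])) as ArrayRef;
    let batch = RecordBatch::try_new(schema, vec![column])?;
    let restored = restore_from_parquet(spill_to_parquet(&batch)?)?;
    assert_eq!(restored[0], batch);
    Ok(())
}
```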

@forsaken628 mentioned this pull request Oct 15, 2024
@forsaken628 (Collaborator, Author) commented:

Benchmark:

dataset: tpch sf100

settings:

```sql
set max_memory_usage = 16*1024*1024*1024;
set window_partition_spilling_memory_ratio = 30;
set window_partition_spilling_to_disk_bytes_limit = 30*1024*1024*1024;
```

```sql
EXPLAIN ANALYZE SELECT
    l_orderkey,
    l_partkey,
    l_quantity,
    l_extendedprice,
    l_shipinstruct,
    l_shipmode,
    ROW_NUMBER() OVER (PARTITION BY l_orderkey ORDER BY l_extendedprice DESC) AS row_num,
    RANK() OVER (PARTITION BY l_orderkey ORDER BY l_extendedprice DESC) AS rank_num
FROM
    lineitem ignore_result;
```
With `set spilling_use_parquet = 0;`:

```
├── estimated rows: 600037902.00
├── cpu time: 651.285131424s
├── wait time: 168.630024827s
├── output rows: 600.04 million
├── output bytes: 44.87 GiB
├── numbers local spilled by write: 208
├── bytes local spilled by write: 15.06 GiB
├── local spilled time by write: 136.856s
├── numbers local spilled by read: 3072
├── bytes local spilled by read: 15.06 GiB
├── local spilled time by read: 31.933s
```
With `set spilling_use_parquet = 1;`:

```
├── estimated rows: 600037902.00
├── cpu time: 848.406496078s
├── wait time: 73.858260885s
├── output rows: 600.04 million
├── output bytes: 44.87 GiB
├── numbers local spilled by write: 208
├── bytes local spilled by write: 9.56 GiB
├── local spilled time by write: 55.665s
├── numbers local spilled by read: 3072
├── bytes local spilled by read: 9.56 GiB
├── local spilled time by read: 17.512s
```

Compared with Arrow IPC, Parquet's smaller file size mainly comes from dictionary encoding, but Parquet's CPU usage is considerably higher at the same time. For highly discrete (high-cardinality) data there is no significant size advantage.
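For reference, this size-versus-CPU trade-off maps to a per-writer knob in the Rust `parquet` crate: dictionary encoding can be toggled through WriterProperties. A minimal sketch under that assumption (the compression codec is illustrative, not the PR's actual configuration):

```rust
// Hypothetical sketch: toggling Parquet dictionary encoding, the knob
// behind the trade-off described above. Pass the result as `Some(props)`
// to `ArrowWriter::try_new`; not the PR's actual configuration.
use parquet::basic::Compression;
use parquet::file::properties::WriterProperties;

fn spill_writer_properties(use_dictionary: bool) -> WriterProperties {
    WriterProperties::builder()
        // Dictionary encoding shrinks repetitive, low-cardinality columns
        // (e.g. l_shipmode) at the cost of extra CPU, and buys little on
        // highly discrete columns such as l_extendedprice.
        .set_dictionary_enabled(use_dictionary)
        // Codec choice is an assumption, for illustration only.
        .set_compression(Compression::SNAPPY)
        .build()
}
```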

@forsaken628 marked this pull request as ready for review October 15, 2024 14:02
@forsaken628 added the ci-cloud label (Build docker image for cloud test) Oct 15, 2024
A Contributor commented:

Docker Image for PR

  • tag: pr-16612-3f8af35-1729002801

Note: this image tag is only available for internal use; please check the internal doc for more details.

@Dousir9 (Member) left a comment:

LGTM!

@sundy-li (Member) commented:

LGTM, needs a rebase.

@forsaken628 added this pull request to the merge queue Oct 17, 2024
@github-merge-queue bot removed this pull request from the merge queue due to failed status checks Oct 17, 2024
@sundy-li added this pull request to the merge queue Oct 17, 2024
Merged via the queue into databendlabs:main with commit 7a9a7a4 Oct 17, 2024
72 checks passed
@forsaken628 deleted the spill-parquet branch October 18, 2024 01:33
Labels: ci-cloud (Build docker image for cloud test), pr-feature (this PR introduces a new feature to the codebase)