[GLUTEN-7387][CH] Allow parallel downloading in scan operator for hive text/json table when the whole compressed (not bzip2) file is a single file split #7598

Merged

baibaichen merged 8 commits into apache:main from bigo-sg:gluten_7387

Nov 14, 2024

Contributor

taiyang-li commented Oct 18, 2024

What changes were proposed in this pull request?

Allow parallel downloading in the scan operator for Hive text/JSON tables. Currently this requires the whole file to be a single file split, which only happens when the file is compressed with a codec other than bzip2. The feature is disabled by default.
(Fixes: #7387)
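The idea of file-level parallel downloading can be illustrated with a minimal Python sketch. This is not the Gluten or ClickHouse implementation: `read_range`, `parallel_download`, and `scan_gzip_text` are hypothetical names, and an in-memory byte string stands in for the remote file. It only shows the general pattern of fetching byte ranges of one non-splittable compressed file concurrently, reassembling them in order, and then decompressing the stream sequentially.

```python
import gzip
from concurrent.futures import ThreadPoolExecutor
from typing import List


def read_range(blob: bytes, offset: int, length: int) -> bytes:
    """Hypothetical stand-in for a ranged read against remote storage."""
    return blob[offset:offset + length]


def parallel_download(blob: bytes, chunk_size: int, max_threads: int) -> bytes:
    """Fetch fixed-size byte ranges concurrently and reassemble them in order."""
    offsets = range(0, len(blob), chunk_size)
    with ThreadPoolExecutor(max_workers=max_threads) as pool:
        # Executor.map yields results in submission order, so the chunks
        # are concatenated in file order even if they finish out of order.
        chunks = pool.map(lambda off: read_range(blob, off, chunk_size), offsets)
        return b"".join(chunks)


def scan_gzip_text(blob: bytes, chunk_size: int = 1024, max_threads: int = 4) -> List[str]:
    """Download the whole compressed file in parallel, then decompress it once."""
    compressed = parallel_download(blob, chunk_size, max_threads)
    return gzip.decompress(compressed).decode("utf-8").splitlines()
```

Only the download step is parallel in this sketch; decompression still runs over the whole stream in one pass, which is why the file has to be handled as a single split.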

How was this patch tested?

Tested manually in production.

github-actions bot added the CLICKHOUSE label

github-actions bot commented Oct 18, 2024

Run Gluten Clickhouse CI


enable parallel downloading for text/json

adc92bb

taiyang-li force-pushed the gluten_7387 branch from 845eb95 to adc92bb Compare

November 12, 2024 08:11

github-actions bot commented Nov 12, 2024

Run Gluten Clickhouse CI on x86

wip

8234c82

github-actions bot commented Nov 12, 2024

Run Gluten Clickhouse CI on x86

wip

7cdf92d

github-actions bot commented Nov 12, 2024

Run Gluten Clickhouse CI on x86


finish dev

145d9cc

github-actions bot commented Nov 13, 2024

Run Gluten Clickhouse CI on x86

taiyang-li marked this pull request as ready for review

November 13, 2024 04:21

Contributor Author

taiyang-li commented Nov 13, 2024

Test results in production
Query: select count(1) from text_table where day = '2024-09-01'

Run it without parallel download (--conf spark.gluten.sql.columnar.backend.ch.runtime_settings.max_download_threads=1)

Run it with parallel download (--conf spark.gluten.sql.columnar.backend.ch.runtime_settings.max_download_threads=4)
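For scripting the two runs, the `--conf` arguments can be generated rather than typed by hand. The sketch below is an assumption about how one might wrap this; only the configuration key and the values 1 and 4 come from the comment above, and `download_threads_conf` is a hypothetical helper. The resulting pair is meant to be appended to whatever Spark launch command is already in use.

```python
from typing import List

# Configuration key quoted from the test comment in this PR.
CONF_KEY = "spark.gluten.sql.columnar.backend.ch.runtime_settings.max_download_threads"


def download_threads_conf(threads: int) -> List[str]:
    """Build the --conf argument pair for a given download thread count."""
    if threads < 1:
        raise ValueError("threads must be at least 1")
    return ["--conf", f"{CONF_KEY}={threads}"]


# Arguments for the two runs compared above.
without_parallel = download_threads_conf(1)
with_parallel = download_threads_conf(4)
```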

taiyang-li added 2 commits

November 13, 2024 14:38


update version

09fe581


update initialization of thread pool

664b20c

github-actions bot commented Nov 13, 2024

Run Gluten Clickhouse CI on x86


fix style

38eb7e0

github-actions bot commented Nov 13, 2024

Run Gluten Clickhouse CI on x86


Merge branch 'main' into gluten_7387

a971415

github-actions bot commented Nov 13, 2024

Run Gluten Clickhouse CI on x86

baibaichen added the need test label

taiyang-li changed the title ~~[GLUTEN-7387][CH] Allow parallel downloading in scan operator for hive text/json table~~ [GLUTEN-7387][CH] Allow parallel downloading in scan operator for hive text/json table when the whole file is a single file split

taiyang-li changed the title ~~[GLUTEN-7387][CH] Allow parallel downloading in scan operator for hive text/json table when the whole file is a single file split~~ [GLUTEN-7387][CH] Allow parallel downloading in scan operator for hive text/json table when the whole compressed file is a single file split

taiyang-li changed the title ~~[GLUTEN-7387][CH] Allow parallel downloading in scan operator for hive text/json table when the whole compressed file is a single file split~~ [GLUTEN-7387][CH] Allow parallel downloading in scan operator for hive text/json table when the whole compressed (not bzip2) file is a single file split

baibaichen approved these changes

baibaichen merged commit f8a2dca into apache:main

12 checks passed

Contributor

baibaichen commented Nov 14, 2024

Why isn't bzip2 supported?

Because bzip2 already supports file splits; this PR adds parallel downloading at the file level.
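The distinction between splittable and non-splittable codecs can be sketched as follows. This is an illustrative model only, not Gluten's or Spark's split planner: `plan_splits`, `eligible_for_parallel_download`, and the suffix list are assumptions chosen for the example. A bzip2 file can be cut into several splits that are read independently, while a file compressed with a non-splittable codec such as gzip becomes one split covering the whole file, which is the case this PR targets.

```python
from typing import List, Tuple

# Illustrative list of non-splittable compressed suffixes (an assumption,
# not the set of codecs the ClickHouse backend actually checks).
NON_SPLITTABLE_SUFFIXES = (".gz", ".deflate", ".zst")


def plan_splits(path: str, file_size: int, max_split_bytes: int) -> List[Tuple[int, int]]:
    """Return (offset, length) splits for a file under a simplified model."""
    if path.endswith(NON_SPLITTABLE_SUFFIXES):
        # The decompressor needs the stream from byte 0, so one split only.
        return [(0, file_size)]
    # Uncompressed and bzip2 files can be read from intermediate offsets.
    return [
        (off, min(max_split_bytes, file_size - off))
        for off in range(0, file_size, max_split_bytes)
    ]


def eligible_for_parallel_download(path: str, file_size: int, max_split_bytes: int) -> bool:
    """True when a compressed, non-bzip2 file is covered by a single split."""
    splits = plan_splits(path, file_size, max_split_bytes)
    return path.endswith(NON_SPLITTABLE_SUFFIXES) and splits == [(0, file_size)]
```

Under this model a bzip2 file already gets parallelism from its multiple splits, so file-level parallel downloading matters mainly for the single-split case.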

Labels

CLICKHOUSE need test