[GLUTEN-7387][CH] Allow parallel downloading in scan operator for hive text/json table when the whole compressed (not bzip2) file is a single file split #7598

Merged (8 commits) on Nov 14, 2024

@taiyang-li (Contributor) commented Oct 18, 2024

What changes were proposed in this pull request?

Allow parallel downloading in the scan operator for Hive text/JSON tables. Currently this requires the whole file to be a single file split, which only happens when the file is compressed with a codec other than bzip2. The feature is disabled by default.
(Fixes: #7387)
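
To illustrate the idea of downloading one whole-file split in parallel, here is a minimal Python sketch. It is not the code from this PR (the ClickHouse backend is written in C++); the function names `read_range` and `parallel_download`, the `chunk_size` parameter, and the use of a local file in place of HDFS are all assumptions made for illustration. Only `max_download_threads` mirrors a setting name that appears in this thread.

```python
import os
from concurrent.futures import ThreadPoolExecutor


def read_range(path, offset, length):
    """Read up to `length` bytes starting at `offset`, using a private file handle."""
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


def parallel_download(path, max_download_threads=4, chunk_size=1 << 20):
    """Fetch a whole file as fixed-size ranges read concurrently, reassembled in order.

    Hypothetical helper: a local file stands in for a remote HDFS file.
    """
    file_size = os.path.getsize(path)
    if max_download_threads <= 1 or file_size <= chunk_size:
        # Sequential fallback: one thread, one read.
        return read_range(path, 0, file_size)
    offsets = range(0, file_size, chunk_size)
    with ThreadPoolExecutor(max_workers=max_download_threads) as pool:
        # map() yields results in submission order, so chunks stay in file order.
        chunks = pool.map(lambda off: read_range(path, off, chunk_size), offsets)
        return b"".join(chunks)
```

Each worker opens its own handle and reads at an explicit offset, which is the same access pattern a positional read (pread) provides; the decompressor would then consume the reassembled byte stream sequentially.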

How was this patch tested?

Tested manually in production.

Run Gluten Clickhouse CI

Run Gluten Clickhouse CI on x86

@taiyang-li taiyang-li marked this pull request as ready for review November 13, 2024 04:21
@taiyang-li (Contributor, Author) commented:

Test results in production
Query: select count(1) from text_table where day = '2024-09-01'

Run without parallel download (--conf spark.gluten.sql.columnar.backend.ch.runtime_settings.max_download_threads=1):
[2 screenshots]

Run with parallel download (--conf spark.gluten.sql.columnar.backend.ch.runtime_settings.max_download_threads=4):
[2 screenshots]

@taiyang-li taiyang-li changed the title [GLUTEN-7387][CH] Allow parallel downloading in scan operator for hive text/json table [GLUTEN-7387][CH] Allow parallel downloading in scan operator for hive text/json table when the whole file is a single file split Nov 14, 2024
@taiyang-li taiyang-li changed the title [GLUTEN-7387][CH] Allow parallel downloading in scan operator for hive text/json table when the whole file is a single file split [GLUTEN-7387][CH] Allow parallel downloading in scan operator for hive text/json table when the whole compressed file is a single file split Nov 14, 2024
@taiyang-li taiyang-li changed the title [GLUTEN-7387][CH] Allow parallel downloading in scan operator for hive text/json table when the whole compressed file is a single file split [GLUTEN-7387][CH] Allow parallel downloading in scan operator for hive text/json table when the whole compressed (not bzip2) file is a single file split Nov 14, 2024
@baibaichen baibaichen merged commit f8a2dca into apache:main Nov 14, 2024
12 checks passed
@baibaichen (Contributor) commented:

Why isn't bzip2 supported?

Because bzip2 supports file splits; this PR adds parallel downloading at the file level.
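
The condition discussed in this thread (a single split covering the whole file, a non-bzip2 compressed file, and more than one download thread) can be sketched as a small predicate. This is a hedged illustration only: `SPLITTABLE_CODECS`, `is_whole_file_split`, and `can_download_in_parallel` are hypothetical names, and treating uncompressed files as splittable is an assumption, not something taken from the PR's code.

```python
# Hypothetical names throughout; this is not Gluten's actual API.
SPLITTABLE_CODECS = {"bzip2"}  # assumption: the only splittable codec considered here


def is_whole_file_split(split_start, split_length, file_size):
    """True when a single split covers the file from byte 0 to the end."""
    return split_start == 0 and split_length >= file_size


def can_download_in_parallel(split_start, split_length, file_size, codec,
                             max_download_threads):
    """Decide whether file-level parallel download applies to this split."""
    if max_download_threads <= 1:
        return False  # a value of 1 means no parallel download
    if codec is None or codec in SPLITTABLE_CODECS:
        # Assumption: splittable inputs are already parallelised across splits.
        return False
    return is_whole_file_split(split_start, split_length, file_size)
```

Under these assumptions, a gzip file read as one split with `max_download_threads=4` qualifies, while a bzip2 file does not, because it can already be divided into several splits.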

Development

Successfully merging this pull request may close these issues.

[CH] apply parallel read buffer for text/json format since libhdfs3 supports pread now