Implement cudf-polars chunked parquet reading (#16944)
This PR provides access to the libcudf chunked parquet reader through the `cudf-polars` GPU engine, inspired by the cuDF Python implementation. Closes #16818

Authors:
- https://github.com/brandon-b-miller
- GALI PREM SAGAR (https://github.com/galipremsagar)
- Lawrence Mitchell (https://github.com/wence-)

Approvers:
- Vyas Ramasubramani (https://github.com/vyasr)
- Lawrence Mitchell (https://github.com/wence-)

URL: #16944
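As a rough illustration of how the feature is exercised from the polars side (a minimal sketch, not code from this PR; the file path and column names are hypothetical placeholders):

```python
# Minimal sketch of driving the cudf-polars GPU engine from polars.
# "data.parquet", "key", and "value" are placeholder names.
import polars as pl

lf = pl.scan_parquet("data.parquet")
result = lf.group_by("key").agg(pl.col("value").sum()).collect(
    engine=pl.GPUEngine()  # chunked parquet reading is enabled by default
)
print(result)
```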
1 parent d475dca · commit aa8c0c4 · 11 changed files with 297 additions and 66 deletions.
@@ -0,0 +1,25 @@
# GPUEngine Configuration Options

The `polars.GPUEngine` object may be configured in several different ways.
## Parquet Reader Options

Reading large parquet files can use a large amount of memory, especially when the files are compressed. This may lead to out-of-memory errors for some workflows. To mitigate this, the "chunked" parquet reader may be selected. When enabled, parquet files are read in chunks, limiting the peak memory usage at the cost of a small drop in performance.

To configure the parquet reader, pass a dictionary of options via the `parquet_options` keyword of the `GPUEngine` object. Valid keys and values are:
- `chunked` indicates that chunked parquet reading is to be used. By default, chunked reading is turned on.
- [`chunk_read_limit`](https://docs.rapids.ai/api/libcudf/legacy/classcudf_1_1io_1_1chunked__parquet__reader#aad118178b7536b7966e3325ae1143a1a) controls the maximum size per chunk. By default, the maximum chunk size is unlimited.
- [`pass_read_limit`](https://docs.rapids.ai/api/libcudf/legacy/classcudf_1_1io_1_1chunked__parquet__reader#aad118178b7536b7966e3325ae1143a1a) controls the maximum memory used for decompression. The default pass read limit is 16GiB.
For example, to select the chunked reader with custom values for `pass_read_limit` and `chunk_read_limit`:
```python
from polars import GPUEngine

engine = GPUEngine(
    parquet_options={
        'chunked': True,
        'chunk_read_limit': int(1e9),
        'pass_read_limit': int(4e9)
    }
)
# `query` is a previously constructed polars LazyFrame.
result = query.collect(engine=engine)
```
Note that passing `chunked: False` disables chunked reading entirely, and thus `chunk_read_limit` and `pass_read_limit` will have no effect.
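Conversely, a short sketch of opting out of chunked reading, reusing the `GPUEngine` import and the `query` LazyFrame from the example above:

```python
# Disable chunked reading: each parquet file is read in a single pass,
# and chunk_read_limit / pass_read_limit are ignored.
engine = GPUEngine(parquet_options={'chunked': False})
result = query.collect(engine=engine)
```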