
[FEA] Work with CUDF to enable footer parsing of parquet files. #11954

Open
revans2 opened this issue Jan 10, 2025 · 0 comments
Assignees
Labels
feature request New feature or request

Comments

revans2 commented Jan 10, 2025

Is your feature request related to a problem? Please describe.
rapidsai/cudf#17716 is a feature in CUDF with the goal of reducing the number of times we need to parse the parquet footer, both in the worst case and in the common case.

It is a first step toward lazy decoding and lazy fetching of parquet data, which should hopefully improve many queries and reduce or eliminate the need for hybrid scan features in spark-rapids.

The idea is to first implement this for the per-file use case (probably under an experimental flag). If it works well, then we can look at expanding it to the multi-threaded and combining use cases too.

The hope is that we could fetch the footer bytes for a file (i.e. the last X MiB) and hand them to CUDF. CUDF would parse them and either return the metadata to us or tell us we need to read more, though we could also check the length ourselves before handing the bytes over. We could then modify the parsed footer metadata to do column pruning and to select only the row groups in our split. After that we would hand the metadata to CUDF, along with any predicate push down we want to run, and get back an I/O plan from CUDF for fetching the pages associated with different row groups. With this plan we could decide how we want to process the row groups, doing one or more at a time using heuristics similar to the ones we have today. Once we have fetched the data for the row groups we want to read as a group, we can grab the GPU semaphore, copy that data to the GPU, and tell CUDF to parse it into a table.
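To make the "check the length ourselves" step concrete: the parquet format ends every file with the serialized footer, a 4-byte little-endian footer length, and the 4-byte magic `PAR1`, so given a speculatively fetched tail of the file we can decide locally whether the whole footer is in hand or a larger fetch is needed, before involving CUDF at all. The helper below is only a sketch of that check; the API that would consume the bytes is still being designed.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

public class FooterTailCheck {
    private static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);

    /**
     * Given the last bytes of a parquet file, return the total number of bytes
     * (footer + 8-byte trailer) that must be fetched from the end of the file,
     * or -1 if the tail does not end with the parquet magic.
     */
    public static long footerBytesNeeded(byte[] tail) {
        if (tail.length < 8) {
            return -1;
        }
        for (int i = 0; i < 4; i++) {
            if (tail[tail.length - 4 + i] != MAGIC[i]) {
                return -1;
            }
        }
        // The 4-byte little-endian footer length sits just before the magic.
        int footerLen = ByteBuffer.wrap(tail, tail.length - 8, 4)
            .order(ByteOrder.LITTLE_ENDIAN)
            .getInt();
        return (long) footerLen + 8;
    }

    /** True if the tail we fetched already contains the whole footer. */
    public static boolean tailContainsFullFooter(byte[] tail) {
        long needed = footerBytesNeeded(tail);
        return needed > 0 && needed <= tail.length;
    }
}
```

If `tailContainsFullFooter` returns false, we would fetch `footerBytesNeeded` bytes from the end of the file and try again, avoiding a round trip through CUDF just to learn the footer was truncated.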

The hope is that the parsing API would be stateless. That way, when it is done, we can release the data associated with that row group.
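Nothing about the API shape is settled, but the benefit of statelessness can be illustrated with hypothetical types (`HostBuffer` and `Decoder` below are invented for this sketch, not existing CUDF or spark-rapids classes): because each decode call receives everything it needs, the host buffer for a row group can be closed the moment the call returns.

```java
public class StatelessDecodeSketch {
    /** Hypothetical host-side buffer holding the fetched bytes of a row group. */
    public static class HostBuffer implements AutoCloseable {
        private boolean closed = false;
        public boolean isClosed() { return closed; }
        @Override public void close() { closed = true; }
    }

    /** Hypothetical stateless decode call: all inputs are passed per invocation. */
    public interface Decoder {
        String decode(HostBuffer rowGroupBytes);
    }

    public static String decodeAndRelease(Decoder decoder, HostBuffer buf) {
        try (HostBuffer b = buf) {
            // No state is retained across calls, so nothing references the
            // buffer once decode() returns...
            return decoder.decode(b);
        } // ...and it is released here, immediately after the row group is parsed.
    }
}
```

A stateful reader, by contrast, could hold references into the buffer between calls, forcing us to keep each row group's bytes alive for the lifetime of the reader.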

In the first go-around we are not going to worry about the chunked reader. Lazy decoding combined with chunked reading is a really hard problem to solve. We are not giving up on it, but we may need to think about alternative ways of reading the data, such as providing a row range to CUDF when we read the data, if the initial read failed with an OOM or produced a table that was too large.
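The row-range fallback could look something like the sketch below. `RangeReader` and `GpuOomException` are hypothetical stand-ins invented for this illustration, not existing APIs: on an OOM (or an oversized result, which could be handled the same way), the range is split in half and each half is retried.

```java
public class RowRangeFallback {
    /** Hypothetical stand-in for a GPU out-of-memory failure. */
    public static class GpuOomException extends RuntimeException {}

    /** Hypothetical reader that decodes rows [start, start + count) of a row group. */
    public interface RangeReader {
        void read(long start, long count) throws GpuOomException;
    }

    /**
     * Read rows [start, start + count), splitting the range in half and
     * retrying whenever a read fails with a (simulated) GPU OOM.
     */
    public static void readWithFallback(RangeReader reader, long start, long count) {
        if (count <= 0) {
            return;
        }
        try {
            reader.read(start, count);
        } catch (GpuOomException e) {
            if (count == 1) {
                throw e; // a single row still does not fit; nothing left to split
            }
            long half = count / 2;
            readWithFallback(reader, start, half);
            readWithFallback(reader, start + half, count - half);
        }
    }
}
```

This is only a coping strategy, not a substitute for a real chunked reader: each retry redoes work, so it only makes sense if OOMs on the initial read are rare.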

@revans2 revans2 added ? - Needs Triage Need team to review and classify feature request New feature or request labels Jan 10, 2025
@mattahrens mattahrens removed the ? - Needs Triage Need team to review and classify label Jan 14, 2025