
[FEA] Work with CUDF to enable footer parsing of parquet files. #11954

Open
revans2 opened this issue Jan 10, 2025 · 0 comments
Assignees
Labels
feature request New feature or request

Comments

revans2 commented Jan 10, 2025

Is your feature request related to a problem? Please describe.
rapidsai/cudf#17716 is a feature in CUDF with the goal of reducing the number of times we need to parse the parquet footer, both in the worst case and in the common case.

It is a first step toward lazy decoding and lazy fetching of parquet data, which should hopefully improve many queries and reduce or eliminate the need for hybrid scan features in spark-rapids.

The idea is to first implement this for the per-file use case (probably under an experimental flag). If it works well, then we can look at expanding it to the multi-threaded and combining use cases too.

The hope is that we could fetch the footer bytes for a file (i.e. the last X MiB) and hand them to CUDF. CUDF would parse them and either return the metadata to us or tell us we need to read more, though we could also check the length ourselves before handing the bytes over. We could then modify the parsed footer metadata to do column pruning and to select only the row groups in our split. After that we would hand the metadata to CUDF, along with any predicate push down we want to run, and get back an I/O plan from CUDF for fetching the pages associated with different row groups. With this plan we could decide how we want to process the row groups, doing one or more at a time using heuristics similar to the ones we have today. Once we have fetched the data for the row groups we want to read as a group, we can grab the GPU semaphore, copy that data to the GPU, and tell CUDF to parse it into a table.
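To make the "check the length ourselves" step concrete: the parquet format ends every file with the serialized footer, a 4-byte little-endian footer length, and the 4-byte magic `PAR1`, so given a speculatively fetched tail of the file we can decide locally whether the whole footer is in hand or a larger fetch is needed, before involving CUDF at all. The helper below is only a sketch of that check; the API that would consume the bytes is still being designed.

```java
import java.nio.ByteBuffer;
import java.nio.ByteOrder;
import java.nio.charset.StandardCharsets;

public class FooterTailCheck {
    private static final byte[] MAGIC = "PAR1".getBytes(StandardCharsets.US_ASCII);

    /**
     * Given the last bytes of a parquet file, return the total number of bytes
     * (footer + 8-byte trailer) that must be fetched from the end of the file,
     * or -1 if the tail does not end with the parquet magic.
     */
    public static long footerBytesNeeded(byte[] tail) {
        if (tail.length < 8) {
            return -1;
        }
        for (int i = 0; i < 4; i++) {
            if (tail[tail.length - 4 + i] != MAGIC[i]) {
                return -1;
            }
        }
        // The 4-byte little-endian footer length sits just before the magic.
        int footerLen = ByteBuffer.wrap(tail, tail.length - 8, 4)
            .order(ByteOrder.LITTLE_ENDIAN)
            .getInt();
        return (long) footerLen + 8;
    }

    /** True if the tail we fetched already contains the whole footer. */
    public static boolean tailContainsFullFooter(byte[] tail) {
        long needed = footerBytesNeeded(tail);
        return needed > 0 && needed <= tail.length;
    }
}
```

If `tailContainsFullFooter` returns false, we would fetch `footerBytesNeeded` bytes from the end of the file and try again, avoiding a round trip through CUDF just to learn the footer was truncated.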

The hope is that the parsing API would be stateless. That way, when it is done, we can release the data associated with that row group.
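Nothing about the API shape is settled, but the benefit of statelessness can be illustrated with hypothetical types (`HostBuffer` and `Decoder` below are invented for this sketch, not existing CUDF or spark-rapids classes): because each decode call receives everything it needs, the host buffer for a row group can be closed the moment the call returns.

```java
public class StatelessDecodeSketch {
    /** Hypothetical host-side buffer holding the fetched bytes of a row group. */
    public static class HostBuffer implements AutoCloseable {
        private boolean closed = false;
        public boolean isClosed() { return closed; }
        @Override public void close() { closed = true; }
    }

    /** Hypothetical stateless decode call: all inputs are passed per invocation. */
    public interface Decoder {
        String decode(HostBuffer rowGroupBytes);
    }

    public static String decodeAndRelease(Decoder decoder, HostBuffer buf) {
        try (HostBuffer b = buf) {
            // No state is retained across calls, so nothing references the
            // buffer once decode() returns...
            return decoder.decode(b);
        } // ...and it is released here, immediately after the row group is parsed.
    }
}
```

A stateful reader, by contrast, could hold references into the buffer between calls, forcing us to keep each row group's bytes alive for the lifetime of the reader.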

In the first go-around we are not going to worry about the chunked reader. Lazy decoding combined with chunked reading is a really hard problem to solve. We are not giving up on it, but we may need to think about alternative ways of reading the data, such as providing a row range to CUDF when we read the data, if the initial read failed with an OOM or produced a table that was too large.
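The row-range fallback could look something like the sketch below. `RangeReader` and `GpuOomException` are hypothetical stand-ins invented for this illustration, not existing APIs: on an OOM (or an oversized result, which could be handled the same way), the range is split in half and each half is retried.

```java
public class RowRangeFallback {
    /** Hypothetical stand-in for a GPU out-of-memory failure. */
    public static class GpuOomException extends RuntimeException {}

    /** Hypothetical reader that decodes rows [start, start + count) of a row group. */
    public interface RangeReader {
        void read(long start, long count) throws GpuOomException;
    }

    /**
     * Read rows [start, start + count), splitting the range in half and
     * retrying whenever a read fails with a (simulated) GPU OOM.
     */
    public static void readWithFallback(RangeReader reader, long start, long count) {
        if (count <= 0) {
            return;
        }
        try {
            reader.read(start, count);
        } catch (GpuOomException e) {
            if (count == 1) {
                throw e; // a single row still does not fit; nothing left to split
            }
            long half = count / 2;
            readWithFallback(reader, start, half);
            readWithFallback(reader, start + half, count - half);
        }
    }
}
```

This is only a coping strategy, not a substitute for a real chunked reader: each retry redoes work, so it only makes sense if OOMs on the initial read are rare.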

@revans2 revans2 added ? - Needs Triage Need team to review and classify feature request New feature or request labels Jan 10, 2025
@mattahrens mattahrens removed the ? - Needs Triage Need team to review and classify label Jan 14, 2025