Is your feature request related to a problem? Please describe.
rapidsai/cudf#17716 is a feature in CUDF with the goal of reducing the number of times we need to parse the parquet footer, both in the worst case and in the common case.
It is a first step toward being able to do lazy decoding and lazy fetching of parquet data, which should improve many queries and reduce or eliminate the need for hybrid scan features in spark-rapids.
The idea is to first implement this for the per-file use case (probably under an experimental flag). If it works well, then we can look at expanding it to the multi-threaded and combining use cases too.
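For the experimental flag, something like the sketch below would be the general shape; the config key name here is purely hypothetical and not an existing spark-rapids setting.

```scala
import org.apache.spark.sql.SparkSession

object ExperimentalFooterFlag {
  // Hypothetical key; not an existing spark-rapids config, shown only to illustrate the shape.
  val CudfFooterParsingKey = "spark.rapids.sql.format.parquet.cudfFooterParsing.enabled"

  def isEnabled(spark: SparkSession): Boolean =
    spark.conf.get(CudfFooterParsingKey, "false").toBoolean
}
```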
The hope is that we could fetch the footer bytes for a file (i.e., the last X MiB) and hand them to CUDF. CUDF would parse them and either return the metadata to us or tell us we need to read more (we could also check the footer length ourselves before handing the bytes over). We could then modify the parsed footer metadata to do column pruning and to select only the row groups in our split. After that we would hand the metadata to CUDF, along with any predicate push-down we want to run, and get back an I/O plan from CUDF describing the pages to fetch for the different row groups. With this plan we could decide how we want to process the row groups, doing one or more row groups at a time using heuristics similar to the ones we have today. Once we have fetched the data for the row groups we want to read as a group, we can grab the GPU semaphore, copy that data to the GPU, and tell CUDF to parse it into a table.
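To make the proposed flow concrete, here is a rough Scala sketch. None of the CUDF entry points in it exist today: `FooterMetadata`, `IoPlan`, `CudfParquet.parseFooter`, `buildIoPlan`, and `decode` are placeholders for whatever API rapidsai/cudf#17716 ends up exposing, `SplitFile` and `acquireGpuSemaphore` are stand-ins for plugin-side plumbing, and error handling is simplified.

```scala
import ai.rapids.cudf.{HostMemoryBuffer, Table}

// Placeholder shapes for the proposed CUDF footer API; every name below is hypothetical.
trait FooterMetadata {
  def pruneColumns(columns: Seq[String]): FooterMetadata
  def selectRowGroups(rowGroupIndexes: Seq[Int]): FooterMetadata
}
sealed trait FooterParseResult
case class ParsedFooter(meta: FooterMetadata) extends FooterParseResult
case class NeedMoreBytes(totalFooterSize: Long) extends FooterParseResult

case class IoRange(offset: Long, length: Long)
trait IoPlan { def ranges: Seq[IoRange] }

// Stand-in for whatever JNI entry points rapidsai/cudf#17716 ends up exposing.
object CudfParquet {
  def parseFooter(footerTail: HostMemoryBuffer): FooterParseResult = ???
  def buildIoPlan(meta: FooterMetadata, predicate: Option[AnyRef]): IoPlan = ???
  def decode(meta: FooterMetadata, pages: HostMemoryBuffer): Table = ??? // H2D copy + decode
}

// Plugin-side plumbing, also simplified stand-ins.
trait SplitFile {
  def readTail(bytes: Long): HostMemoryBuffer            // last `bytes` of the file
  def readRanges(ranges: Seq[IoRange]): HostMemoryBuffer // coalesced page data
}

object ProposedParquetFlow {
  private def acquireGpuSemaphore(): Unit = ??? // simplified; the real semaphore is per-task

  // 1. Fetch the tail of the file and let CUDF parse it, re-reading if the guess was too small.
  def parseFooter(file: SplitFile): (HostMemoryBuffer, FooterMetadata) = {
    val guess = file.readTail(8L * 1024 * 1024)
    CudfParquet.parseFooter(guess) match {
      case ParsedFooter(m) => (guess, m)
      case NeedMoreBytes(actualSize) =>
        guess.close()
        val full = file.readTail(actualSize)
        CudfParquet.parseFooter(full) match {
          case ParsedFooter(m) => (full, m)
          case _ => throw new IllegalStateException("footer still incomplete after second read")
        }
    }
  }

  def readSplit(file: SplitFile, columns: Seq[String], rowGroups: Seq[Int]): Table = {
    val (tail, meta) = parseFooter(file)
    try {
      // 2. Prune columns and keep only the row groups that belong to this split.
      val pruned = meta.pruneColumns(columns).selectRowGroups(rowGroups)

      // 3. Ask CUDF for an I/O plan and fetch exactly the pages it asks for.
      val plan = CudfParquet.buildIoPlan(pruned, predicate = None)
      val pages = file.readRanges(plan.ranges)
      try {
        // 4. Only now take the GPU semaphore, copy the data over, and decode into a table.
        acquireGpuSemaphore()
        CudfParquet.decode(pruned, pages)
      } finally {
        pages.close() // the decode call is stateless, so the host pages can be freed right away
      }
    } finally {
      tail.close()
    }
  }
}
```

The important property is the ordering: the GPU semaphore is only taken after all of the host-side I/O for the batch has completed.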
The hope is that the parsing API would be stateless, so that once it is done we can release the data associated with that row group.
In the first go-around we are not going to worry about the chunked reader. Combining lazy decoding with chunked reading is a really hard problem to solve. We are not giving up on it, but we may need to think about alternative ways of reading the data, such as providing a row range to CUDF when we read the data, if the initial read fails with an OOM or we get back a table that is too large.
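As a hedged illustration of that fallback, the sketch below (reusing the placeholder types from the earlier sketch) retries with progressively smaller row ranges when the full decode is too large. The `decodeRowRange` call is hypothetical; CUDF does not expose it today, and the real OOM signal would likely be an RMM exception rather than a JVM `OutOfMemoryError`.

```scala
import ai.rapids.cudf.{HostMemoryBuffer, Table}

object RowRangeFallback {
  // Hypothetical extension of the placeholder API above: decode only rows
  // [startRow, startRow + numRows) out of the already-fetched page data.
  def decodeRowRange(meta: FooterMetadata, pages: HostMemoryBuffer,
      startRow: Long, numRows: Long): Table = ???

  // Try to decode everything at once; if the result is too large, split the row
  // range in half and decode the pieces separately.
  def decodeWithRowRangeFallback(meta: FooterMetadata, pages: HostMemoryBuffer,
      totalRows: Long): Seq[Table] = {
    def decodeRange(start: Long, count: Long): Seq[Table] =
      try {
        Seq(decodeRowRange(meta, pages, start, count))
      } catch {
        // The real failure would likely surface as an RMM out-of-memory exception;
        // OutOfMemoryError is just a stand-in here.
        case _: OutOfMemoryError if count > 1 =>
          val half = count / 2
          decodeRange(start, half) ++ decodeRange(start + half, count - half)
      }
    decodeRange(0L, totalRows)
  }
}
```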