[BUG] parquet reader::impl::decode_page_data error_code checking slowness #15122
Comments
@nvdbaranec @vuule fyi
That's wild. Have you maybe looked at a profile to see how we spend the additional time?
It's over many queries; I think it is due to the pageable copy (synchronization due to staging copies). I wonder if a quick prototype would be to swap the pageable copy for a pinned one and see if we recover the amount of time lost.
I can prototype something tomorrow. My concern is that we'll break even without the pinned pool due to the extra pinned allocation.
this would imply having @nvdbaranec's pinned pool #15079, so the allocation should be virtually free :)
yeah but only spark will have the pool for now ;)
…15140)

Issue #15122

The addition of kernel error checking introduced a 5% performance regression in Spark-RAPIDS. It was determined that the pageable copy of the error back to host caused this overhead, presumably because of CUDA's bounce buffer bottleneck. This PR aims to eliminate most of the error checking overhead by using `hostdevice_vector` in the `kernel_error` class. The `hostdevice_vector` uses pinned memory, so the copy is no longer pageable. The PR also removes the redundant sync after we read the error.

Authors:
- Vukasin Milovanovic (https://github.com/vuule)

Approvers:
- Mike Wilson (https://github.com/hyperbolic2346)
- Paul Mattione (https://github.com/pmattione-nvidia)

URL: #15140
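For illustration, here is a minimal sketch of the approach that PR describes, with assumed names (it is not cudf's actual `kernel_error` or `hostdevice_vector` implementation): the host-side mirror of the device error word lives in pinned (page-locked) memory, so the device-to-host copy can avoid the pageable bounce buffer, and a single stream sync before reading the value replaces the extra sync.

```cpp
// Sketch only: pinned-memory readback of a device-side error word.
// Names (d_error, h_error) are illustrative, not cudf identifiers.
#include <cuda_runtime.h>
#include <cstdint>
#include <cstdio>

int main()
{
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // Device-resident error word that decode kernels would set on failure.
  int32_t* d_error = nullptr;
  cudaMalloc(reinterpret_cast<void**>(&d_error), sizeof(int32_t));
  cudaMemsetAsync(d_error, 0, sizeof(int32_t), stream);

  // Pinned (page-locked) host mirror; the async copy below does not need a
  // pageable staging buffer because the destination is already pinned.
  int32_t* h_error = nullptr;
  cudaMallocHost(reinterpret_cast<void**>(&h_error), sizeof(int32_t));

  // ... launch decode kernels on `stream` that write to d_error on failure ...

  cudaMemcpyAsync(h_error, d_error, sizeof(int32_t), cudaMemcpyDeviceToHost, stream);
  cudaStreamSynchronize(stream);  // single sync; no redundant sync after the read

  if (*h_error != 0) { std::fprintf(stderr, "decode error: %d\n", *h_error); }

  cudaFreeHost(h_error);
  cudaFree(d_error);
  cudaStreamDestroy(stream);
  return 0;
}
```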
While benchmarking some code, we found that about 5% of the time is being lost due to this line of code:
https://github.com/rapidsai/cudf/blob/branch-24.04/cpp/src/io/parquet/reader_impl.cpp#L248
This is on our perf cluster (A100) for NDS @3k. It explains some of the dip in perf we have seen since 23.10 but haven't gotten around to testing.
If I stop obtaining the value from `error_code` (i.e., I essentially skip the pageable memcpy), we gain 20 seconds locally. I am filing this because it may be a good idea to remove this or look into how to improve it (would a pinned copy help?).
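For context, a minimal sketch of the pattern the issue describes (a hypothetical helper, not the exact cudf code at the linked line): the error word is copied into an ordinary pageable host variable, so the driver stages the transfer through its internal bounce buffer and the "async" copy effectively serializes with the stream.

```cpp
// Sketch only: pageable readback of a device-side error word.
// read_error_pageable is a hypothetical helper, not a cudf function.
#include <cuda_runtime.h>
#include <cstdint>

int32_t read_error_pageable(int32_t const* d_error, cudaStream_t stream)
{
  int32_t h_error = 0;  // pageable host memory -> staged copy through a bounce buffer
  cudaMemcpyAsync(&h_error, d_error, sizeof(int32_t), cudaMemcpyDeviceToHost, stream);
  cudaStreamSynchronize(stream);  // must sync before h_error can be read safely
  return h_error;
}
```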