[BUG] parquet reader::impl::decode_page_data error_code checking slowness #15122
Comments
@nvdbaranec @vuule fyi
That's wild. Have you maybe looked at a profile to see how we spend the additional time?
It's over many queries; I think it is due to the pageable copy (synchronization due to staging copies). I wonder if a quick prototype would be to swap the pageable copy for a pinned one and see if we recover the amount of time lost.
I can prototype something tomorrow. My concern is that we'll break even without the pinned pool due to the extra pinned allocation.
this would imply having @nvdbaranec's pinned pool #15079, so the allocation should be virtually free :)
yeah but only spark will have the pool for now ;)
…15140)

Issue #15122

The addition of kernel error checking introduced a 5% performance regression in Spark-RAPIDS. It was determined that the pageable copy of the error back to host caused this overhead, presumably because of CUDA's bounce buffer bottleneck. This PR aims to eliminate most of the error checking overhead by using `hostdevice_vector` in the `kernel_error` class. The `hostdevice_vector` uses pinned memory, so the copy is no longer pageable. The PR also removes the redundant sync after we read the error.

Authors:
- Vukasin Milovanovic (https://github.com/vuule)

Approvers:
- Mike Wilson (https://github.com/hyperbolic2346)
- Paul Mattione (https://github.com/pmattione-nvidia)

URL: #15140
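For illustration, here is a minimal sketch of the approach that PR describes, with assumed names (it is not cudf's actual `kernel_error` or `hostdevice_vector` implementation): the host-side mirror of the device error word lives in pinned (page-locked) memory, so the device-to-host copy can avoid the pageable bounce buffer, and a single stream sync before reading the value replaces the extra sync.

```cpp
// Sketch only: pinned-memory readback of a device-side error word.
// Names (d_error, h_error) are illustrative, not cudf identifiers.
#include <cuda_runtime.h>
#include <cstdint>
#include <cstdio>

int main()
{
  cudaStream_t stream;
  cudaStreamCreate(&stream);

  // Device-resident error word that decode kernels would set on failure.
  int32_t* d_error = nullptr;
  cudaMalloc(reinterpret_cast<void**>(&d_error), sizeof(int32_t));
  cudaMemsetAsync(d_error, 0, sizeof(int32_t), stream);

  // Pinned (page-locked) host mirror; the async copy below does not need a
  // pageable staging buffer because the destination is already pinned.
  int32_t* h_error = nullptr;
  cudaMallocHost(reinterpret_cast<void**>(&h_error), sizeof(int32_t));

  // ... launch decode kernels on `stream` that write to d_error on failure ...

  cudaMemcpyAsync(h_error, d_error, sizeof(int32_t), cudaMemcpyDeviceToHost, stream);
  cudaStreamSynchronize(stream);  // single sync; no redundant sync after the read

  if (*h_error != 0) { std::fprintf(stderr, "decode error: %d\n", *h_error); }

  cudaFreeHost(h_error);
  cudaFree(d_error);
  cudaStreamDestroy(stream);
  return 0;
}
```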
While benchmarking some code, we found that about 5% of the time is being lost due to this line of code:
https://github.com/rapidsai/cudf/blob/branch-24.04/cpp/src/io/parquet/reader_impl.cpp#L248
This is on our perf cluster (A100) for NDS @3k. It explains some of the dip in perf we have seen since 23.10 but haven't gotten around to testing.
If I stop obtaining the value from `error_code` (i.e., I essentially skip the pageable memcpy), we gain 20 seconds locally. I am filing this because it may be a good idea to remove this or look into how to improve it (would a pinned copy help?).
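For context, a minimal sketch of the pattern the issue describes (a hypothetical helper, not the exact cudf code at the linked line): the error word is copied into an ordinary pageable host variable, so the driver stages the transfer through its internal bounce buffer and the "async" copy effectively serializes with the stream.

```cpp
// Sketch only: pageable readback of a device-side error word.
// read_error_pageable is a hypothetical helper, not a cudf function.
#include <cuda_runtime.h>
#include <cstdint>

int32_t read_error_pageable(int32_t const* d_error, cudaStream_t stream)
{
  int32_t h_error = 0;  // pageable host memory -> staged copy through a bounce buffer
  cudaMemcpyAsync(&h_error, d_error, sizeof(int32_t), cudaMemcpyDeviceToHost, stream);
  cudaStreamSynchronize(stream);  // must sync before h_error can be read safely
  return h_error;
}
```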