-
Notifications
You must be signed in to change notification settings - Fork 908
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
[DRAFT] refactor to grid stride for data_type_detection kernel #14694
[DRAFT] refactor to grid stride for data_type_detection kernel #14694
Conversation
/ok to test |
@DanialJavady96 Can you update the copyright year to 2024 to unblock the CI? |
@@ -658,6 +658,21 @@ set_source_files_properties( | |||
PROPERTIES COMPILE_DEFINITIONS "_FILE_OFFSET_BITS=64" | |||
) | |||
|
|||
set_source_files_properties( |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think you've changed anything in csv_gpu.cu
that requires CMake alterations. Was there an improperly resolved merge conflict somewhere?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Currently the grid size is chosen such that the number of threads is equal to the row count. This is the core problem we want to address. We want to enable kernels to act on tables larger than the maximum CUDA grid size (currently this is 2^31 - 1 blocks for 1D grids, and blocks have a max of 1024 threads). See #10368.
It looks like the main change so far is to use a for
loop with the grid stride, which is half the solution. The other half of the solution is to change the way that the kernel is launched so that it doesn't depend on the row count -- just the kernel itself, to maximize throughput. You may need to do some benchmarking to determine the best strategy, but I think you'll want cudaOccupancyMaxPotentialBlockSize
. Search the codebase for grid_size
and you'll see various ways to pick a grid size. It's okay to benchmark on your RTX 3080 for testing purposes, and we can validate the final code for any perf changes on a larger GPU once the PR is ready to merge.
First, compute a grid size with cudaOccupancyMaxPotentialBlockSize
(or some other approach).
cudf/cpp/src/io/csv/csv_gpu.cu
Lines 798 to 800 in f134513
// Calculate actual block count to use based on records count | |
int const block_size = csvparse_block_dim; | |
int const grid_size = (row_starts.size() + block_size - 1) / block_size; |
Second, remove this comment once it's no longer true:
cudf/cpp/src/io/csv/csv_gpu.cu
Lines 162 to 163 in f134513
* Data is processed in one row/record at a time, so the number of total | |
* threads (tid) is equal to the number of rows. |
Ah yeah, like I mentioned in my OP I did use The benchmark I included shows that at a much lower occupancy (15% vs ~75%) it is performing better and I wanted to rule out the spec differences from my 3080 having much lower memory than a V100 for example |
Ah, thanks. I read the PR description but got a little lost while reading it. I appreciate the clarification. It surprises me that a lower occupancy grid/block configuration is performing better for you. @vuule Would you have any insights on this? I don't know the CSV kernels very well. |
All good! I get tunnel visioned easily and definitely could've worded this better and more concisely. Something for me to work on 😅 |
Thank you @DanialJavady96 for studying this issue. For now we should focus on preventing thread indexing overflow and not include performance tuning. Please also see #10368 and #13910 for reference. |
Just a small draft to make sure i am not misunderstanding anything before I modify more code. Please lmk if this is not how the grid stride is to be implemented. Addresses #14066
this benchmark is with higher occupancy, using the row count as a guide for grid size
below is with static grid_size at 256, much lower occupancy at 15.7%
cudaOccupancyMaxPotentialBlockSize
returns that 128 for block size and 612 for grid size is what achieves max occupancy btw.Some findings:
On my 3080, I am noticing better performance at 15.7% occupancy than I am at 65-75% occupancy. I do not have access to a workstation GPU so I cannot profile on what would be a practical scenario for a user using cuIO. Given that this kernel is largely memory bound, it is hard for me to simulate a real scenario. I hesitate drawing conclusions from my benchmarking on my consumer gpu.
I am also a bit curious to know what kind of performance gains should be derived from transition towards a grid stride. From my understanding, the main problems that exist(and are compounding off eachother) are:
So im curious to know, would it make more sense to prioritize modernizing the CSV reader? I don't have perspective on the timeline for that to be achieved so I can't comment.
From this paper, IIUC, I see a lot of interest in transitioning towards DFAs ,and converting to columnar format for example. A lot of interesting strategies like using bitmaps store positions of different symbols, scanning that to find what col and row the symbol belongs to, and then using a stable radix sort so you can convert to columnar format. This would solve a lot of the problems I mentioned earlier because the data layout will be converted to something much easier to operate on. With that being said I imagine the DFA needs to be worked and will be a challenge to get right. Since I lack context / experience with these workflows I'm curious to know how progress is faring on that and if I could help somehow.
Checklist