[cudadev] Use Cooperative Groups in the Fishbone kernel #212
Conversation
@VinInn what do you think?
Force-pushed from 91163ef to e607935.
The performance seems to be the same with the two implementations :-/ while I was hoping to gain something by parallelising the first loop.
OK. Running 3 times with the original:

with #212, as of e607935:

with #212, and adding explicit launch_bounds:
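To illustrate what "adding explicit launch_bounds" refers to, a minimal sketch (the kernel name, body, and values are placeholders, not the PR's actual code): the __launch_bounds__ qualifier promises the compiler an upper bound on the block size, which lets it allocate registers more aggressively.

template <int hitsPerBlock, int threadsPerHit>
__global__ void __launch_bounds__(hitsPerBlock * threadsPerHit)
    scaleKernel(float* data, int n) {
  // each thread scales one element; the bound tells the compiler the block
  // will never exceed hitsPerBlock * threadsPerHit threads
  int i = blockIdx.x * blockDim.x + threadIdx.x;
  if (i < n)
    data[i] *= 2.f;
}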
To clarify: I'm asking for comments and testing here because it's simpler, but if we consider this a valid approach I'll make PRs for CMSSW first and propagate the same changes here later.
Looks good to me (my comments below are rather minor). In general I found the cooperative group formalism easier to digest than the 2D grid.
I vaguely recall that cooperative groups came with some constraints. Could you remind me of those? Was it something related to MPS?
// __device__
// __forceinline__
template <int hitsPerBlock = 1, int threadsPerHit = 1>
__global__ void fishbone(GPUCACell::Hits const* __restrict__ hhp,
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Given that in this PR the template arguments are given explicitly by all callers, how necessary are the default values? I'm mostly concerned about the self-documentation of the code.
I added them so the CPU version could be called without specifying <1, 1>, but forgot to remove them from the call.
I agree that we should remove either the default or the explicit arguments in the CPU case.
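For context, a simplified sketch of the pattern under discussion (function names here are placeholders): the defaulted template parameters let the serial CPU path call the function without spelling out <1, 1>.

template <int hitsPerBlock = 1, int threadsPerHit = 1>
void fishboneImpl(int nHits) {
  // hitsPerBlock and threadsPerHit select the parallel decomposition;
  // with <1, 1> the loops degenerate to a plain serial traversal
}

void cpuPath(int nHits) {
  fishboneImpl(nHits);         // defaults apply: <1, 1>
}

void gpuStylePath(int nHits) {
  fishboneImpl<8, 16>(nHits);  // explicit decomposition, as on the GPU
}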
for (int idy = firstY, nt = nHits; idy < nt; idy += gridDim.y * blockDim.y) {
  auto const& vc = isOuterHitOfCell[idy];
  // buffer used by the current thread
  float(&x)[maxCellsPerHit] = s_x[hitInBlock];
I'm wondering if replacing

float(&x)[maxCellsPerHit] = s_x[hitInBlock];

with

auto& x = s_x[hitInBlock];

would be more clear.
I can't say which one is clearer... your suggestion is certainly simpler to write :-)
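For reference, a standalone illustration of the two spellings (the buffer shape here is a placeholder, not the actual values): both bind a name to the same row of the shared-memory array, so they are interchangeable.

constexpr int maxCellsPerHit = 128;

__global__ void refDemo() {
  __shared__ float s_x[4][maxCellsPerHit];
  int hitInBlock = threadIdx.y;

  // explicit reference-to-array: the type spells out the extent
  float(&x)[maxCellsPerHit] = s_x[hitInBlock];
  // deduced form: binds to exactly the same float[maxCellsPerHit]
  auto& y = s_x[hitInBlock];

  x[0] = 1.f;  // both names alias the same shared-memory row
  y[1] = 2.f;
}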
They used to, but things have improved with CUDA 11.
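For context (my understanding, not quoted from the thread): block-scoped cooperative groups, as used in this PR, need no special launch mode. The historical restrictions, including the MPS one mentioned above, concerned grid-wide synchronisation, which requires a cooperative launch; support for that can be queried at runtime.

#include <cuda_runtime.h>

// Sketch: query whether the device supports cooperative kernel launches
// (needed only for grid-wide sync, not for block-scoped tiles)
bool coopLaunchSupported(int device) {
  int supported = 0;
  cudaDeviceGetAttribute(&supported, cudaDevAttrCooperativeLaunch, device);
  return supported != 0;
}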
If faster, why not.
Modify the Fishbone kernel to use a 1D grid, shared memory, and parallelise the internal loops using Cooperative Groups.
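A structural sketch of that approach, with illustrative names, sizes, and a placeholder body (the real fishbone kernel is more involved): each block handles hitsPerBlock hits, each hit gets a tile of threadsPerHit threads, and the tile cooperatively fills a per-hit shared-memory buffer and then shares the work of the inner loops.

#include <cooperative_groups.h>
namespace cg = cooperative_groups;

constexpr int maxCellsPerHit = 128;

template <int hitsPerBlock, int threadsPerHit>
__global__ void fishboneSketch(float const* __restrict__ input, int nHits) {
  // portable static tiles must be a power of two, at most 32 threads
  static_assert(threadsPerHit <= 32 && (threadsPerHit & (threadsPerHit - 1)) == 0);

  // one shared buffer per hit handled by this block
  __shared__ float s_x[hitsPerBlock][maxCellsPerHit];

  auto block = cg::this_thread_block();
  // partition the block into tiles of threadsPerHit threads, one per hit
  auto tile = cg::tiled_partition<threadsPerHit>(block);
  int hitInBlock = block.thread_rank() / threadsPerHit;

  // 1D grid: blocks stride over the hits
  for (int hit = blockIdx.x * hitsPerBlock + hitInBlock; hit < nHits;
       hit += gridDim.x * hitsPerBlock) {
    // first loop: the tile cooperatively fills the per-hit buffer
    for (int i = tile.thread_rank(); i < maxCellsPerHit; i += tile.size()) {
      s_x[hitInBlock][i] = input[hit] + i;  // placeholder load
    }
    tile.sync();
    // second loop: the tile shares the pairwise work on the buffer
    for (int i = tile.thread_rank(); i < maxCellsPerHit; i += tile.size()) {
      float dx = s_x[hitInBlock][i] - s_x[hitInBlock][0];  // placeholder work
      (void)dx;
    }
    tile.sync();
  }
}

Launched with hitsPerBlock * threadsPerHit threads per block, e.g. fishboneSketch<4, 16><<<nBlocks, 64>>>(d_input, nHits).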