
[cudadev] Use Cooperative Groups in the Fishbone kernel #212

Draft
wants to merge 8 commits into master from the fishbone_use_cooperative_groups branch
Conversation

fwyzard
Contributor

@fwyzard fwyzard commented Aug 29, 2021

Modify the Fishbone kernel to use a 1D grid, shared memory, and parallelise the internal loops using Cooperative Groups.
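For readers not familiar with the pattern, here is a minimal, self-contained sketch of the approach described above: a 1D grid over the hits, one shared-memory buffer per hit handled by a block, and the block split into thread tiles with Cooperative Groups so the inner loops run in parallel. Everything in the sketch (the fishboneLike name, the buffer contents, the reduction) is an illustrative assumption and not the actual patch; only the hitsPerBlock, threadsPerHit, maxCellsPerHit and s_x names are taken from the diff.

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/reduce.h>

namespace cg = cooperative_groups;

// Illustrative kernel only: threadsPerHit is assumed to be a power of two no
// larger than a warp (32), and blockDim.x == hitsPerBlock * threadsPerHit.
template <int hitsPerBlock, int threadsPerHit>
__global__ void fishboneLike(float const* __restrict__ input,
                             float* __restrict__ output,
                             int nHits,
                             int cellsPerHit) {
  constexpr int maxCellsPerHit = 128;                  // illustrative buffer size
  __shared__ float s_x[hitsPerBlock][maxCellsPerHit];  // one buffer per hit handled by the block

  auto block = cg::this_thread_block();
  // split the block into tiles of threadsPerHit threads; each tile handles one hit
  auto tile = cg::tiled_partition<threadsPerHit>(block);
  int hitInBlock = block.thread_rank() / threadsPerHit;
  int firstHit = blockIdx.x * hitsPerBlock + hitInBlock;  // 1D grid over the hits

  for (int hit = firstHit; hit < nHits; hit += gridDim.x * hitsPerBlock) {
    float(&x)[maxCellsPerHit] = s_x[hitInBlock];
    // first loop: fill the per-hit buffer cooperatively, threadsPerHit elements at a time
    for (int j = tile.thread_rank(); j < cellsPerHit; j += threadsPerHit)
      x[j] = input[hit * cellsPerHit + j];
    tile.sync();
    // second loop: each tile reduces its own buffer
    float sum = 0.f;
    for (int j = tile.thread_rank(); j < cellsPerHit; j += threadsPerHit)
      sum += x[j];
    sum = cg::reduce(tile, sum, cg::plus<float>());  // CUDA 11 tile-level reduction
    if (tile.thread_rank() == 0)
      output[hit] = sum;
    tile.sync();  // keep the next iteration's writes from racing with this read
  }
}

// possible launch: blockDim.x = hitsPerBlock * threadsPerHit, 1D grid over the hits
// fishboneLike<4, 16><<<(nHits + 3) / 4, 4 * 16>>>(d_in, d_out, nHits, cellsPerHit);
```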

@fwyzard
Contributor Author

fwyzard commented Aug 29, 2021

@VinInn what do you think?

@fwyzard fwyzard marked this pull request as draft August 29, 2021 22:34
@fwyzard fwyzard force-pushed the fishbone_use_cooperative_groups branch from 91163ef to e607935 Compare August 30, 2021 16:34
@fwyzard
Contributor Author

fwyzard commented Aug 30, 2021

The performance seems to be the same for the two implementations :-/ I was hoping to gain something by parallelising the first loop.

@fwyzard fwyzard marked this pull request as ready for review August 30, 2021 16:35
@fwyzard
Contributor Author

fwyzard commented Aug 30, 2021

OK, according to nsys the kernel is a bit faster.

Running 3 times with nsys profile --export=sqlite ./cudadev --numberOfThreads 8 --numberOfStreams 8 --maxEvents 4000 I get

original:

 Time(%)  Total Time (ns)  Instances   Average   Minimum    Maximum   StdDev    Name
     5.3      465,423,675      4,000  116,355.9   19,936    375,615   43,051.5  fishbone                    
     5.2      481,179,866      4,000  120,295.0   20,352    400,831   44,165.2  fishbone                    
     5.2      481,588,194      4,000  120,397.0   21,024    385,471   45,293.1  fishbone                    

with #212, as of e607935:

 Time(%)  Total Time (ns)  Instances   Average   Minimum    Maximum   StdDev    Name
     4.5      396,360,658      4,000   99,090.2   22,336    347,423   32,068.5  fishbone                    
     4.6      385,110,561      4,000   96,277.6   17,568    320,287   30,638.5  fishbone                    
     4.5      393,259,397      4,000   98,314.8   18,912    502,622   33,137.3  fishbone                    

with #212, and adding an explicit `__launch_bounds__` to the `fishbone` kernel, as of c30ff98:

 Time(%)  Total Time (ns)  Instances   Average   Minimum    Maximum   StdDev    Name
     4.6      389,048,712      4,000   97,262.2   18,464    443,966   31,847.3  fishbone                    
     4.5      387,677,145      4,000   96,919.3   19,136    508,478   31,795.1  fishbone                    
     4.5      392,568,936      4,000   98,142.2   19,584    481,886   32,359.8  fishbone                    
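For context on the c30ff98 change above: `__launch_bounds__` tells the compiler the maximum block size at compile time so it can tune register allocation, and with a block size of hitsPerBlock * threadsPerHit the bound can be derived from the template parameters. A minimal sketch on the hypothetical fishboneLike kernel from the earlier example; the exact bound used in c30ff98 is not reproduced here:

```cuda
// Sketch only: tie the launch bound to the template parameters, so the compiler
// knows the maximum block size (hitsPerBlock * threadsPerHit) for register allocation.
template <int hitsPerBlock, int threadsPerHit>
__global__ void __launch_bounds__(hitsPerBlock * threadsPerHit)
    fishboneLike(float const* __restrict__ input, float* __restrict__ output, int nHits, int cellsPerHit);
```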

@fwyzard
Contributor Author

fwyzard commented Aug 30, 2021

To clarify: I'm asking for comments and testing here because it's simpler, but if we consider this a valid approach I'll make PRs for CMSSW first, and propagate the same changes here later.

Collaborator

@makortel makortel left a comment


Looks good to me (my comments below are rather minor). In general I found the cooperative group formalism easier to digest than the 2D grid.

I vaguely recall cooperative groups came with some constraints. Could you remind me of those? Was something related to MPS?

// __device__
// __forceinline__
template <int hitsPerBlock = 1, int threadsPerHit = 1>
__global__ void fishbone(GPUCACell::Hits const* __restrict__ hhp,
Collaborator


Given that in this PR the template arguments are given explicitly in all callers, how necessary are the default values? I'm mostly concerned about the self-documentation of the code.

Contributor Author


I added them so the CPU version could be called without specifying <1, 1>, but forgot to remove the explicit arguments from the call.

I agree that we should remove either the default or the explicit arguments in the CPU case.
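To illustrate the trade-off being discussed, a hypothetical host-side helper and call sites (not the actual code in the PR):

```cuda
// With defaults, the serial (CPU) path can omit the template arguments entirely,
// while a caller that wants self-documenting code spells them out.
template <int hitsPerBlock = 1, int threadsPerHit = 1>
void fishboneHost(/* ... */) { /* serial version: one "thread" per hit */ }

void runHost() {
  fishboneHost();        // uses <1, 1> implicitly, relies on the defaults
  fishboneHost<1, 1>();  // equivalent, but the configuration is visible at the call site
}
```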

for (int idy = firstY, nt = nHits; idy < nt; idy += gridDim.y * blockDim.y) {
  auto const& vc = isOuterHitOfCell[idy];
  // buffer used by the current thread
  float(&x)[maxCellsPerHit] = s_x[hitInBlock];
Collaborator


I'm wondering if

Suggested change
float(&x)[maxCellsPerHit] = s_x[hitInBlock];
auto& x = s_x[hitInBlock];

would be more clear.

Contributor Author


I can't say which one is clearer... your suggestion is certainly simpler to write :-)
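For the record, the two spellings are equivalent; a tiny sketch (with illustrative sizes, not taken from the PR) that checks this at compile time:

```cuda
#include <type_traits>

__global__ void spellingCheck() {
  constexpr int hitsPerBlock = 4, maxCellsPerHit = 128;  // illustrative sizes
  __shared__ float s_x[hitsPerBlock][maxCellsPerHit];
  int hitInBlock = threadIdx.x % hitsPerBlock;

  float(&x_explicit)[maxCellsPerHit] = s_x[hitInBlock];  // spells out the array type
  auto& x_deduced = s_x[hitInBlock];                     // lets the compiler deduce it
  static_assert(std::is_same<decltype(x_deduced), float(&)[maxCellsPerHit]>::value,
                "both references have the same type");
  x_explicit[0] = 0.f;  // both names refer to the same shared-memory row
  x_deduced[1] = 0.f;
}
```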

@fwyzard
Contributor Author

fwyzard commented Aug 30, 2021

I vaguely recall cooperative groups came with some constraints. Could you remind me of those? Was something related to MPS?

They used to, but things have improved with CUDA 11:

C.2. What's New in CUDA 11.0

Separate compilation is no longer required to use the grid-scoped group and synchronizing this group is now up to 30% faster. Additionally we've enabled cooperative launches on latest Windows platforms, and added support for them when running under MPS.
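For completeness, those constraints mainly concern grid-scoped groups: a kernel that calls grid.sync() has to be started with cudaLaunchCooperativeKernel, the whole grid must be resident on the device at once, and (per the quote above) such cooperative launches were not supported under MPS before CUDA 11.0. Block-level tiles like the ones used in this PR do not need a cooperative launch. An illustrative kernel and launcher, not part of this PR:

```cuda
#include <cooperative_groups.h>
#include <cuda_runtime.h>

namespace cg = cooperative_groups;

// Hypothetical kernel with a grid-wide barrier: this is the case that needs a
// cooperative launch; the per-block tiles in the fishbone kernel do not.
__global__ void gridWide(int* data) {
  cg::grid_group grid = cg::this_grid();
  data[grid.thread_rank()] = 1;  // phase 1: every thread writes its slot (data has one slot per thread)
  grid.sync();                   // grid-wide barrier, only valid under a cooperative launch
  // phase 2 can now safely read what any other block wrote in phase 1
}

void launchGridWide(int* d_data, dim3 blocks, dim3 threads, cudaStream_t stream) {
  void* args[] = {&d_data};
  // the whole grid must fit on the device simultaneously for this launch to succeed
  cudaLaunchCooperativeKernel((void*)gridWide, blocks, threads, args, 0, stream);
}
```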

@VinInn
Contributor

VinInn commented Aug 31, 2021

if faster, why not.

@fwyzard fwyzard marked this pull request as draft October 14, 2021 08:50