
[cudadev] Use Cooperative Groups in the Fishbone kernel #212

Draft
wants to merge 8 commits into master from the fishbone_use_cooperative_groups branch
Conversation

fwyzard
Contributor

@fwyzard fwyzard commented Aug 29, 2021

Modify the Fishbone kernel to use a 1D grid, shared memory, and parallelise the internal loops using Cooperative Groups.
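For readers not familiar with the pattern, here is a minimal, self-contained sketch of the approach described above: a 1D grid over the hits, one shared-memory buffer per hit handled by a block, and the block split into thread tiles with Cooperative Groups so the inner loops run in parallel. Everything in the sketch (the fishboneLike name, the buffer contents, the reduction) is an illustrative assumption and not the actual patch; only the hitsPerBlock, threadsPerHit, maxCellsPerHit and s_x names are taken from the diff.

```cuda
#include <cooperative_groups.h>
#include <cooperative_groups/reduce.h>

namespace cg = cooperative_groups;

// Illustrative kernel only: threadsPerHit is assumed to be a power of two no
// larger than a warp (32), and blockDim.x == hitsPerBlock * threadsPerHit.
template <int hitsPerBlock, int threadsPerHit>
__global__ void fishboneLike(float const* __restrict__ input,
                             float* __restrict__ output,
                             int nHits,
                             int cellsPerHit) {
  constexpr int maxCellsPerHit = 128;                  // illustrative buffer size
  __shared__ float s_x[hitsPerBlock][maxCellsPerHit];  // one buffer per hit handled by the block

  auto block = cg::this_thread_block();
  // split the block into tiles of threadsPerHit threads; each tile handles one hit
  auto tile = cg::tiled_partition<threadsPerHit>(block);
  int hitInBlock = block.thread_rank() / threadsPerHit;
  int firstHit = blockIdx.x * hitsPerBlock + hitInBlock;  // 1D grid over the hits

  for (int hit = firstHit; hit < nHits; hit += gridDim.x * hitsPerBlock) {
    float(&x)[maxCellsPerHit] = s_x[hitInBlock];
    // first loop: fill the per-hit buffer cooperatively, threadsPerHit elements at a time
    for (int j = tile.thread_rank(); j < cellsPerHit; j += threadsPerHit)
      x[j] = input[hit * cellsPerHit + j];
    tile.sync();
    // second loop: each tile reduces its own buffer
    float sum = 0.f;
    for (int j = tile.thread_rank(); j < cellsPerHit; j += threadsPerHit)
      sum += x[j];
    sum = cg::reduce(tile, sum, cg::plus<float>());  // CUDA 11 tile-level reduction
    if (tile.thread_rank() == 0)
      output[hit] = sum;
    tile.sync();  // keep the next iteration's writes from racing with this read
  }
}

// possible launch: blockDim.x = hitsPerBlock * threadsPerHit, 1D grid over the hits
// fishboneLike<4, 16><<<(nHits + 3) / 4, 4 * 16>>>(d_in, d_out, nHits, cellsPerHit);
```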

@fwyzard
Contributor Author

fwyzard commented Aug 29, 2021

@VinInn what do you think?

@fwyzard fwyzard marked this pull request as draft August 29, 2021 22:34
@fwyzard fwyzard force-pushed the fishbone_use_cooperative_groups branch from 91163ef to e607935 Compare August 30, 2021 16:34
@fwyzard
Contributor Author

fwyzard commented Aug 30, 2021

The performance seems to be the same for the two implementations :-/ I was hoping to gain something by parallelising the first loop.

@fwyzard fwyzard marked this pull request as ready for review August 30, 2021 16:35
@fwyzard
Contributor Author

fwyzard commented Aug 30, 2021

OK, according to nsys the kernel is a bit faster.

Running 3 times with nsys profile --export=sqlite ./cudadev --numberOfThreads 8 --numberOfStreams 8 --maxEvents 4000 I get

original:

 Time(%)  Total Time (ns)  Instances   Average   Minimum    Maximum   StdDev    Name
     5.3      465,423,675      4,000  116,355.9   19,936    375,615   43,051.5  fishbone                    
     5.2      481,179,866      4,000  120,295.0   20,352    400,831   44,165.2  fishbone                    
     5.2      481,588,194      4,000  120,397.0   21,024    385,471   45,293.1  fishbone                    

with #212, as of e607935:

 Time(%)  Total Time (ns)  Instances   Average   Minimum    Maximum   StdDev    Name
     4.5      396,360,658      4,000   99,090.2   22,336    347,423   32,068.5  fishbone                    
     4.6      385,110,561      4,000   96,277.6   17,568    320,287   30,638.5  fishbone                    
     4.5      393,259,397      4,000   98,314.8   18,912    502,622   33,137.3  fishbone                    

with #212, and adding an explicit `__launch_bounds__` to the `fishbone` kernel, as of c30ff98:

 Time(%)  Total Time (ns)  Instances   Average   Minimum    Maximum   StdDev    Name
     4.6      389,048,712      4,000   97,262.2   18,464    443,966   31,847.3  fishbone                    
     4.5      387,677,145      4,000   96,919.3   19,136    508,478   31,795.1  fishbone                    
     4.5      392,568,936      4,000   98,142.2   19,584    481,886   32,359.8  fishbone                    
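For context on the c30ff98 change above: `__launch_bounds__` tells the compiler the maximum block size at compile time so it can tune register allocation, and with a block size of hitsPerBlock * threadsPerHit the bound can be derived from the template parameters. A minimal sketch on the hypothetical fishboneLike kernel from the earlier example; the exact bound used in c30ff98 is not reproduced here:

```cuda
// Sketch only: tie the launch bound to the template parameters, so the compiler
// knows the maximum block size (hitsPerBlock * threadsPerHit) for register allocation.
template <int hitsPerBlock, int threadsPerHit>
__global__ void __launch_bounds__(hitsPerBlock * threadsPerHit)
    fishboneLike(float const* __restrict__ input, float* __restrict__ output, int nHits, int cellsPerHit);
```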

@fwyzard
Contributor Author

fwyzard commented Aug 30, 2021

To clarify: I'm asking for comments and testing here because it's simpler, but if we consider this a valid approach I'll make PRs for CMSSW first, and propagate the same changes here later.

Collaborator

@makortel makortel left a comment


Looks good to me (my comments below are rather minor). In general I found the cooperative group formalism easier to digest than the 2D grid.

I vaguely recall cooperative groups came with some constraints. Could you remind me of those? Was something related to MPS?

// __device__
// __forceinline__
template <int hitsPerBlock = 1, int threadsPerHit = 1>
__global__ void fishbone(GPUCACell::Hits const* __restrict__ hhp,
Collaborator


Given that in this PR the template arguments are given explicitly in all callers, how necessary are the default values? I'm mostly concerned about the self-documentation of the code.

Contributor Author


I added them so the CPU version could be called without specifying <1, 1>, but forgot to remove the explicit arguments from the call.

I agree that we should remove either the default or the explicit arguments in the CPU case.
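To illustrate the trade-off being discussed, a hypothetical host-side helper and call sites (not the actual code in the PR):

```cuda
// With defaults, the serial (CPU) path can omit the template arguments entirely,
// while a caller that wants self-documenting code spells them out.
template <int hitsPerBlock = 1, int threadsPerHit = 1>
void fishboneHost(/* ... */) { /* serial version: one "thread" per hit */ }

void runHost() {
  fishboneHost();        // uses <1, 1> implicitly, relies on the defaults
  fishboneHost<1, 1>();  // equivalent, but the configuration is visible at the call site
}
```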

for (int idy = firstY, nt = nHits; idy < nt; idy += gridDim.y * blockDim.y) {
  auto const& vc = isOuterHitOfCell[idy];
  // buffer used by the current thread
  float(&x)[maxCellsPerHit] = s_x[hitInBlock];
Collaborator


I'm wondering if

Suggested change
float(&x)[maxCellsPerHit] = s_x[hitInBlock];
auto& x = s_x[hitInBlock];

would be more clear.

Contributor Author


I can't say which one is clearer... your suggestion is certainly simpler to write :-)
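For the record, the two spellings are equivalent; a tiny sketch (with illustrative sizes, not taken from the PR) that checks this at compile time:

```cuda
#include <type_traits>

__global__ void spellingCheck() {
  constexpr int hitsPerBlock = 4, maxCellsPerHit = 128;  // illustrative sizes
  __shared__ float s_x[hitsPerBlock][maxCellsPerHit];
  int hitInBlock = threadIdx.x % hitsPerBlock;

  float(&x_explicit)[maxCellsPerHit] = s_x[hitInBlock];  // spells out the array type
  auto& x_deduced = s_x[hitInBlock];                     // lets the compiler deduce it
  static_assert(std::is_same<decltype(x_deduced), float(&)[maxCellsPerHit]>::value,
                "both references have the same type");
  x_explicit[0] = 0.f;  // both names refer to the same shared-memory row
  x_deduced[1] = 0.f;
}
```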

@fwyzard
Contributor Author

fwyzard commented Aug 30, 2021

I vaguely recall cooperative groups came with some constraints. Could you remind me of those? Was something related to MPS?

They used to, but things have improved with CUDA 11:

C.2. What's New in CUDA 11.0

Separate compilation is no longer required to use the grid-scoped group and synchronizing this group is now up to 30% faster. Additionally we've enabled cooperative launches on latest Windows platforms, and added support for them when running under MPS.
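For completeness, those constraints mainly concern grid-scoped groups: a kernel that calls grid.sync() has to be started with cudaLaunchCooperativeKernel, the whole grid must be resident on the device at once, and (per the quote above) such cooperative launches were not supported under MPS before CUDA 11.0. Block-level tiles like the ones used in this PR do not need a cooperative launch. An illustrative kernel and launcher, not part of this PR:

```cuda
#include <cooperative_groups.h>
#include <cuda_runtime.h>

namespace cg = cooperative_groups;

// Hypothetical kernel with a grid-wide barrier: this is the case that needs a
// cooperative launch; the per-block tiles in the fishbone kernel do not.
__global__ void gridWide(int* data) {
  cg::grid_group grid = cg::this_grid();
  data[grid.thread_rank()] = 1;  // phase 1: every thread writes its slot (data has one slot per thread)
  grid.sync();                   // grid-wide barrier, only valid under a cooperative launch
  // phase 2 can now safely read what any other block wrote in phase 1
}

void launchGridWide(int* d_data, dim3 blocks, dim3 threads, cudaStream_t stream) {
  void* args[] = {&d_data};
  // the whole grid must fit on the device simultaneously for this launch to succeed
  cudaLaunchCooperativeKernel((void*)gridWide, blocks, threads, args, 0, stream);
}
```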

@VinInn
Contributor

VinInn commented Aug 31, 2021

if faster, why not.

@fwyzard fwyzard marked this pull request as draft October 14, 2021 08:50