Performance of kokkos serial backend vs. plain serial code #297

Open
markdewing opened this issue Jan 21, 2022 · 2 comments

@markdewing (Contributor)

The run time of the kokkos serial backend (kokkos --serial) with one thread is longer than that of the standalone serial code (serial).

Looking at profiles, the function kernel_connect (in plugin-PixelTriplets) takes significantly more time in the kokkos version. From the instructions retired, it is clearly performing more operations in the kokkos version.

The loops do not perform their iterations in the same order.

// <outer loop from kokkos>
//   <kernel_connect function body>
     for (int idx = firstCellIndex, nt = (nCells()); idx < nt; idx += leagueSize * blockDim) {
       ...
       for (int j = first; j < numberOfPossibleNeighbors; j += stride) {

The serial version has no outer loop. However, printing the values of idx and j shows that the same values are accessed, just in a different order between the two versions.
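
To illustrate the ordering difference, here is a minimal, self-contained sketch (not the actual pixeltrack-standalone code; nCells, leagueSize, and blockDim are made-up values standing in for the quantities used in kernel_connect). Each team starts at its own firstCellIndex and strides by leagueSize * blockDim, so a single serial thread covers the same index set as a plain 0..nCells-1 loop, but interleaved.

    #include <cstdio>

    int main() {
      const int nCells = 10;     // small made-up value, just for illustration
      const int leagueSize = 4;  // hypothetical number of Kokkos teams
      const int blockDim = 1;    // team size is 1 for the serial backend

      // Plain serial version: one loop, indices visited in ascending order.
      std::printf("serial order: ");
      for (int idx = 0; idx < nCells; ++idx)
        std::printf("%d ", idx);
      std::printf("\n");

      // Kokkos-style version: an outer loop over teams, each team striding by
      // leagueSize * blockDim.  The set of indices is the same, but a single
      // serial thread executes them interleaved.
      std::printf("kokkos order: ");
      for (int team = 0; team < leagueSize; ++team) {
        int firstCellIndex = team * blockDim;
        for (int idx = firstCellIndex; idx < nCells; idx += leagueSize * blockDim)
          std::printf("%d ", idx);
      }
      std::printf("\n");
      return 0;
    }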

It looks like (based on instructions retired) the kokkos version is doing more work, by a factor of 2x or more, in routines like areAlignedRZ. But based on printing how many times the routine is called, the call counts should be the same.

@markdewing (Contributor, Author)

In plugin-SiPixelClusterizer, there's a difference between the serial code and Kokkos code.
In gpuClusterChargeCut.h, there is a loop that sets some variables to zero.

Kokkos version (around line 97)

         for (uint32_t i = teamMember.team_rank(); i < ::gpuClustering::MaxNumClustersPerModules;
              i += teamMember.team_size()) {
           charge(i) = 0;
           ok(i) = 0;
           newclusId(i) = 0;
         }

The serial version (around line 58)

      for (auto i = threadIdx.x; i < nclus; i += blockDim.x) {
        charge[i] = 0;
      }

Two reasons the Kokkos version is doing more work:

  • the serial version's loop bound is nclus (about 10), but the Kokkos version's loop bound is MaxNumClustersPerModules (1024).
  • the Kokkos version also sets ok and newclusId to zero. These get set in a later loop, so I'm not sure this initialization is necessary.

If the Kokkos loop is changed to match the serial loop, the performance becomes similar.
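
For reference, a minimal sketch of what that change could look like (not the actual patch; it assumes the same teamMember, charge, and nclus names used in the Kokkos gpuClusterChargeCut.h):

      // Bound the loop by nclus rather than MaxNumClustersPerModules and zero
      // only charge, mirroring the serial version.
      for (uint32_t i = teamMember.team_rank(); i < nclus; i += teamMember.team_size()) {
        charge(i) = 0;
      }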

Notes

  • The starting value (0) and increment (1) of the loops are the same for both.
  • The performance is similar at least for a sufficiently large number of events - there must be some startup overhead to Kokkos. 1000 events is sufficient: both get 27-28 events/s on my machine. With 100 events the serial code shows no drop in performance (27 events/s), whereas the kokkos serial version clearly gets lower performance (24.5 events/s).
  • I didn't check whether there is a correctness issue with the change to the Kokkos loop.

Link to Kokkos version:
https://github.com/cms-patatrack/pixeltrack-standalone/blob/master/src/kokkos/plugin-SiPixelClusterizer/kokkos/gpuClusterChargeCut.h

Link to serial version:
https://github.com/cms-patatrack/pixeltrack-standalone/blob/master/src/serial/plugin-SiPixelClusterizer/gpuClusterChargeCut.h

@makortel (Collaborator)

@markdewing Thanks, good catch. Digging through the history, both of these changes (the loop bound and the initialization of the additional variables) were made in #80, and this comment #80 (comment) probably explains the reasoning. Since the serial and cuda versions both do otherwise, I think we should do the same for the kokkos version, since these changes appear to affect performance.
