Add AlpakaCore/HistoContainer.h + HistoContainer_t, OneHistoContainer_t and OneToManyAssoc_t tests #165

ghugo83 · 2021-01-27T12:04:26Z

Runs smoothly, and results are consistent with the 'pure CUDA' versions.


makortel · 2021-01-27T14:29:28Z

src/alpaka/AlpakaCore/HistoContainer.h

+#include "AlpakaCore/alpakastdAlgorithm.h"
+#include "AlpakaCore/prefixScan.h"
+
+using namespace ALPAKA_ACCELERATOR_NAMESPACE;


We should not do using namespace ... in headers.

Which reminds me that one using namespace alpaka_common; slipped through in #160, apparently I even forgot to comment about it.

Ha yes will remove

makortel · 2021-01-27T15:22:07Z

src/alpaka/AlpakaCore/HistoContainer.h

+            assert(ih < int(nh));
+            h->count(acc, v[i], ih);
+          }
+          endElementIdx += gridDimension;


Just a thought, I'd find it a bit clearer if this increment were moved into the outer for-loop statement:

for (uint32_t threadIdx = firstElementIdxNoStride[0u]; threadIdx < nt; threadIdx += gridDimension, endElementIdx += gridDimension) {

I was very confused at first, partially because I was missing this increment of endElementIdx.

Yes sure. Anyway will probably need to add an additional helper, as the present one is not adapted for the += gridDimension stride.

makortel · 2021-01-27T15:37:31Z

src/alpaka/AlpakaCore/HistoContainer.h

+
+        for (uint32_t i = firstElementIdxGlobal[0u]; i < endElementIdxGlobal[0u]; ++i) {
+          h->off[i] = 0;
+        }


The CUDA implementation uses memset. I suppose we could use alpaka::mem::view::set, but would that work only with buffers from alpaka::mem::buf::alloc() (i.e. not with bare pointers)?

Yes, at first I had just tried an alpaka::mem::view::set on the raw pointer (equivalent to the poff in the CUDA version), but that was not working. There is also the additional complication that we do not want to set h to null, but only h->off.
Since ideally we want to keep the same interface as for CUDA (passing around a pointer to Histo as argument of fillManyFromVector), this was the only way to access h->off: on the device, which in a way, makes sense.
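To illustrate the "zero only h->off, not all of h" point, here is a minimal host-only sketch. It assumes a hypothetical `ToyHisto` struct (not the real HistoContainer layout) and uses `std::memset` where the CUDA version would use a device memset; it does not show the Alpaka call, which per the comment above did not work on a raw pointer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <cstring>

// Hypothetical stand-in for the histogram; the real HistoContainer layout differs.
struct ToyHisto {
  static constexpr std::uint32_t totbins() { return 8; }
  std::uint32_t off[8];    // per-bin offsets: the only member we want to zero
  std::uint32_t bins[16];  // content that must be left untouched
};

// Zero only the `off` member, addressed through offsetof on the owning struct
// (the same idea as the `poff` pointer mentioned for the CUDA version).
inline void zeroOffsets(ToyHisto* h) {
  auto* poff = reinterpret_cast<std::uint32_t*>(reinterpret_cast<char*>(h) + offsetof(ToyHisto, off));
  std::memset(poff, 0, sizeof(std::uint32_t) * ToyHisto::totbins());
}
```

On the host this is straightforward; the difficulty discussed in this thread is doing the equivalent on device memory through the Alpaka view interface.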

makortel · 2021-01-27T15:44:31Z

src/alpaka/AlpakaCore/HistoContainer.h

+
+      const int num_items = Histo::totbins();
+
+      auto psum_dBuf = alpaka::mem::buf::alloc<uint32_t, Idx>(device, Vec1::all(num_items));


Why not use HistoContainer::psws as in the CUDA implementation?

Basically, the prefixScan functions have slightly different interfaces in the CUDA and Alpaka versions. In the Alpaka version, there is no psws argument for doing a prefixScan.
I had looked at that specific point, and from what I remember, psws also exists in the Alpaka implementation of prefixScan, but as an internal variable (ws or something similar).

The issue is that here we are actually interested in storing the ws value in Histo (even though, from what I looked up, myHisto->psws seems never to be used in the end, at least in pixeltrack-standalone).
To preserve the same Histo interface, I have, to be safe, decided to set it later on (it very simply corresponds to the number of blocks). This is what storePrefixScanWorkingSpace does.
I have compared the Histo off and psws data members between CUDA and Alpaka versions and they are identical.

Ideally, I guess it would be better to have the exact same interface for a prefixScan with CUDA and Alpaka, but maybe there are reasons I am missing.
In any case was planning to do a review of prefixScan for my understanding:

There are calls to alpaka::wait::wait(queue); in between kernels, which I do not understand and maybe are not needed (we should only need those after a copy from device to host, when interested in what has been copied, or before a host function returns; well, whenever the host needs info from device).

I see multiplications of the number of elements by sizeof(type), while with Alpaka only the number of elements should be specified (the multiplication with sizeof(type) is already done internally). Hence I guess the tests are done with 'too many' elements.

I would like to check the SERIAL and TBB cases and handling of elements to be more confident about it.

In the Alpaka version, there is no psws argument for doing a prefixScan.

Ok, so the situation is a bit more complicated.

The CUDA version takes pc, that is used for the "synchronization point" to know which of the threads belong to the "last block". Alpaka version does not need that because that synchronization is handled with splitting the code in two kernels.

The Alpaka version takes psum (well the ...FirstStep doesn't seem to really need it), which in CUDA is handled as extern __shared__ T psum[] (does Alpaka support dynamic size array in shared memory?).

In both cases the actual value of pc/psum is irrelevant outside of the kernels, so maybe we could re-purpose HistoContainer::ppsws for the psum here? I think we are not interested in the actual content of psws, it seems to be used only as temporary workspace (and e.g. in Kokkos version HistoContainer does not have psws member at all).

In any case was planning to do a review of prefixScan for my understanding:

Sounds like it could be a good idea.

There are calls to alpaka::wait::wait(queue); in between kernels, which I do not understand and maybe are not needed (we should only need those after a copy from device to host, when interested in what has been copied, or before a host function returns; well, whenever the host needs info from device).

I agree, we should not have alpaka::wait::wait() calls within these helper functions.

I think we are not interested in the actual content of psws, it seems to be used only as temporary workspace

Yes I had only made a grep within the pixeltrack-standalone project, and there the Histogram data member called psws was never used.
Just doing a git grep in CMSSW, it appears it is never used there either.
I tend to think it is easier to just systematically keep the same interfaces, as it often eventually saves issues downstream, but true that here, we can indeed also skip it, and simply just remove psws data member from Histo class.

we can indeed also skip it, and simply just remove psws data member from Histo class.

Note that in https://github.com/cms-patatrack/pixeltrack-standalone/pull/165/files#r565772291 I suggested to keep it and use it for the psum argument of the multiBlockPrefixScan*(). Then you could avoid this memory allocation.

I'd still like to avoid this memory allocation (either by keeping Histo::psws or trying dynamic shared memory allocation in multiBlockPrefixScanSecondStep).

I'd still like to avoid this memory allocation (either by keeping Histo::psws or trying dynamic shared memory allocation in multiBlockPrefixScanSecondStep).

Yes doing that now.
Basically, since pc / psum are used internally only in PrefixScan, I do not really see why we should store psum inside Histo (even in place of psws). Additionally, psws was just a pointer to a uint32_t, whereas psum should be an array of num_items elements.

That's why I have just removed psws from Histo.
I think psum should be addressed internally in PrefixScan, indeed by dynamic shared memory allocation, as in the CUDA version.
I was thinking doing a prefixScan related PR, also modifying the prefixScan call sites, but ok this point can also be addressed in this PR. Doing a commit ~now :)

You're right, psws could not be used for psum because that is used as an array (whereas pc is used as an atomic counter).
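For readers following the pc / psum discussion, the two-kernel structure can be sketched on the host as below. This is only an emulation under stated assumptions: `scanFirstStep`, `scanSecondStep` and `multiBlockScan` are hypothetical names, blocks are run sequentially, and `blockSize` is assumed to be greater than zero. It is not the AlpakaCore/prefixScan.h implementation.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Step 1: each "block" does an inclusive scan of its own chunk in place and
// records the chunk total in psum[block].
inline void scanFirstStep(std::vector<std::uint32_t>& data, std::vector<std::uint32_t>& psum, std::size_t blockSize) {
  for (std::size_t block = 0; block < psum.size(); ++block) {
    const std::size_t begin = block * blockSize;
    const std::size_t end = std::min(begin + blockSize, data.size());
    std::uint32_t running = 0;
    for (std::size_t i = begin; i < end; ++i) {
      running += data[i];
      data[i] = running;
    }
    psum[block] = running;
  }
}

// Step 2: add to every element the total of all preceding blocks.
inline void scanSecondStep(std::vector<std::uint32_t>& data, const std::vector<std::uint32_t>& psum, std::size_t blockSize) {
  std::uint32_t carry = 0;
  for (std::size_t block = 0; block < psum.size(); ++block) {
    const std::size_t begin = block * blockSize;
    const std::size_t end = std::min(begin + blockSize, data.size());
    for (std::size_t i = begin; i < end; ++i) {
      data[i] += carry;
    }
    carry += psum[block];
  }
}

// Driver: in this sketch the workspace holds one entry per block.
inline std::vector<std::uint32_t> multiBlockScan(std::vector<std::uint32_t> data, std::size_t blockSize) {
  const std::size_t nblocks = (data.size() + blockSize - 1) / blockSize;
  std::vector<std::uint32_t> psum(nblocks, 0);
  scanFirstStep(data, psum, blockSize);
  scanSecondStep(data, psum, blockSize);
  return data;
}
```

Splitting the work into two steps plays the role of the synchronization point that the CUDA version handles with the pc counter, and psum is pure workspace whose content is irrelevant once the scan is done.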

makortel · 2021-01-27T15:45:44Z

src/alpaka/AlpakaCore/HistoContainer.h

+          queue,
+          alpaka::kernel::createTaskKernel<Acc1>(
+              workDivWith1Block, multiBlockPrefixScanSecondStep<uint32_t>(), poff, poff, psum_d, num_items, nblocks));
+      alpaka::wait::wait(queue);


Is this wait() really needed or could it be left for the caller to decide whether to block or not?

Yes that was a painful segfault to fix. Basically here, workdiv and kernels are defined in host function launchFinalize scope.
If that function returns before the device work is completed, we run into segfaults / spurious results.

I presume a clean way would be to have only one host function, instead of nested host functions in which device work is defined.
But since here this is just a test where we do not care about perf, and want to keep the code similar to the CUDA versions, I just added those.

Sounds strange. I would have naively expected alpaka::queue::enqueue() etc to copy the work division. Or could it be the allocated buffer that must stay alive until the work finishes (that I believe)?

By a quick look through the Alpaka code, they seem to copy both the work division and the functor within the alpaka::kernel::createTaskKernel() call. This would need to be investigated further (because I'd really want these wait()s to go), but that can be done after this PR.

Sounds strange. I would have naively expected alpaka::queue::enqueue() etc to copy the work division. Or could it be the allocated buffer that must stay alive until the work finishes (that I believe)?

Yes, so looking at the execution task class (here for example with GPU CUDA backend):
http://alpaka-group.github.io/alpaka/TaskKernelGpuUniformCudaHipRt_8hpp_source.html#l00134

the kernel function object and its arguments are move / copy constructed and stored as TaskKernel data members.

the workdiv is also move / copy constructed, and stored as data member of the base class (TaskKernel is derived from WorkDiv). http://alpaka-group.github.io/alpaka/WorkDivMembers_8hpp_source.html#l00052

Now, a raw pointer is passed as argument (poff), and was obtained from the Alpaka-equivalent to the get() (getPtrNative) from a reference-counting buffer handle.
So, if the handle is reference-counted to 0 before the device work finishes, we end up with a dangling pointer, and we are still screwed.
However, that is not the case, because the owning pointer is h_d, and is defined out of the scope of this function.

So everything should be fine.

Trying to run without the alpaka::wait::wait(queue), I still sometimes get a segfault with ./HistoContainer_t.tbb. It is a spurious segfault, which makes it hard to know whether something fixes it or not. It does seem to stem from a synchronization issue, but the issue is not necessarily (and should not be) here.
Hence will remove the alpaka::wait::wait(queue).
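The lifetime argument above can be sketched with a toy, single-threaded deferred queue. Everything here is hypothetical (`ToyQueue`, `ToyWorkDiv`, `launchFill`, `runExample`) and only mimics the reasoning: the task holds its own copies of the work division and functor, so only the memory behind the raw pointer has to outlive the enqueued work.

```cpp
#include <cassert>
#include <functional>
#include <memory>
#include <utility>
#include <vector>

// Toy deferred queue: enqueue() stores a copy of the task, wait() runs all of them.
class ToyQueue {
public:
  void enqueue(std::function<void()> task) { tasks_.push_back(std::move(task)); }
  void wait() {
    for (auto& task : tasks_) {
      task();
    }
    tasks_.clear();
  }

private:
  std::vector<std::function<void()>> tasks_;
};

struct ToyWorkDiv {
  int blocks;
  int threads;
};

// Host helper that returns before the work runs. workDiv is a local variable,
// but the lambda captures it by value, so the task owns its own copy.
inline void launchFill(ToyQueue& queue, int* data) {
  const ToyWorkDiv workDiv{2, 4};
  queue.enqueue([workDiv, data] {
    for (int i = 0; i < workDiv.blocks * workDiv.threads; ++i) {
      data[i] = i;
    }
  });
}  // no wait() here: the caller decides when to synchronize

inline int runExample() {
  // The owning handle lives in the caller, outside launchFill's scope.
  auto buffer = std::make_shared<std::vector<int>>(8, -1);
  ToyQueue queue;
  launchFill(queue, buffer->data());
  queue.wait();  // buffer is still alive, so the raw pointer is valid
  return (*buffer)[7];
}
```

If `buffer` were instead created inside `launchFill` and released on return, the stored task would write through a dangling pointer, which is the failure mode ruled out in the comment above.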

makortel · 2021-01-27T15:48:19Z

src/alpaka/AlpakaCore/HistoContainer.h

+
+      alpaka::queue::enqueue(queue,
+                             alpaka::kernel::createTaskKernel<Acc1>(workDiv, fillFromVector(), h, nh, v, offsets));
+      alpaka::wait::wait(queue);


Is this wait() really needed?

Same as before: end of host function scope.

makortel · 2021-01-27T16:43:23Z

src/alpaka/test/alpaka/HistoContainer_t.cc

+
+    alpaka::mem::view::copy(queue, v_d, v_buf, N);
+
+    alpaka::mem::view::set(queue, h_d, 0, 1u);


I don't see a corresponding memset in CUDA code.

Yes true, I have added it because while using h_d several times in a loop, we want to be sure that the next iteration starts with sth clean.
This was also causing segfault.

In practice, I would just declare h_d inside the loop (and in general obviously just keep variable declarations as close as possible to when they are used).
But here I would avoid changing too much the code in addition to the port to Alpaka, to be able to keep the portability 'portable'.

makortel · 2021-01-27T16:52:15Z

src/alpaka/test/alpaka/OneHistoContainer_t.cc

+
+    alpaka::queue::enqueue(
+        queue,
+        alpaka::kernel::createTaskKernel<Acc1>(workDiv, setZeroBins(), alpaka::mem::view::getPtrNative(hist_dbuf)));


In CUDA all of these are in a single kernel, separated by __syncthreads(). Why are they split here?

Yes, here I have followed what was also done in the Kokkos version, since in this test we do not care about perf.
Can also push a one-kernel version.

Ah, you followed Kokkos version. I don't remember why we split the code there. Probably doesn't matter much (for a unit test).

For memo: In the end, I have based the OneHistoContainer test on a single-kernel version, to be as close as possible to the CUDA test version.

makortel · 2021-01-27T16:58:07Z

src/alpaka/test/alpaka/OneToManyAssoc_t.cc

+  alpaka::mem::view::copy(queue, v_dbuf, tr_hbuf, N);
+
+  auto a_dbuf = alpaka::mem::buf::alloc<Assoc, Idx>(device, 1u);
+  alpaka::mem::view::set(queue, a_dbuf, 0, 1u);


I don't see a corresponding memset in CUDA code.

Yes just ended up being cleaner.

makortel · 2021-01-27T17:14:11Z

src/alpaka/AlpakaCore/HistoContainer.h

+
+        uint32_t endElementIdx = endElementIdxNoStride[0u];
+        for (uint32_t threadIdx = firstElementIdxNoStride[0u]; threadIdx < nt; threadIdx += gridDimension) {
+          for (uint32_t i = threadIdx; i < std::min(endElementIdx, nt); ++i) {


I see this pattern

const auto gridDimension = ...;
const auto [firstNoStride, endNoStride] = ...;
for (auto threadIdx = firstNoStride, endElementIdx = endNoStride; threadIdx < MAX;
     threadIdx += gridDimension, endElementIdx += gridDimension) {
  for (auto i = threadIdx; i < std::min(endElementIdx, MAX); ++i) {
    <body>
  }
}

repeats in almost(?) every kernel in this PR. That's 4+ lines of repetitive and error prone code. I'm wondering if it would be worth to abstract that along

// the name should be more descriptive...
template <typename T_Acc, typename N, typename Func>
void for_each_element(const T_Acc& acc, const N nitems, Func func) {
  const auto gridDimension = ...;
  const auto [firstNoStride, endNoStride] = ...;
  for (auto threadIdx = firstNoStride, endElementIdx = endNoStride; threadIdx < nitems;
       threadIdx += gridDimension, endElementIdx += gridDimension) {
    for (auto i = threadIdx; i < std::min(endElementIdx, nitems); ++i) {
      func(i);
    }
  }
}

that could be called here along

const uint32_t nt = offsets[nh];
for_each_element(acc, nt, [&](uint32_t i) {
  auto off = alpaka_std::upper_bound(offsets, offsets + nh + 1, i);
  ...
});

? I know this starts to look like we would be building our own abstraction layer on top of Alpaka, but to me the boilerplate calls for something.

That said, I'm fine if this is left for a future PR.

Yes absolutely, was thinking marking this as a to-do comment for this PR.
Maybe in an additional PR indeed is better.
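As a rough illustration of the helper proposed in this thread, here is a host-only emulation. The signature is an assumption: instead of taking the accelerator object, it receives the per-thread range and grid stride explicitly, and `countVisits` is a hypothetical driver that plays all threads in turn. `gridDimension` is assumed to be non-zero.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <vector>

// One call plays the role of a single thread.
// firstNoStride / endNoStride: element range of this thread in the first pass.
// gridDimension: number of elements covered by the whole grid per pass.
template <typename Func>
void for_each_element(std::uint32_t firstNoStride,
                      std::uint32_t endNoStride,
                      std::uint32_t gridDimension,
                      std::uint32_t nitems,
                      Func func) {
  for (std::uint32_t threadIdx = firstNoStride, endElementIdx = endNoStride; threadIdx < nitems;
       threadIdx += gridDimension, endElementIdx += gridDimension) {
    for (std::uint32_t i = threadIdx; i < std::min(endElementIdx, nitems); ++i) {
      func(i);
    }
  }
}

// Emulate a grid of nThreads threads, each handling elemsPerThread consecutive
// elements per pass, and count how often every element index is visited.
inline std::vector<int> countVisits(std::uint32_t nThreads, std::uint32_t elemsPerThread, std::uint32_t nitems) {
  std::vector<int> visits(nitems, 0);
  const std::uint32_t gridDimension = nThreads * elemsPerThread;
  for (std::uint32_t t = 0; t < nThreads; ++t) {
    const std::uint32_t first = t * elemsPerThread;
    for_each_element(first, first + elemsPerThread, gridDimension, nitems, [&](std::uint32_t i) { ++visits[i]; });
  }
  return visits;
}
```

With the stride and the `std::min` truncation kept in one place, each element index should be visited exactly once regardless of how `nitems` compares to the grid size, which is the property the repeated hand-written loops are easy to get wrong.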


ghugo83 · 2021-01-28T16:01:03Z

Ok just removed the using namespace ALPAKA_ACCELERATOR_NAMESPACE; and prepended it when needed.
That commit looks a bit ugly though somehow, if not ok let me know, I could also just include the caller functions within a ALPAKA_ACCELERATOR_NAMESPACE namespace

ghugo83 · 2021-01-28T16:18:53Z

As a memo 2 points will be addressed outside of this PR:

Add a helper function to directly handle thread/element indices in the += gridDimension stride case.
Dig a bit more inside prefixScan.

makortel · 2021-01-28T20:58:53Z

src/alpaka/test/alpaka/HistoContainer_t.cc

@@ -8,10 +8,10 @@
 #include "AlpakaCore/alpakaWorkDivHelper.h"
 #include "AlpakaCore/HistoContainer.h"

-using namespace ALPAKA_ACCELERATOR_NAMESPACE;
-


Sorry, but actually here (and in other unit test files) the using namespace ... is ok. It's just in headers where it causes problems (although it does make the namespace of the various types more clear in source files as well).

Yes but at this stage, found it more consistent to remove it everywhere, rather than having a mix of 'using namespace' and ALPAKA_ACCELERATOR_NAMESPACE::

makortel · 2021-01-28T21:11:41Z

That commit looks a bit ugly though somehow, if not ok let me know, I could also just include the caller functions within a ALPAKA_ACCELERATOR_NAMESPACE namespace

Good question. In the Kokkos port (which has a similar structure) the aim was to either put functions in the ALPAKA_ACCELERATOR_NAMESPACE or use template parameters to distinguish with different Accelerators. Although looking at e.g. launchFinalize(), both approaches feel a bit unappealing.

makortel · 2021-01-28T22:45:57Z

That commit looks a bit ugly though somehow, if not ok let me know, I could also just include the caller functions within a ALPAKA_ACCELERATOR_NAMESPACE namespace

Good question. In the Kokkos port (which has a similar structure) the aim was to either put functions in the ALPAKA_ACCELERATOR_NAMESPACE or use template parameters to distinguish with different Accelerators. Although looking at e.g. launchFinalize(), both approaches feel a bit unappealing.

I think the template approach would actually become feasible (with look&feel close to Kokkos version) if we'd have only one Accelerator type for each backend (#144).

ghugo83 · 2021-01-29T09:58:11Z

Yes usually I would not have used 'using namespace', but here the fact that the namespace is acc-dependent feels a bit weird.
Or maybe this namespace could be renamed to sth shorter?
Like alpaka_acc::

I think the template approach would actually become feasible (with look&feel close to Kokkos version) if we'd have only one Accelerator type for each backend (#144).

Ok. Or why not templating on each possible Acc type?
But fully agree, not sure it is nicer than just using a namespace.


ghugo83 · 2021-01-29T14:32:55Z

Ok for the OneHistoContainer test, just added the 1-kernel version instead, as it makes it easier to debug and compare with CUDA tests.

makortel · 2021-01-29T14:33:33Z

Yes usually I would not have used 'using namespace', but here the fact that the namespace is acc-dependent feels a bit weird.
Or maybe this namespace could be renamed to sth shorter?
Like alpaka_acc::

The ALPAKA_ACCELERATOR_NAMESPACE is a macro, that is set based on compiler arguments. I think it is good to keep it clear everywhere that it is a macro.

In the original idea (and what happens in the Kokkos version) the ALPAKA_ACCELERATOR_NAMESPACE would have to be written at most once per file along

namespace ALPAKA_ACCELERATOR_NAMESPACE {
  namespace cms::alpakatools {
  }
}

(which we could do here too)
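A compilable sketch of that layout, under assumptions: the macro is hard-coded to a made-up name (`toy_serial_backend`) instead of being set by compiler arguments, and `roundUpBlocks` / `blocksFor` are hypothetical helpers used only to show a call site.

```cpp
#include <cassert>

// Assumption: in the real build this macro comes from the compiler arguments.
#ifndef ALPAKA_ACCELERATOR_NAMESPACE
#define ALPAKA_ACCELERATOR_NAMESPACE toy_serial_backend
#endif

// The macro appears once per file, around everything backend-specific.
namespace ALPAKA_ACCELERATOR_NAMESPACE {
  namespace cms {
    namespace alpakatools {
      // Hypothetical helper: number of blocks needed to cover nitems.
      inline int roundUpBlocks(int nitems, int blockSize) { return (nitems + blockSize - 1) / blockSize; }
    }  // namespace alpakatools
  }  // namespace cms
}  // namespace ALPAKA_ACCELERATOR_NAMESPACE

// Call site outside the namespace, fully qualified, with no using-directive.
inline int blocksFor(int nitems) {
  return ALPAKA_ACCELERATOR_NAMESPACE::cms::alpakatools::roundUpBlocks(nitems, 256);
}
```

Code placed inside the outer namespace block can refer to the backend types unqualified, so headers avoid `using namespace` while the macro stays visible as a macro.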

I think the template approach would actually become feasible (with look&feel close to Kokkos version) if we'd have only one Accelerator type for each backend (#144).

Ok. Or why not templating on each possible Acc type?

I'm not really sure what you mean but I'll guess anyway. The annoyance stems from the function deciding the dimension of an index, and the caller of the function not having to know that. One option would be for the caller to give all the Acc1, Acc2, Acc3 types as three template parameters, but to me that would look cumbersome as a general pattern. (if you meant something different, please elaborate)

Another option that came to my mind would be to define our own "tag type" for each accelerator, and then a traits class template that could be used to get all the AccN, DevAccN, PltfAccN based on a single template argument. But given other motivations to use only single Accelerator type per backend (#144), I'm not very enthusiastic on this option.
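The tag-plus-traits idea could look roughly like the following. All types here are empty placeholders invented for the sketch (`SerialAcc1`, `CudaAcc1`, and so on); they stand in for the backend-specific Alpaka types and are not real Alpaka names.

```cpp
#include <cassert>
#include <type_traits>

// Placeholder types standing in for backend-specific accelerator and device types.
struct SerialAcc1 {};
struct SerialAcc2 {};
struct SerialDev {};
struct CudaAcc1 {};
struct CudaAcc2 {};
struct CudaDev {};

// One tag type per backend.
struct SerialTag {};
struct CudaTag {};

// Traits class mapping a single tag to all the types of that backend.
template <typename Tag>
struct AcceleratorTraits;

template <>
struct AcceleratorTraits<SerialTag> {
  using Acc1 = SerialAcc1;
  using Acc2 = SerialAcc2;
  using Dev = SerialDev;
};

template <>
struct AcceleratorTraits<CudaTag> {
  using Acc1 = CudaAcc1;
  using Acc2 = CudaAcc2;
  using Dev = CudaDev;
};

// The caller passes only the tag; the callee decides which dimensionality it needs.
template <typename Tag>
typename AcceleratorTraits<Tag>::Acc1 makeAcc1() {
  return {};
}
```

A caller would then write `makeAcc1<CudaTag>()` without knowing whether the function works with a 1D or 2D accelerator type internally.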

makortel · 2021-01-29T14:42:21Z

src/alpaka/AlpakaCore/alpakaWorkDivHelper.h

-    ALPAKA_FN_ACC std::pair<Vec<T_Dim>, Vec<T_Dim>> element_global_index_range(const T_Acc& acc,
-                                                                               const Vec<T_Dim>& maxNumberOfElements) {
+    template <typename T_Acc, typename T_Dim = alpaka::dim::Dim<T_Acc>>
+    ALPAKA_FN_ACC std::pair<Vec<T_Dim>, Vec<T_Dim>> element_global_index_range_uncut(const T_Acc& acc) {


I would name this function as element_global_index_range, and the other something along element_global_index_range_max (preferring to tell what a function does rather than what it does not do). But a minor point for a prototype.

ok no pb
Yes, the point was to emphasize the risky version (the one with no truncation by max), but the other way round can be more elegant. Will change it.
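The difference between the two variants can be sketched in 1D on the host. The function names follow the renaming done later in this PR, but the signatures are assumptions: the real helpers take the accelerator object, whereas here the thread index and the number of elements per thread are passed explicitly.

```cpp
#include <algorithm>
#include <cassert>
#include <cstdint>
#include <utility>

// Non-truncated range [first, end) of the elements owned by one thread.
inline std::pair<std::uint32_t, std::uint32_t> element_global_index_range(std::uint32_t threadIdxGlobal,
                                                                          std::uint32_t elemsPerThread) {
  const std::uint32_t first = threadIdxGlobal * elemsPerThread;
  return {first, first + elemsPerThread};
}

// Same range, with the end clamped to the number of elements of interest.
inline std::pair<std::uint32_t, std::uint32_t> element_global_index_range_truncated(
    std::uint32_t threadIdxGlobal, std::uint32_t elemsPerThread, std::uint32_t maxNumberOfElements) {
  auto range = element_global_index_range(threadIdxGlobal, elemsPerThread);
  range.second = std::min(range.second, maxNumberOfElements);
  return range;
}
```

The non-truncated version is the risky one in the sense that the caller must apply the bound itself before indexing, for instance inside a strided loop.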

makortel · 2021-01-29T14:47:06Z

src/alpaka/AlpakaCore/HistoContainer.h

+
+      const int num_items = Histo::totbins();
+
+      auto psum_dBuf = alpaka::mem::buf::alloc<uint32_t, Idx>(device, Vec1::all(num_items));


I'd still like to avoid this memory allocation (either by keeping Histo::psws or trying dynamic shared memory allocation in multiBlockPrefixScanSecondStep).

…nstead of global memory. This changes the prefixSCan interface (closer to CUDA version), hence the call sites. Important: To be noted is that in any case, the amount of memory needed was not num_items * sizeof(T), only num_blocks * sizeof(T) is sufficient.

…and element_global_index_range_truncated to compute range truncated by max number of elements of interest.

ghugo83 · 2021-02-01T10:38:20Z

but to me that would look cumbersome as a general pattern

Yes you are right.

The ALPAKA_ACCELERATOR_NAMESPACE is a macro, that is set based on compiler arguments. I think it is good to keep it clear everywhere that it is a macro.

Yes just ended up having ALPAKA_ACCELERATOR_NAMESPACE::Queue, etc, in the end.

makortel · 2021-02-01T19:47:25Z

src/alpaka/AlpakaCore/HistoContainer.h

+    template <typename Histo>
+    ALPAKA_FN_HOST ALPAKA_FN_INLINE __attribute__((always_inline)) void launchFinalize(
+        Histo *__restrict__ h,
+        const ALPAKA_ACCELERATOR_NAMESPACE::DevAcc1 &device,


After removing the memory allocation, the device parameter is not needed anymore

Suggested change

-        const ALPAKA_ACCELERATOR_NAMESPACE::DevAcc1 &device,

makortel · 2021-02-01T19:47:46Z

src/alpaka/AlpakaCore/HistoContainer.h

+        uint32_t const *__restrict__ offsets,
+        uint32_t totSize,
+        unsigned int nthreads,
+        const ALPAKA_ACCELERATOR_NAMESPACE::DevAcc1 &device,


device parameter can be removed from here as well.

Suggested change

-        const ALPAKA_ACCELERATOR_NAMESPACE::DevAcc1 &device,

makortel · 2021-02-01T19:48:38Z

src/alpaka/AlpakaCore/HistoContainer.h

+      alpaka::queue::enqueue(queue,
+                             alpaka::kernel::createTaskKernel<ALPAKA_ACCELERATOR_NAMESPACE::Acc1>(
+                                 workDiv, countFromVector(), h, nh, v, offsets));
+      launchFinalize(h, device, queue);


Following removal of device.

Suggested change

-      launchFinalize(h, device, queue);
+      launchFinalize(h, queue);

Ha yes true, this argument is not necessary anymore now, thanks. I just removed it.

makortel · 2021-02-02T16:54:05Z

I created two issues #168 and #169 to remind some of the things discussed here.

ghugo83 added 19 commits January 19, 2021 16:01

Start porting HistoContainer to AlpakaCore

faaede0

Finish first-try porting of HistoContainer and its test

adb66eb

HistoContainer and its test now compiles properly. Still segfaults is…

f548e27

…sues

Tests now run smoothly and pass all assertions in serial and CUDA tes…

c4878d5

…t. Still issue in TBB case.

Seems to need to add alpaka::wait::wait(queue) around host function c…

d8c9ad6

…alls when queue is passed as an argument (even as a reference)

Can also place them inside the host function.

5bacd9c

Remove commented code

46805ad

[alpaka] Also offload h->off initialization, as is done for Kokkos. h…

5b7b5ab

…->psws also needs to be set

minor cleaning

0cd5c71

[alpaka] Add OneToManyAssoc_t test

e8c66b6

Change index variables names for strided access.

b465663

Minor fixes

c6e30b8

[alpaka] Add OneHistoContainer_t test

995af06

Fixes in OneHistoContainer_t

82ffd14

[alpaka] Important to initlize pointers properly, especially when loc…

c6477b6

…ated within a loop.

Important fix in OneHistoContainer_t test: correct strided access, no…

94f94e6

…w all assert pass and histo values are identical as with CUDA (for same input matrix v).

Also correct strided access in serial and TBB cases in HistoContainer…

2ff31f9

… and OneToManyAssoc_t

Simplify Vec construction

8e9e776

clang-format

8aaa1f3

makortel added the alpaka label Jan 27, 2021

makortel reviewed Jan 27, 2021


ghugo83 added 2 commits January 28, 2021 15:36

Remove alpaka::wait::wait(queue); before host function end of scope. …

b6e9adc

…All workdiv / function object / arguments info are copied anyway, and the owning pointer to the histogram is defined outside the function scope

Remove using namespace ALPAKA_ACCELERATOR_NAMESPACE, and directly pre…

94105db

…pend ALPAKA_ACCELERATOR_NAMESPACE when needed. Could also place entire callers within ALPAKA_ACCELERATOR_NAMESPACE namespace.

makortel reviewed Jan 28, 2021


OneHistoContainer: Add 1-kernel version

c08d205

ghugo83 added 5 commits January 29, 2021 13:26

Add cms::alpakatools::element_global_index_range_uncut to avoid havin…

88545ea

…g to call element_global_index_range for each possible max number of elements.

Include endElementIdx within loop. NB: Will need to add a dedicated h…

ba641ee

…elper function in cms::alpakatools, but this is already nicer.

Remove psws from HistoContainer class, never used. It corresponds to …

5c9a8b9

…the number of blocks used in a prefix scan.

Indices with += gridDimension stride: add endElementIdx within loop, …

94176e9

…which already makes things clearer. NB: TO DO: add a dedicated helper function.

clang-format

cbb609f

makortel reviewed Jan 29, 2021


ghugo83 added 3 commits February 1, 2021 11:05

Renaming: element_global_index_range to compute non-truncated range, …

c1824b8

…and element_global_index_range_truncated to compute range truncated by max number of elements of interest.

clang-format

52ccf49

minor cleaning

788556d

ghugo83 mentioned this pull request Feb 1, 2021

Adjustments in AlpakaCore/prefixScan and its test + Add helper functions to handle workdiv #167

Merged

makortel reviewed Feb 1, 2021


Forgot to remove device, now that memory allocation is removed

548b4b9

makortel merged commit 2f153e5 into cms-patatrack:master Feb 2, 2021

This was referenced Feb 2, 2021

How to use ALPAKA_ACCELERATOR_NAMESPACE in headers? #168

Open

Add an helper function to directly handle thread/element indices in strided case #169

Open

