Next prototype of the framework integration

makortel committed Aug 6, 2018
1 parent c2aba96 commit 741388b

Showing 31 changed files with 1,867 additions and 0 deletions.
2 changes: 2 additions & 0 deletions HeterogeneousCore/CUDACore/BuildFile.xml
@@ -1,10 +1,12 @@
<use name="FWCore/Concurrency"/>
<use name="FWCore/Framework"/>
<use name="FWCore/ServiceRegistry"/>
<use name="FWCore/ParameterSet"/>
<use name="DataFormats/Provenance"/>
<use name="HeterogeneousCore/CUDAServices"/>
<use name="cuda"/>
<use name="cuda-api-wrappers"/>
<use name="root"/>

<export>
<lib name="1"/>
283 changes: 283 additions & 0 deletions HeterogeneousCore/CUDACore/README.md
@@ -0,0 +1,283 @@
# Next iteration of the prototype for CMSSW interface to heterogeneous algorithms

## Introduction

The current prototype with `HeterogeneousEDProducer` and
`HeterogeneousProduct` is documented [here](../Producer/README.md).
The main differences with respect to that prototype are
* Split device-specific code to different EDProducers
* Plug components together in the configuration

This page documents the CUDA integration and briefly discusses how to
extend it to other devices. It will be expanded if/when this prototype
gets deployed and `HeterogeneousEDProducer` is retired.

## Sub-packages
* [`CUDACore`](#cuda-integration) CUDA-specific core components
* [`CUDAServices`](../CUDAServices) Various edm::Services related to CUDA
* [`CUDAUtilities`](../CUDAUtilities) Various utilities for CUDA kernel code

# CUDA integration

## Choosing device

The device-choosing logic is split between an EDProducer, an EDFilter,
and the use of `Path`s in the configuration.

First, a `CUDADeviceChooser` EDProducer is run. It contains the logic
to decide whether the following chain of EDModules should run on a
CUDA device, and if so, on which one. If it decides "yes", it produces
a `CUDAToken`, which contains the device id and a CUDA stream. If it
decides "no", it does not produce anything.

The next step is a `CUDADeviceFilter` EDFilter. It checks whether the
`CUDADeviceChooser` produced a product: if yes, it returns `true`; if
no, it returns `false`.

Finally, the pieces need to be put together in the configuration. The
`CUDADeviceChooser` can be "anywhere", but the `CUDADeviceFilter`
should be the first module on a `cms.Path`, followed by the CUDA
EDProducers (in the future it may become sufficient to have only the
first EDProducer of a chain in the `Path`).
```python
process.fooCUDADevice = cms.EDProducer("CUDADeviceChooser")
process.fooCUDADeviceFilter = cms.EDFilter("CUDADeviceFilter",
    src = cms.InputTag("fooCUDADevice")
)
process.fooCUDA = cms.EDProducer("FooProducerCUDA")
process.fooPathCUDA = cms.Path(
    process.fooCUDADeviceFilter + process.fooCUDA
)
process.fooTask = cms.Task(
    process.fooCUDADevice
)
```

## Data model

The GPU data can be a single pointer to device data, or a class/struct
containing such pointers (among other things). When the data is put
into the event, it is wrapped into the `CUDA<T>` template, which holds
* the GPU data
  * it must be movable, but there are no other restrictions (except that ROOT dictionaries need to be generated for it)
* the current device where the data was produced, and the CUDA stream the data was produced with
* [CUDA event for synchronization between multiple CUDA streams](#synchronizing-between-cuda-streams)

Note that the `CUDA<T>` wrapper can be constructed only with
`CUDAScopedContext::wrap()`, and the data `T` can be obtained from it
only with `CUDAScopedContext::get()`, as described further below.
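
As a concrete example, a hypothetical `GPUClusters` product (the type
used throughout the snippets below) could be a simple struct of device
pointers:

```cpp
// Hypothetical example product; the real types are defined by each
// algorithm. The framework only requires the type to be movable (and
// that a ROOT dictionary can be generated for CUDA<GPUClusters>).
struct GPUClusters {
  float *x = nullptr;         // device pointer to cluster x coordinates
  float *y = nullptr;         // device pointer to cluster y coordinates
  unsigned int nClusters = 0; // number of clusters in the arrays above
};
```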

## CUDA EDProducer

### Class declaration

For the time being (this may change in the future) a CUDA producer
should inherit from `CUDAStreamEDProducer<...>`. The template parameters are
the usual
[stream producer extensions](https://twiki.cern.ch/twiki/bin/view/CMSPublic/FWMultithreadedFrameworkStreamModuleInterface#Template_Arguments).
Note that contrary to `HeterogeneousEDProducer`, the `ExternalWork`
extension is **not** implied.

```cpp
#include "HeterogeneousCore/CUDACore/interface/CUDAStreamEDProducer.h"
class FooProducerCUDA: public CUDAStreamEDProducer<> {
...
```
### Memory allocation
The only effect of the `CUDAStreamEDProducer` base class is that
`beginStream(edm::StreamID)` is replaced with
`beginStreamCUDA(edm::StreamID)`. This is done in order to set the
current CUDA device before the user code starts. **If the algorithm
has to allocate memory buffers for the duration of the whole job, the
recommended place is here.** Note that a CUDA stream is not passed to
the user code. If a CUDA stream is really needed, the developer should
create and synchronize one themselves (although if this turns out to
be a common pattern, we should try to support it in the framework).
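
A minimal sketch of such job-long allocations (the buffer size, the
data members, and the absence of error checking are illustrative
assumptions):

```cpp
// Sketch only: d_x_, d_y_ and maxClusters_ are hypothetical members
// of FooProducerCUDA; error checking is omitted for brevity.
void FooProducerCUDA::beginStreamCUDA(edm::StreamID id) {
  // The current CUDA device has already been set by the base class
  cudaMalloc(&d_x_, maxClusters_ * sizeof(float));
  cudaMalloc(&d_y_, maxClusters_ * sizeof(float));
}

void FooProducerCUDA::endStream() {
  cudaFree(d_x_);
  cudaFree(d_y_);
}
```
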
### Setting the current device
A CUDA producer should read either a `CUDAToken` (from
`CUDADeviceChooser`) or one or more `CUDA<T>` products. Then, in
`acquire()`/`produce()`, it should construct a `CUDAScopedContext` from
one of them:
```cpp
// From CUDAToken
edm::Handle<CUDAToken> htoken;
iEvent.getByToken(srcToken_, htoken);
auto ctx = CUDAScopedContext(*htoken);

// From CUDA<T>
edm::Handle<CUDA<GPUClusters> > handle;
iEvent.getByToken(srcToken_, handle);
auto ctx = CUDAScopedContext(*handle);
```

`CUDAScopedContext` works in the RAII manner and does the following:
* Sets the current device (for the scope) from `CUDAToken`/`CUDA<T>`
* Gives access to the CUDA stream the algorithm should use to queue asynchronous work
* Calls `edm::WaitingTaskWithArenaHolder::doneWaiting()` when necessary
* [Synchronizes between CUDA streams if necessary](#synchronizing-between-cuda-streams)
* Needed to get/put `CUDA<T>` from/to the event

In case of multiple input products, possibly from different CUDA
streams and/or CUDA devices, this approach gives the developer full
control over which of them the algorithm's kernels are run in.
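
For example, with two inputs one could run the algorithm's kernels in
the device and CUDA stream of the first input (a sketch; `GPUDigis`
and the `gpuAlgo_` member are illustrative assumptions):

```cpp
void FooProducerCUDA::acquire(const edm::Event& iEvent, const edm::EventSetup&,
                              edm::WaitingTaskWithArenaHolder holder) {
  edm::Handle<CUDA<GPUClusters> > hclus;
  iEvent.getByToken(clusterToken_, hclus);
  edm::Handle<CUDA<GPUDigis> > hdigis;
  iEvent.getByToken(digiToken_, hdigis);

  // Run everything in the device+stream of the cluster input
  auto ctx = CUDAScopedContext(*hclus, std::move(holder));
  GPUClusters const& clus = ctx.get(*hclus);
  // get() synchronizes with the digi input's CUDA stream if it differs
  GPUDigis const& digis = ctx.get(*hdigis);
  gpuAlgo_.matchAsync(clus, digis, ctx.stream());
}
```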

### Getting input

The real product (`T`) can be obtained from `CUDA<T>` only with the
help of `CUDAScopedContext`.

```cpp
edm::Handle<CUDA<GPUClusters> > hclus;
iEvent.getByToken(srcToken_, hclus);
GPUClusters const& clus = ctx.get(*hclus);
```

This step is needed to
* check that the data is on the same CUDA device
  * if not, throw an exception (with unified memory one could prefetch instead)
* synchronize between the CUDA streams if they are different

### Calling the CUDA kernels

There is nothing special here, except that the CUDA stream to use can
be obtained from the `CUDAScopedContext`:

```cpp
gpuAlgo.makeClustersAsync(..., ctx.stream());
```

### Putting output

The GPU data needs to be wrapped into the `CUDA<T>` template with `CUDAScopedContext::wrap()`:

```cpp
GPUClusters clusters = gpuAlgo.makeClustersAsync(..., ctx.stream());
std::unique_ptr<CUDA<GPUClusters> > ret = ctx.wrap(clusters);
iEvent.put(std::move(ret));

// or in one line
iEvent.put(ctx.wrap(gpuAlgo.makeClustersAsync(..., ctx.stream())));
```

This step is needed to
* store the current device and CUDA stream into `CUDA<T>`
* record the CUDA event needed for CUDA stream synchronization

### `ExternalWork` extension

Everything above works both with and without `ExternalWork`.

Without `ExternalWork` the `EDProducer`s act similarly to a TBB
flowgraph "streaming node", i.e. they just queue more asynchronous
work in their `produce()`.
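
A GPU->GPU producer without `ExternalWork` could then look along these
lines (a sketch; `GPUTracks` and the `gpuAlgo_` member are
illustrative assumptions):

```cpp
void FooProducerCUDA::produce(edm::Event& iEvent, const edm::EventSetup&) {
  edm::Handle<CUDA<GPUClusters> > hclus;
  iEvent.getByToken(srcToken_, hclus);
  auto ctx = CUDAScopedContext(*hclus);

  // Only queues asynchronous work in the CUDA stream; nothing blocks here
  GPUTracks tracks = gpuAlgo_.makeTracksAsync(ctx.get(*hclus), ctx.stream());
  iEvent.put(ctx.wrap(std::move(tracks)));
}
```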

`ExternalWork` is needed when one would otherwise call
`cudaStreamSynchronize()`, e.g. when transferring something needed by
downstream DQM to the CPU, or when queueing further asynchronous work
after a synchronization point. With `ExternalWork` an `acquire()`
method needs to be implemented that gets an
`edm::WaitingTaskWithArenaHolder` parameter. The
`WaitingTaskWithArenaHolder` should then be passed along to the
constructor of `CUDAScopedContext`:

```cpp
void acquire(..., edm::WaitingTaskWithArenaHolder waitingTaskHolder) {
  edm::Handle<CUDA<GPUClusters> > handle;
  iEvent.getByToken(token_, handle);
  // can also copy instead of move if waitingTaskHolder is needed for something else as well
  auto ctx = CUDAScopedContext(*handle, std::move(waitingTaskHolder));
  ...
```
When constructed this way, `CUDAScopedContext` registers a callback
function to the CUDA stream in its destructor to call
`waitingTaskHolder.doneWaiting()`.

A GPU->GPU producer needs a `CUDAScopedContext` also in its
`produce()`. Currently the best way is to read the input again in
`produce()` and construct the `CUDAScopedContext` from there. This
point will be improved.
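
In practice the pattern looks something like the following sketch
(`gpuAlgo_` and its methods are illustrative assumptions):

```cpp
void FooProducerCUDA::acquire(const edm::Event& iEvent, const edm::EventSetup&,
                              edm::WaitingTaskWithArenaHolder holder) {
  edm::Handle<CUDA<GPUClusters> > handle;
  iEvent.getByToken(token_, handle);
  auto ctx = CUDAScopedContext(*handle, std::move(holder));
  gpuAlgo_.makeTracksAsync(ctx.get(*handle), ctx.stream());
}

void FooProducerCUDA::produce(edm::Event& iEvent, const edm::EventSetup&) {
  // Read the input again to construct a CUDAScopedContext in produce()
  edm::Handle<CUDA<GPUClusters> > handle;
  iEvent.getByToken(token_, handle);
  auto ctx = CUDAScopedContext(*handle);
  iEvent.put(ctx.wrap(gpuAlgo_.getTracks()));
}
```
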
### Transferring GPU data to CPU
The GPU->CPU data transfer needs synchronization to ensure that the
CPU memory has all the data before it is put into the event. This
means that `ExternalWork` needs to be used along the following lines
(see also the sketch after the list):
* In `acquire()`
  * (allocate CPU memory buffers)
  * queue all GPU->CPU transfers asynchronously
* In `produce()`
  * if needed, read additional CPU products (e.g. from `edm::Ref`s)
  * reformat the data back to the legacy data formats
  * note: a `CUDAScopedContext` is **not** needed in `produce()`
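
A sketch of this pattern (the legacy product, the host buffer, and the
`stream().id()` raw-stream accessor of cuda-api-wrappers are
assumptions; pinned host memory would be preferable for truly
asynchronous copies):

```cpp
void FooProducerCUDAtoCPU::acquire(const edm::Event& iEvent, const edm::EventSetup&,
                                   edm::WaitingTaskWithArenaHolder holder) {
  edm::Handle<CUDA<GPUClusters> > handle;
  iEvent.getByToken(token_, handle);
  auto ctx = CUDAScopedContext(*handle, std::move(holder));
  GPUClusters const& clus = ctx.get(*handle);

  // Queue the transfer; ExternalWork signals completion via the callback
  hostX_.resize(clus.nClusters);
  cudaMemcpyAsync(hostX_.data(), clus.x, clus.nClusters * sizeof(float),
                  cudaMemcpyDeviceToHost, ctx.stream().id());
}

void FooProducerCUDAtoCPU::produce(edm::Event& iEvent, const edm::EventSetup&) {
  // The transfer has completed by now; no CUDAScopedContext needed
  iEvent.put(std::make_unique<LegacyClusters>(hostX_.begin(), hostX_.end()));
}
```
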
### Synchronizing between CUDA streams
In case the producer needs input data that were produced in two (or
more) CUDA streams, these streams have to be synchronized (since the
CMSSW framework no longer guarantees the synchronization as was the
case with `HeterogeneousEDProducer`). Here this synchronization is
achieved with CUDA events.

Each `CUDA<T>` contains also a CUDA event object. The call to
`CUDAScopedContext::wrap()` will *record* the event in the CUDA stream.
This means that when all work queued to the CUDA stream up to that
point has finished, the CUDA event becomes *occurred*. Then, in
`CUDAScopedContext::get()`, if the `CUDA<T>` to get from has a
different CUDA stream than the `CUDAScopedContext`,
`cudaStreamWaitEvent(stream, event)` is called. This means that all
subsequent work queued to the CUDA stream will wait for the CUDA event
to become occurred, and can therefore assume that the CUDA product
being obtained exists.
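
Conceptually the `wrap()`/`get()` pair behaves along the lines of the
following raw CUDA runtime calls (a simplified sketch, not the actual
implementation):

```cpp
// In CUDAScopedContext::wrap(): record the event in the producing stream
cudaEventRecord(event, producerStream);

// In CUDAScopedContext::get(), when the consumer uses a different CUDA
// stream: make all subsequent work queued to it wait for the event
cudaStreamWaitEvent(consumerStream, event, 0);
```
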
## Configuration
```python
process.fooCPU = cms.EDProducer("FooProducer") # legacy CPU
process.fooCUDADevice = cms.EDProducer("CUDADeviceChooser")
process.fooCUDADeviceFilter = cms.EDFilter("CUDADeviceFilter",
    src = cms.InputTag("fooCUDADevice")
)
process.fooCUDA = cms.EDProducer("FooProducerCUDA")
process.fooFromCUDA = cms.EDProducer("FooProducerCUDAtoCPU",
    src = cms.InputTag("fooCUDA")
)
process.foo = cms.EDProducer("FooProducerFallback",
    src = cms.VInputTag("fooFromCUDA", "fooCPU")
)
process.fooPathCUDA = cms.Path(
    process.fooCUDADeviceFilter + process.fooCUDA
)
process.fooPathCPU = cms.Path(
    ~process.fooCUDADeviceFilter + process.fooCPU
)
process.fooTask = cms.Task(
    process.fooCUDADevice,
    process.fooFromCUDA,
    process.foo
)
...
```
For a more complete example, see [here](test/testCUDA_cfg.py).
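
The `FooProducerFallback` above is assumed to pick the first of its
inputs that exists in the event; a minimal sketch (`FooProduct` and
the `tokens_` member are illustrative):

```cpp
void FooProducerFallback::produce(edm::Event& iEvent, const edm::EventSetup&) {
  // tokens_ corresponds to the VInputTag; take the first valid product
  for (auto const& token : tokens_) {
    edm::Handle<FooProduct> handle;
    iEvent.getByToken(token, handle);
    if (handle.isValid()) {
      iEvent.put(std::make_unique<FooProduct>(*handle));
      return;
    }
  }
  throw cms::Exception("ProductNotFound") << "No input product was found";
}
```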

# Extension to other devices

The C++ side extends in a straightforward way: one has to add classes
similar to `CUDAToken`, `CUDA<T>`, and `CUDAScopedContext`. Of course,
much depends on the exact details. The Python configuration side
extends as well; one "just" has to add more modules there. The
device-choosing logic is also extendable:
```python
process.fooCUDADevice = ...
process.fooFPGADevice = ...
process.fooPathCUDA = cms.Path(
    process.fooCUDADeviceFilter + ...
)
process.fooPathFPGA = cms.Path(
    ~process.fooCUDADeviceFilter + process.fooFPGADeviceFilter + ...
)
process.fooPathCPU = cms.Path(
    ~process.fooCUDADeviceFilter + ~process.fooFPGADeviceFilter + ...
)
```
77 changes: 77 additions & 0 deletions HeterogeneousCore/CUDACore/interface/CUDA.h
@@ -0,0 +1,77 @@
#ifndef HeterogeneousCore_CUDACore_CUDA_h
#define HeterogeneousCore_CUDACore_CUDA_h

#include <memory>
#include <optional>

#include <cuda/api_wrappers.h>

/**
* The purpose of this class is to wrap CUDA data to edm::Event in a
* way which forces correct use of various utilities.
*
* The non-default construction has to be done with CUDAScopedContext
* (in order to properly register the CUDA event).
*
* The default constructor is needed only for the ROOT dictionary generation.
*
* The CUDA event is in practice needed only for stream-stream
* synchronization, but someone with long-enough lifetime has to own
* it. Here is a somewhat natural place. If overhead is too much, we
* can e.g. make CUDAService own them (creating them on demand) and
* use them only where synchronization between streams is needed.
*/
template <typename T>
class CUDA {
public:
  CUDA() = default; // Needed only for ROOT dictionary generation

  CUDA(const CUDA&) = delete;
  CUDA& operator=(const CUDA&) = delete;
  CUDA(CUDA&&) = default;
  CUDA& operator=(CUDA&&) = default;

  bool isValid() const { return streamEvent_.get() != nullptr; }

  int device() const { return device_; }

  const cuda::stream_t<>& stream() const { return streamEvent_->stream; }
  cuda::stream_t<>& stream() { return streamEvent_->stream; }

  const cuda::event_t& event() const { return streamEvent_->event; }
  cuda::event_t& event() { return streamEvent_->event; }

private:
  friend class CUDAScopedContext;
  friend class TestCUDA;

  template <typename TokenOrContext>
  explicit CUDA(T data, const TokenOrContext& token):
    streamEvent_(std::make_unique<StreamEvent>(token)),
    data_(std::move(data)),
    device_(token.device())
  {}

  // Using unique_ptr to support the default constructor. Tried
  // std::optional, but cuda::stream_t and cuda::event_t have their
  // move assignment operators deleted. Use a struct to save one
  // memory allocation.
public: // needs to be public for ROOT dictionary generation?
  struct StreamEvent {
    template <typename TokenOrContext>
    explicit StreamEvent(const TokenOrContext& token):
      stream(token.stream()),
      event(cuda::event::create(token.device(),
                                cuda::event::sync_by_busy_waiting, // default; we should try to avoid explicit synchronization, so maybe the value doesn't matter much?
                                cuda::event::dont_record_timings)) // it should be a bit faster to ignore timings
    {}

    cuda::stream_t<> stream; // stream_t is just a handle, the real CUDA stream is owned by CUDAToken (with long-enough lifetime)
    cuda::event_t event;
  };
private:
  std::unique_ptr<StreamEvent> streamEvent_;
  T data_;
  int device_ = -1;
};

#endif