# Next iteration of the prototype for CMSSW interface to heterogeneous algorithms

## Introduction

The current prototype with `HeterogeneousEDProducer` and
`HeterogeneousProduct` is documented [here](../Producer/README.md).
The main differences with respect to that are
* Split device-specific code into separate EDProducers
* Plug the components together in the configuration

This page documents the CUDA integration and briefly discusses how to
extend it to other devices. It will be extended if/when it gets
deployed and `HeterogeneousEDProducer` is retired.

## Sub-packages
* [`CUDACore`](#cuda-integration) CUDA-specific core components
* [`CUDAServices`](../CUDAServices) Various `edm::Service`s related to CUDA
* [`CUDAUtilities`](../CUDAUtilities) Various utilities for CUDA kernel code

# CUDA integration

## Choosing the device

The device choosing logic is split between an EDProducer, an EDFilter,
and the use of Paths in the configuration.

First, a `CUDADeviceChooser` EDProducer is run. It has the logic to
decide whether the following chain of EDModules should run on a CUDA
device or not, and if yes, on which CUDA device. If it decides "yes",
it produces a `CUDAToken`, which contains the device id and a CUDA
stream. If it decides "no", it does not produce anything.

The next step is a `CUDADeviceFilter` EDFilter. It checks whether the
`CUDADeviceChooser` produced a product or not. If "yes", it returns
`true`, and if "no", it returns `false`.

Finally, the pieces need to be put together in the configuration. The
`CUDADeviceChooser` can be "anywhere", but the `CUDADeviceFilter`
should be the first module on a `cms.Path`, followed by the CUDA
EDProducers (in the future it may become sufficient to have only the
first EDProducer of a chain in the `Path`).
```python
process.fooCUDADevice = cms.EDProducer("CUDADeviceChooser")
process.fooCUDADeviceFilter = cms.EDFilter("CUDADeviceFilter",
    src = cms.InputTag("fooCUDADevice")
)
process.fooCUDA = cms.EDProducer("FooProducerCUDA")
process.fooPathCUDA = cms.Path(
    process.fooCUDADeviceFilter + process.fooCUDA
)
process.fooTask = cms.Task(
    process.fooCUDADevice
)
```

## Data model

The GPU data can be a single pointer to device data, or a class/struct
containing such pointers (among other things). When putting the data
into the event, the data is wrapped in the `CUDA<T>` template, which holds
* the GPU data
  * must be movable, but there are no other restrictions (except that ROOT dictionaries need to be generated from it)
* the current device where the data was produced, and the CUDA stream the data was produced with
* a [CUDA event for synchronization between multiple CUDA streams](#synchronizing-between-cuda-streams)

Note that the `CUDA<T>` wrapper can be constructed only with
`CUDAScopedContext::wrap()`, and the data `T` can be obtained from it
only with `CUDAScopedContext::get()`, as described further below.
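
For concreteness, the `GPUClusters` product used in the examples below
could look roughly like the following. This is only an illustrative
sketch (the struct and its members are assumptions, not part of the
framework):

```cpp
// Illustrative sketch of a GPU data product (all names are assumptions).
// The only framework requirements are that it is movable and that a
// ROOT dictionary can be generated for it.
struct GPUClusters {
  float* x = nullptr;               // device pointer to cluster x positions
  float* y = nullptr;               // device pointer to cluster y positions
  uint32_t* clusterIndex = nullptr; // device pointer to cluster indices
  uint32_t nClusters = 0;           // number of clusters, known on the CPU
};
```

In the event, such a product would then appear as `CUDA<GPUClusters>`.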

## CUDA EDProducer

### Class declaration

For the time being (this may disappear in the future), a CUDA producer
should inherit from `CUDAStreamEDProducer<...>`. The template
parameters are the usual
[stream producer extensions](https://twiki.cern.ch/twiki/bin/view/CMSPublic/FWMultithreadedFrameworkStreamModuleInterface#Template_Arguments).
Note that contrary to `HeterogeneousEDProducer`, the `ExternalWork`
extension is **not** implied.

```cpp
#include "HeterogeneousCore/CUDACore/interface/CUDAStreamEDProducer.h"

class FooProducerCUDA: public CUDAStreamEDProducer<> {
  ...
```
### Memory allocation

The only effect of the `CUDAStreamEDProducer` base class is that
`beginStream(edm::StreamID)` is replaced with
`beginStreamCUDA(edm::StreamID)`. This is done in order to set the
current CUDA device before the user code starts. **If the algorithm
has to allocate memory buffers for the duration of the whole job, the
recommended place is here.** Note that a CUDA stream is not passed to
the user code. If a CUDA stream is really needed, the developer should
create and synchronize it him/herself. (Although if this appears to be
common practice, we should try to support this case somehow.)
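
A minimal sketch of such a job-long allocation, assuming a
hypothetical `FooProducerCUDA` with a fixed maximum buffer size (the
member names and the `endStream()` cleanup are assumptions):

```cpp
#include "HeterogeneousCore/CUDACore/interface/CUDAStreamEDProducer.h"

#include <cuda_runtime.h>

class FooProducerCUDA: public CUDAStreamEDProducer<> {
public:
  void beginStreamCUDA(edm::StreamID id) override {
    // The current CUDA device has already been set by the framework,
    // so a plain cudaMalloc() allocates on the right device.
    cudaMalloc(&d_workspace_, kMaxClusters * sizeof(float));
  }

  void endStream() override {
    cudaFree(d_workspace_);
  }

private:
  static constexpr size_t kMaxClusters = 32 * 1024; // assumed upper bound
  float* d_workspace_ = nullptr; // device scratch buffer owned by this module instance
};
```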
### Setting the current device

A CUDA producer should read either a `CUDAToken` (from
`CUDADeviceChooser`) or one or more `CUDA<T>` products. Then, in
`acquire()`/`produce()`, it should construct a `CUDAScopedContext` from
one of them

```cpp
// From CUDAToken
edm::Handle<CUDAToken> htoken;
iEvent.getByToken(srcToken_, htoken);
auto ctx = CUDAScopedContext(*htoken);

// From CUDA<T>
edm::Handle<CUDA<GPUClusters>> handle;
iEvent.getByToken(srcToken_, handle);
auto ctx = CUDAScopedContext(*handle);
```

`CUDAScopedContext` works in the RAII way and does the following
* Sets the current device (for the scope) from the `CUDAToken`/`CUDA<T>`
* Gives access to the CUDA stream the algorithm should use to queue asynchronous work
* Calls `edm::WaitingTaskWithArenaHolder::doneWaiting()` when necessary
* [Synchronizes between CUDA streams if necessary](#synchronizing-between-cuda-streams)
* Is needed to get/put `CUDA<T>` from/to the event

In case of multiple input products, from possibly different CUDA
streams and/or CUDA devices, this approach gives the developer full
control over which of them the kernels of the algorithm are run in.
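
For example (a sketch; `GPUDigis` and the token names are assumptions),
with two GPU inputs one could run the kernels in the stream of the
first input and let `get()` synchronize the second one to it:

```cpp
// Sketch: two inputs that may come from different CUDA streams/devices.
edm::Handle<CUDA<GPUClusters>> hclus;
edm::Handle<CUDA<GPUDigis>> hdigis;  // GPUDigis: hypothetical second product
iEvent.getByToken(srcToken1_, hclus);
iEvent.getByToken(srcToken2_, hdigis);

// The kernels of this producer will run on the device and CUDA stream
// of hclus; ctx.get(*hdigis) synchronizes hdigis' stream to it if needed.
auto ctx = CUDAScopedContext(*hclus);
GPUClusters const& clus = ctx.get(*hclus);
GPUDigis const& digis = ctx.get(*hdigis);
```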

### Getting input

The real product (`T`) can be obtained from `CUDA<T>` only with the
help of `CUDAScopedContext`.

```cpp
edm::Handle<CUDA<GPUClusters>> hclus;
iEvent.getByToken(srcToken_, hclus);
GPUClusters const& clus = ctx.get(*hclus);
```

This step is needed to
* check that the data are on the same CUDA device
  * if not, throw an exception (with unified memory we could prefetch instead)
* if the CUDA streams are different, synchronize between them

### Calling the CUDA kernels

There is nothing special about it, except that the CUDA stream can be
obtained from the `CUDAScopedContext`

```cpp
gpuAlgo.makeClustersAsync(..., ctx.stream());
```
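
For illustration, the algorithm side could queue its kernel on that
stream roughly as follows. This is only a sketch: `FooGPUAlgo`, the
kernel, and the input type are assumptions, and `stream.id()` assumes
the raw-handle accessor of the `cuda::stream_t<>` wrapper used here:

```cpp
// Sketch of the algorithm side: launch the kernel asynchronously on the
// CUDA stream obtained from CUDAScopedContext (all names are illustrative).
void FooGPUAlgo::makeClustersAsync(InputData const& input, cuda::stream_t<>& stream) {
  const int threads = 256;
  const int blocks = (input.size + threads - 1) / threads;
  // stream.id() gives the raw cudaStream_t handle of the wrapper
  kernel_makeClusters<<<blocks, threads, 0, stream.id()>>>(d_input_, d_output_, input.size);
}
```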

### Putting output

The GPU data needs to be wrapped in the `CUDA<T>` template with `CUDAScopedContext::wrap()`

```cpp
GPUClusters clusters = gpuAlgo.makeClustersAsync(..., ctx.stream());
std::unique_ptr<CUDA<GPUClusters>> ret = ctx.wrap(clusters);
iEvent.put(std::move(ret));

// or with one line
iEvent.put(ctx.wrap(gpuAlgo.makeClustersAsync(..., ctx.stream())));
```

This step is needed to
* store the current device and CUDA stream into `CUDA<T>`
* record the CUDA event needed for CUDA stream synchronization

### `ExternalWork` extension

Everything above works both with and without `ExternalWork`.

Without `ExternalWork` the `EDProducer`s act similarly to a TBB
flowgraph "streaming node", i.e. they just queue more asynchronous
work in their `produce()`.

`ExternalWork` is needed when one would otherwise call
`cudaStreamSynchronize()`, e.g. to transfer something to the CPU for
downstream DQM, or to queue more asynchronous work. With
`ExternalWork` an `acquire()` method needs to be implemented that gets
an `edm::WaitingTaskWithArenaHolder` parameter. The
`WaitingTaskWithArenaHolder` should then be passed to the constructor
of `CUDAScopedContext`

```cpp
void acquire(..., edm::WaitingTaskWithArenaHolder waitingTaskHolder) {
  edm::Handle<CUDA<GPUClusters>> handle;
  iEvent.getByToken(token_, handle);
  // can also copy instead of move if waitingTaskHolder is needed for something else as well
  auto ctx = CUDAScopedContext(*handle, std::move(waitingTaskHolder));
  ...
```

When constructed this way, `CUDAScopedContext` registers a callback
function to the CUDA stream in its destructor to call
`waitingTaskHolder.doneWaiting()`.

A GPU->GPU producer needs a `CUDAScopedContext` also in its
`produce()`. Currently the best way is to read the input again in
`produce()` and construct the `CUDAScopedContext` from there. This
point will be improved.
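
A sketch of such a `produce()`, assuming the asynchronous work was
queued in `acquire()` and its result stored in a member variable
(`output_` and `token_` are illustrative names):

```cpp
// Sketch of the produce() of a GPU->GPU producer using ExternalWork.
// The input is read again only to re-construct the CUDAScopedContext
// with the same device and CUDA stream as in acquire().
void FooProducerCUDA::produce(edm::Event& iEvent, const edm::EventSetup& iSetup) {
  edm::Handle<CUDA<GPUClusters>> handle;
  iEvent.getByToken(token_, handle);
  auto ctx = CUDAScopedContext(*handle);
  // output_ (a member variable) was filled asynchronously in acquire();
  // by the time produce() is called the framework guarantees that the
  // queued work has finished
  iEvent.put(ctx.wrap(std::move(output_)));
}
```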

### Transferring GPU data to CPU

The GPU->CPU data transfer needs synchronization to ensure that the
CPU memory has all the data before it is put into the event. This
means that `ExternalWork` needs to be used, along the following lines
(see the sketch after this list)
* In `acquire()`
  * (allocate CPU memory buffers)
  * Queue all GPU->CPU transfers asynchronously
* In `produce()`
  * If needed, read additional CPU products (e.g. from `edm::Ref`s)
  * Reformat the data back to legacy data formats
  * Note: `CUDAScopedContext` is **not** needed in `produce()`
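
A sketch of such a producer; `FooProducerCUDAtoCPU`, `LegacyClusters`,
and the member names are illustrative assumptions, and `stream().id()`
assumes the raw-handle accessor of the stream wrapper:

```cpp
// Sketch of a GPU->CPU transfer producer using ExternalWork.
void FooProducerCUDAtoCPU::acquire(edm::Event const& iEvent, edm::EventSetup const& iSetup,
                                   edm::WaitingTaskWithArenaHolder waitingTaskHolder) {
  edm::Handle<CUDA<GPUClusters>> hclus;
  iEvent.getByToken(token_, hclus);
  auto ctx = CUDAScopedContext(*hclus, std::move(waitingTaskHolder));
  GPUClusters const& clus = ctx.get(*hclus);

  // allocate the CPU buffer and queue the transfer asynchronously
  nClusters_ = clus.nClusters;
  hostX_ = std::make_unique<float[]>(nClusters_);
  cudaMemcpyAsync(hostX_.get(), clus.x, nClusters_ * sizeof(float),
                  cudaMemcpyDeviceToHost, ctx.stream().id());
}

void FooProducerCUDAtoCPU::produce(edm::Event& iEvent, edm::EventSetup const& iSetup) {
  // The framework calls produce() only after the transfers have
  // finished; no CUDAScopedContext is needed here.
  iEvent.put(std::make_unique<LegacyClusters>(hostX_.get(), nClusters_));
}
```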

### Synchronizing between CUDA streams

In case the producer needs input data that were produced in two (or
more) CUDA streams, these streams have to be synchronized (since the
CMSSW framework no longer guarantees the synchronization, as was the
case with `HeterogeneousEDProducer`). Here this synchronization is
achieved with CUDA events.

Each `CUDA<T>` also contains a CUDA event object. The call to
`CUDAScopedContext::wrap()` will *record* the event in the CUDA stream.
This means that when all work queued to the CUDA stream up to that
point has finished, the CUDA event becomes *occurred*. Then, in
`CUDAScopedContext::get()`, if the `CUDA<T>` to get from has a
different CUDA stream than the `CUDAScopedContext`,
`cudaStreamWaitEvent(stream, event)` is called. This means that all
subsequent work queued to the CUDA stream will wait for the CUDA event
to become occurred. Therefore this subsequent work can assume that the
CUDA product being retrieved exists.
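
In raw CUDA terms, the pattern implemented by `wrap()`/`get()`
corresponds roughly to the following sketch:

```cpp
// In wrap(): mark the point in the producer's stream where the product is ready
cudaEventRecord(event, producerStream);

// In get(), when the consumer uses a different CUDA stream: make all
// subsequent work in the consumer stream wait for the event to occur
cudaStreamWaitEvent(consumerStream, event, 0);
```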

## Configuration

```python
process.fooCPU = cms.EDProducer("FooProducer") # legacy CPU
process.fooCUDADevice = cms.EDProducer("CUDADeviceChooser")
process.fooCUDADeviceFilter = cms.EDFilter("CUDADeviceFilter",
    src = cms.InputTag("fooCUDADevice")
)
process.fooCUDA = cms.EDProducer("FooProducerCUDA")
process.fooFromCUDA = cms.EDProducer("FooProducerCUDAtoCPU",
    src = cms.InputTag("fooCUDA")
)
process.foo = cms.EDProducer("FooProducerFallback",
    src = cms.VInputTag("fooFromCUDA", "fooCPU")
)
process.fooPathCUDA = cms.Path(
    process.fooCUDADeviceFilter + process.fooCUDA
)
process.fooPathCPU = cms.Path(
    ~process.fooCUDADeviceFilter + process.fooCPU
)
process.fooTask = cms.Task(
    process.fooCUDADevice,
    process.fooFromCUDA,
    process.foo
)
...
```

For a more complete example, see [here](test/testCUDA_cfg.py).

# Extension to other devices

The C++ side extends in a straightforward way. One has to add classes
similar to `CUDAToken`, `CUDA<T>`, and `CUDAScopedContext`. Of course,
much depends on the exact details. The python configuration side
extends as well; one "just" has to add more modules there. The device
choosing logic is also extendable

```python
process.fooCUDADevice = ...
process.fooFPGADevice = ...
process.fooPathCUDA = cms.Path(
    process.fooCUDADeviceFilter + ...
)
process.fooPathFPGA = cms.Path(
    ~process.fooCUDADeviceFilter + process.fooFPGADeviceFilter + ...
)
process.fooPathCPU = cms.Path(
    ~process.fooCUDADeviceFilter + ~process.fooFPGADeviceFilter + ...
)
```

# `CUDA<T>` wrapper class (`HeterogeneousCore/CUDACore/interface/CUDA.h`)

```cpp
#ifndef HeterogeneousCore_CUDACore_CUDA_h
#define HeterogeneousCore_CUDACore_CUDA_h

#include <memory>

#include <cuda/api_wrappers.h>

/**
 * The purpose of this class is to wrap CUDA data to edm::Event in a
 * way which forces correct use of various utilities.
 *
 * The non-default construction has to be done with CUDAScopedContext
 * (in order to properly register the CUDA event).
 *
 * The default constructor is needed only for the ROOT dictionary generation.
 *
 * The CUDA event is in practice needed only for stream-stream
 * synchronization, but someone with a long-enough lifetime has to own
 * it. Here is a somewhat natural place. If the overhead is too much, we
 * can e.g. make CUDAService own them (creating them on demand) and
 * use them only where synchronization between streams is needed.
 */
template <typename T>
class CUDA {
public:
  CUDA() = default; // Needed only for ROOT dictionary generation

  CUDA(const CUDA&) = delete;
  CUDA& operator=(const CUDA&) = delete;
  CUDA(CUDA&&) = default;
  CUDA& operator=(CUDA&&) = default;

  bool isValid() const { return streamEvent_.get() != nullptr; }

  int device() const { return device_; }

  const cuda::stream_t<>& stream() const { return streamEvent_->stream; }
  cuda::stream_t<>& stream() { return streamEvent_->stream; }

  const cuda::event_t& event() const { return streamEvent_->event; }
  cuda::event_t& event() { return streamEvent_->event; }

private:
  friend class CUDAScopedContext;
  friend class TestCUDA;

  template <typename TokenOrContext>
  explicit CUDA(T data, const TokenOrContext& token):
    streamEvent_(std::make_unique<StreamEvent>(token)),
    data_(std::move(data)),
    device_(token.device())
  {}

  // Using unique_ptr to support the default constructor. Tried
  // std::optional, but cuda::stream_t and cuda::event_t have their
  // move assignment operators deleted. Use a struct to save one
  // memory allocation.
public: // needs to be public for ROOT dictionary generation?
  struct StreamEvent {
    template <typename TokenOrContext>
    explicit StreamEvent(const TokenOrContext& token):
      stream(token.stream()),
      event(cuda::event::create(token.device(),
                                cuda::event::sync_by_busy_waiting,  // default; we should try to avoid explicit synchronization, so maybe the value doesn't matter much?
                                cuda::event::dont_record_timings))  // it should be a bit faster to ignore timings
    {}

    cuda::stream_t<> stream; // stream_t is just a handle, the real CUDA stream is owned by CUDAToken (with long-enough life time)
    cuda::event_t event;
  };
private:
  std::unique_ptr<StreamEvent> streamEvent_;
  T data_;
  int device_ = -1;
};

#endif
```