Next prototype of the framework integration #100

Merged

Conversation

makortel

@makortel makortel commented Jul 23, 2018

Despite my original plan of not proceeding with the framework side before the demonstrator, here is a prototype of the CUDA algorithm integration, based on my discussions with @Dr15Jones and @wddgit. See the included README.md for more technical details.

I'm marking the PR as RFC because we first need to discuss the details and understand whether it could make sense to deploy it already for the demonstrator. Otherwise, the PR can serve as a discussion forum on the topic until the demonstrator is finished.

  • Pros wrt. HeterogeneousEDProducer
    • Very flexible
    • Provides "streaming node" functionality out of the box
    • GPU->CPU transfers are done on demand, also when edm::Refs need to be made
    • Allows running both CPU and CUDA versions of the algorithms in the same job
      • E.g. for validation/debugging
    • Simpler code (both infrastructure and use)
  • Cons
    • More verbose and more boilerplate
    • Need to add cms.Paths in the cff files (not needed with SwitchProducer)
    • Duplication in module customizations

Fixes #133.

@felicepantaleo @fwyzard @VinInn @rovere

@makortel
Author

makortel commented Jul 24, 2018

Summarizing here the outcome of today's meeting (*). We chose to continue with the HeterogeneousEDProducer for the demonstrator, and to leave this PR open for discussion (and possible further development) in the meantime. Questions raised:

  1. Can CUDADeviceChooser and CUDADeviceFilter be combined?
  2. The configuration side is awful, especially for the end developer. One way to seek improvement would be to first think about how one would like to configure things, and then think about how to implement that.
    • The main problem lies in all the boilerplate of the pattern needed for "GPU or CPU"
  3. Adding new Paths in the configuration doesn't (currently) work with HLT
    • HLT is still run in scheduled mode (with process.Schedule and Paths containing all producers)
    • The configuration editing system does not currently support Tasks etc.
    • General dislike of non-physics Paths

(*) https://indico.cern.ch/event/746161/contributions/3084531/attachments/1692036/2722511/slides_mk_20180724.pdf

@makortel
Author

makortel commented Aug 7, 2018

Rebased on top of head of CMSSW_10_2_X_Patatrack (c2aba96).

Regarding point 1 in #100 (comment), the choice of splitting the logic into an EDProducer and an EDFilter was based on earlier experience that combining them usually (though not always) leads to problems. Thinking further about this particular case:

  • the producer+filter functionality is "instruct the downstream to run on GPU if possible, otherwise on CPU", while
  • the producer-only functionality is "instruct the downstream to run on GPU; if that is not possible, throw an error"

so it seems that the producer side for the two cases is fundamentally different. Therefore I added a commit toying with the idea of providing

  • CUDADeviceChooserFilter: if it decides to run on the GPU, return true and produce a CUDAToken; otherwise return false and produce nothing
  • CUDADeviceChooserProducer: if it decides to run on the GPU, produce a CUDAToken; otherwise throw an exception

In short, yes, the producer and the filter can be combined (and it makes sense to combine them) for the case where/if we want to be able to dynamically decide whether a chain of CUDA EDModules should run on a GPU or on a CPU. The two flavours are sketched below.
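
To make the distinction concrete, here is a minimal sketch in plain CUDA (not the actual code in this PR): the `CUDAToken` stand-in and the device-choice policy are placeholders, and the two functions mimic the filter and producer flavours described above.

```cpp
// Minimal sketch of the shared decision logic (plain CUDA runtime API).
// "CUDAToken" and the device-choice policy are placeholders, not the PR's code.
#include <cuda_runtime.h>
#include <optional>
#include <stdexcept>

struct CUDAToken {        // hypothetical stand-in for the PR's CUDAToken product
  int device;
  cudaStream_t stream;
};

std::optional<CUDAToken> chooseDevice() {
  int nDevices = 0;
  if (cudaGetDeviceCount(&nDevices) != cudaSuccess || nDevices == 0)
    return std::nullopt;                 // no usable GPU
  int device = 0;                        // placeholder policy: always device 0
  cudaSetDevice(device);
  cudaStream_t stream;
  cudaStreamCreate(&stream);             // one CUDA stream for the module chain
  return CUDAToken{device, stream};
}

// Filter flavour: "run on GPU if possible, otherwise fall back to CPU"
bool chooserFilter() {
  auto token = chooseDevice();
  if (!token)
    return false;                        // downstream CUDA chain is skipped
  // ... put *token into the event ...
  return true;
}

// Producer flavour: "run on GPU, or fail loudly"
CUDAToken chooserProducer() {
  auto token = chooseDevice();
  if (!token)
    throw std::runtime_error("No CUDA device available");
  return *token;
}
```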

@makortel
Author

makortel commented Aug 13, 2018

Some random thoughts about CUDA streams

  • I'm now starting to think that it would probably be better for CUDADeviceChooserFilter and CUDADeviceChooserProducer to own the CUDA stream (one per EDM stream)
    • In principle a tiny bit faster, since the streams are not created and destroyed for each event and "chain of CUDA modules"
    • My main motivation comes from profiling, though. I believe it would be clearer in nvvp if a single CUDA stream id were always associated with the same EDM stream and the same chain of modules (across events)
  • I believe (it needs to be tested, of course) that overlapping the GPU->CPU transfer with kernels (on the same "computation CUDA stream") using another CUDA stream comes out in a straightforward way. We just need to add a variant of CUDADeviceChooserProducer (named e.g. CUDAStreamInDevice) that reads a CUDAToken and produces a new CUDAToken on the same CUDA device but with a new CUDA stream (see the sketch after this list)
  • We are currently using CUDA streams in beginStream() (where we do the block memory allocations) to "asynchronously" set memory or transfer some constant data. I used quotes because, given the "global synchronization" nature of cudaMalloc, I'm not sure how much we actually benefit from the CUDA streams there, and whether it would be good enough to do all memsets and transfers there synchronously (they happen only once per job anyway).
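
As a rough illustration of the first two bullets, a minimal sketch in plain CUDA (class and function names are placeholders, not the PR's modules): the chooser owns one CUDA stream per EDM stream for the whole job, and an extra stream can be created on the same device, e.g. to overlap GPU->CPU transfers with kernels.

```cpp
#include <cuda_runtime.h>

// One CUDA stream per EDM stream, owned for the lifetime of the job.
class PerEDMStreamChooser {
public:
  explicit PerEDMStreamChooser(int device) : device_(device) {
    cudaSetDevice(device_);
    cudaStreamCreate(&stream_);   // created once, reused for every event
  }
  ~PerEDMStreamChooser() { cudaStreamDestroy(stream_); }
  int device() const { return device_; }
  cudaStream_t stream() const { return stream_; }

private:
  int device_;
  cudaStream_t stream_;
};

// What a "CUDAStreamInDevice"-like module would do: same device, new stream.
cudaStream_t makeExtraStreamOnDevice(int device) {
  cudaSetDevice(device);
  cudaStream_t extra;
  cudaStreamCreate(&extra);
  return extra;                    // would be wrapped into a new CUDAToken
}
```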

@makortel
Author

Except that the first two (a CUDA stream owned by a module, and the ability to create additional CUDA streams on a given device) are in conflict, because (in general) I can't communicate the chosen device from one module to the beginStream() of another, and in any case the first bullet assumes the current EDM stream -> CUDA device mapping.

@fwyzard

fwyzard commented Aug 14, 2018

@makortel

I believe (it needs to be tested, of course) that overlapping the GPU->CPU transfer with kernels (on the same "computation CUDA stream") using another CUDA stream comes out in a straightforward way.

I am not sure I understand this point. Are you suggesting using one CUDA stream to compute, and a separate CUDA stream to transfer the results from the GPU to the CPU?

Do we submit the kernels in acquire() and run the transfer in produce(), relying on the explicit synchronisation from the framework (produce() runs after the callback from acquire())?

Or do we submit them all in acquire(), but interleave them with CUDA events to enforce that the transfer waits for the kernel to have completed?

I think the former is easier - but what do we gain from reusing the same CUDA stream for the chain of modules?

@makortel
Author

@fwyzard

I am not sure I understand this point. Are you suggesting using one CUDA stream to compute, and a separate CUDA stream to transfer the results from the GPU to the CPU?

To my understanding, that is the standard "trick" to compute and transfer data in parallel. My main motivation was to think about how that could be done within the context of this PR (regardless of whether we want to do it or not).

There would be some benefits (even under the assumption that "we achieve the parallelism with EDM streams"):

  • it exposes more parallel work for the GPU/driver
  • on-demand-scheduled GPU->CPU transfers would not incur additional latency on the "compute stream" of a chain of GPU modules

but what do we gain from reusing the same CUDA stream for the chain of modules?

We get behaviour equivalent to TBB flow graph's "streaming_node". I.e., if an EDProducer does not have to transfer anything back to the CPU for subsequent work (like the number of digis/clusters/hits/quadruplets), it can be a regular EDProducer that just queues more kernels to the CUDA stream (sketched below). The performance benefit would come from running the GPU computations in parallel with the "framework overhead".

The "streaming_node" is not enforced though, it just emerges automatically if an EDProducer meets the necessary constraints.

Do we submit the kernels in acquire() and run the transfer in produce(), relying on the explicit synchronisation from the framework (produce() runs after the callback from acquire())?

Or do we submit them all in acquire(), but interleave them with CUDA events to enforce that the transfer waits for the kernel to have completed?

Closer to the latter. As a concrete example, let's take raw2cluster. The chain of events would be the following:

  1. The raw2cluster EDProducer acquire() queues all kernels to the CUDA stream (which it got from the input CUDAToken)
  2. The raw2cluster EDProducer acquire() queues the transfer of the number of active modules and the number of clusters from the GPU to the CPU
    • needed by subsequent modules for their kernel launches
    • if subsequent modules did not need them, raw2cluster could be a regular EDProducer and queue all its work in produce()
  3. The raw2cluster produce() puts a CUDA<T> in the event containing all the pointers to GPU memory and the two numbers mentioned in point 2
  4. rechit etc. queue their work
  5. The cluster GPU->CPU transfer EDProducer acquire() queues all GPU->CPU transfers for the clusters
    • this module is run only if some module consumes() the CPU clusters
  6. The cluster GPU->CPU transfer EDProducer produce() converts the CPU SoA to the legacy formats
    • or it puts the CPU SoA in the event, and yet another EDProducer does the conversion to the legacy format

If points 4 and 5 use the same CUDA stream, they will run serially (5 gets inserted somewhere in the middle of the subsequent work of 4, or after it). They can be made to run in parallel by introducing an additional CUDA stream; the mechanism I described on slide 15 of #100 (comment) will then take care of the synchronization with a CUDA event, as sketched below.
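
For concreteness, a minimal sketch in plain CUDA (names are placeholders; it assumes a separate "transfer" stream has been created as in the earlier sketch) of how a CUDA event makes the cluster transfers of point 5 wait for the cluster kernels of point 1 while running concurrently with the compute work of point 4:

```cpp
#include <cuda_runtime.h>
#include <cstddef>

void queueOverlappedTransfer(cudaStream_t computeStream,   // stream of points 1 and 4
                             cudaStream_t transferStream,  // extra stream for point 5
                             const void* d_clusters, void* h_clusters, std::size_t bytes) {
  cudaEvent_t clustersReady;
  cudaEventCreateWithFlags(&clustersReady, cudaEventDisableTiming);

  // Record on the compute stream right after the cluster kernels were queued.
  cudaEventRecord(clustersReady, computeStream);

  // The transfer stream waits for that event only, not for later compute work.
  cudaStreamWaitEvent(transferStream, clustersReady, 0);
  cudaMemcpyAsync(h_clusters, d_clusters, bytes, cudaMemcpyDeviceToHost, transferStream);

  // Destroying here is fine: resources are released once the recorded work completes.
  cudaEventDestroy(clustersReady);
}
```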

fwyzard pushed a commit that referenced this pull request Oct 23, 2020
Provide a mechanism for a chain of modules to share a resource, that can be e.g. CUDA device memory or a CUDA stream.
Minimize data movements between the CPU and the device, and support multiple devices.
Allow the same job configuration to be used on all hardware combinations.

See HeterogeneousCore/CUDACore/README.md for a more detailed description and examples.
makortel added a commit to makortel/cmssw that referenced this pull request Nov 20, 2020
makortel added a commit to makortel/cmssw that referenced this pull request Nov 24, 2020
fwyzard pushed a commit that referenced this pull request Nov 27, 2020
Remove SiPixelDigiHeterogeneousConverter as obsolete, should have been removed as part of #100.

Address review comments for SiPixelClustersCUDA:
  - remove commented out default constructor and private: from DeviceConstView;
    this is perhaps the best compromise between non-default constructors not
    being preferred for device allocations, and the use case in
    SiPixelRecHitSoAFromLegacy (for the expected life time of this class)
  - remove const getters with c_ prefix
  - improve constructor parameter name
  - use more initializer list
  - initialize nClusters_h

Address review comments for SiPixelDigiErrorsCUDA:
  - use type alias
  - remove const getters with c_ prefix and other unnecessary methods
  - use more initializer list

Address review comments for SiPixelDigisCUDA:
  - remove const getters with c_ prefix and other unnecessary methods
  - remove commented out default constructor and private: from DeviceConstView
  - add comments for remaining SiPixelDigisCUDA member arrays

Move PixelErrorsCompact and SiPixelDigiErrorsSoa to DataFormats/SiPixelRawData, rename classes

Address review comments for SiPixelErrorsSoA
  - remove redundant assert
  - move constructor inline

Address review comments for SiPixelDigisSoA
  - remove redundant assert
  - add comments

Enable if constexpr also for CUDA in TrackingRecHit2DHeterogeneous

Move dictionary of HostProduct<unsigned int[]> to CUDADataFormats/Common