-
Notifications
You must be signed in to change notification settings - Fork 14
Commit
This commit does not belong to any branch on this repository, and may belong to a fork outside of the repository.
- Loading branch information
Showing
1 changed file
with
314 additions
and
0 deletions.
There are no files selected for viewing
This file contains bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,314 @@ | ||
+++ | ||
title = "CUDA.jl 5.4: Memory management mayhem" | ||
author = "Tim Besard" | ||
external = true | ||
abstract = """ | ||
CUDA.jl 5.4 comes with many memory-management related changes that should improve | ||
performance of memory-heavy applications, and make it easier to work with heterogeneous | ||
set-ups involving multiple GPUs or using both the CPU and GPU.""" | ||
+++ | ||
|
||
{{abstract}} | ||
|
||
Before anything else, let's get the breaking changes out of the way. CUDA.jl v5.4 only | ||
bumps the minor version, so it should be compatible with existing codebases. However, there | ||
are a couple of API changes that, although covered by appropriate deprecation warnings, | ||
applications should be updated to: | ||
|
||
* The `CUDA.Mem` submodule has been removed. All identifiers have been moved to the parent | ||
`CUDA` submodule, with a couple being renamed in the process: | ||
- `Mem.Device` and `Mem.DeviceBuffer` have been renamed to `CUDA.DeviceMemory` (the same | ||
applies to `Mem.Host` and `Mem.Unified`); | ||
- enums from the `Mem` submodule have gained a `MEM` suffix, e.g., `Mem.ATTACH_GLOBAL` has | ||
been renamed to `CUDA.MEM_ATTACH_GLOBAL`; | ||
- `Mem.set!` has been renamed to `CUDA.memset`; | ||
- `Mem.info()` has been renamed to `CUDA.memory_info()`; | ||
* `CUDA.memory_status()` has been renamed to `CUDA.pool_status()`; | ||
- `CUDA.available_memory()` has been renamed to `CUDA.free_memory()`. | ||
|
||
The meat of this release is in the memory management improvements detailed below. These | ||
changes can have a significant impact of the performance of your application, so it's | ||
recommended to thoroughly test your application after upgrading! | ||
|
||
|
||
## Eager garbage collection | ||
|
||
Julia is a garbage collected language, which means that (GPU) allocations can fail because | ||
garbage has piled up, necessitating a collection cycle. Previous versions of CUDA.jl handled | ||
this at the allocation site, detecting out-of-memory errors and triggering the GC. This was | ||
not ideal, as it could lead to significant pauses and a bloated memory usage. | ||
|
||
To improve this, **CUDA.jl v5.4 more accurately keeps track of memory usage, and uses that | ||
information to trigger the GC early at appropriate times**, e.g., when waiting for a kernel | ||
to finish. This should lead to more predictable performance, both by distributing the cost | ||
of garbage collection over time and by potentially masking it behind other operations. | ||
|
||
For example, the following toy model implemented with Flux.jl allocates a ton of memory: | ||
|
||
```julia | ||
using CUDA, Flux | ||
using MLUtils: DataLoader | ||
|
||
n_obs = 300_000 | ||
n_feature = 1000 | ||
X = rand(n_feature, n_obs) | ||
y = rand(1, n_obs) | ||
train_data = DataLoader((X, y) |< gpu; batchsize = 2048, shuffle=false) | ||
|
||
model = Dense(n_feature, >) |< gpu | ||
loss(m, _x, _y) = Flux.Losses.mse(m(_x), _>) | ||
opt_state = Flux.setup(Flux.Adam(), model) | ||
Flux.train!(loss, model, train_data, opt_state) | ||
for epoch in 1:100 | ||
Flux.train!(loss, model, train_data, opt_state) | ||
end | ||
``` | ||
|
||
Without eager garbage collection, this leads to expensive pauses while freeing a large | ||
amount of memory at every epoch. We can simulate this by artificially limiting the memory | ||
available to the GPU, while also disabling the new eager garbage collection feature by | ||
setting the `JULIA_CUDA_GC_EARLY` environment variable to `false` (this is a temporary knob | ||
that will be removed in the future, but may be useful now for evaluating the new feature): | ||
|
||
```text | ||
❯ JULIA_CUDA_GC_EARLY=false JULIA_CUDA_HARD_MEMORY_LIMIT=4GiB \ | ||
julia --project train.jl | ||
... | ||
[ Info: Epoch 90 train time 0.031s | ||
retry_reclaim: freed 2.865 GiB | ||
[ Info: Epoch 91 train time 0.031s | ||
[ Info: Epoch 92 train time 0.027s | ||
retry_reclaim: freed 2.865 GiB | ||
[ Info: Epoch 93 train time 0.03s | ||
retry_reclaim: freed 2.873 GiB | ||
[ Info: Epoch 94 train time 0.031s | ||
retry_reclaim: freed 2.873 GiB | ||
[ Info: Epoch 95 train time 0.03s | ||
retry_reclaim: freed 2.873 GiB | ||
[ Info: Epoch 96 train time 0.031s | ||
[ Info: Epoch 97 train time 0.027s | ||
retry_reclaim: freed 2.873 GiB | ||
[ Info: Epoch 98 train time 0.031s | ||
retry_reclaim: freed 2.865 GiB | ||
[ Info: Epoch 99 train time 0.031s | ||
retry_reclaim: freed 2.865 GiB | ||
[ Info: Epoch 100 train time 0.031s | ||
[ Info: Total time 4.307s | ||
``` | ||
|
||
With eager garbage collection enabled, more frequent but less costly pauses result in | ||
significantly improved performance: | ||
|
||
```text | ||
❯ JULIA_CUDA_GC_EARLY=true JULIA_CUDA_HARD_MEMORY_LIMIT=4GiB \ | ||
julia --project wip.jl | ||
... | ||
[ Info: Epoch 90 train time 0.031s | ||
maybe_collect: collected 1.8 GiB | ||
maybe_collect: collected 1.8 GiB | ||
[ Info: Epoch 91 train time 0.033s | ||
maybe_collect: collected 1.8 GiB | ||
[ Info: Epoch 92 train time 0.031s | ||
maybe_collect: collected 1.8 GiB | ||
[ Info: Epoch 93 train time 0.031s | ||
maybe_collect: collected 1.8 GiB | ||
[ Info: Epoch 94 train time 0.03s | ||
maybe_collect: collected 1.8 GiB | ||
[ Info: Epoch 95 train time 0.03s | ||
maybe_collect: collected 1.8 GiB | ||
maybe_collect: collected 1.8 GiB | ||
[ Info: Epoch 96 train time 0.033s | ||
maybe_collect: collected 1.8 GiB | ||
[ Info: Epoch 97 train time 0.03s | ||
maybe_collect: collected 1.8 GiB | ||
[ Info: Epoch 98 train time 0.03s | ||
maybe_collect: collected 1.8 GiB | ||
[ Info: Epoch 99 train time 0.03s | ||
maybe_collect: collected 1.8 GiB | ||
[ Info: Epoch 100 train time 0.03s | ||
[ Info: Total time 3.76s | ||
``` | ||
|
||
Eager garbage collection is driven by a heuristic that considers the current memory | ||
pressure, how much memory was freed during previous collections, and how much time that | ||
took. It is possible that the current implementation is not optimal, so if you encounter | ||
performance issues, please file an issue. | ||
|
||
|
||
## Tracked memory allocations | ||
|
||
When working with multiple GPUs, it is important to differentiate between the device that | ||
memory was allocated on, and the device used to execute code. Practically, this meant that | ||
users of CUDA.jl had to manually remember that allocating and using `CuArray` objects | ||
(typically) needed to happen with the same device active. The same is true for streams, | ||
which are used to order operations executing on a single GPU. | ||
|
||
To improve this, **CUDA.jl now keeps track of the device that owns the memory, and the | ||
stream last used to access it, enabling the package to "do the right thing" when using that | ||
memory** in kernels or with library functionality. This does **not** mean that CUDA.jl will | ||
automatically switch the active device: We want to keep the user in control of that, as it | ||
often makes sense to access memory from another device, if your system supports it. | ||
|
||
Let's break down what the implications are of this change. | ||
|
||
**1. Using multiple GPUs** | ||
|
||
If you have multiple GPUs, it may be possible that direct P2P access between devices is | ||
possible (e.g., using NVLink, or just over PCIe). In this case, CUDA.jl will now | ||
automatically configure the system to allow such access, making it possible to seamlessly | ||
use memory allocated on one device in kernels executing on a different device: | ||
|
||
```julia | ||
julia> # Allocate memory on device 0 | ||
device!(0) | ||
CuDevice(0): Tesla V100-PCIE-16GB | ||
julia> a = CuArray([1]); | ||
|
||
julia> # Use on device 1 | ||
device!(1) | ||
CuDevice(1): Tesla V100S-PCIE-32GB | ||
julia> a .+ 1; | ||
``` | ||
|
||
If P2P access between devices is not possible, CUDA.jl will now raise an error instead of | ||
throwing an illegal memory access error as it did before: | ||
|
||
```julia | ||
julia> # Use on incompatible device 2 | ||
device!(2) | ||
CuDevice(2): NVIDIA GeForce GTX 1080 Ti | ||
julia> a .+ 1 | ||
ERROR: cannot take the GPU address of inaccessible device memory. | ||
|
||
You are trying to use memory from GPU 0 on GPU 2. | ||
P2P access between these devices is not possible; | ||
either switch to GPU 0 by calling `CUDA.device!(0)`, | ||
or copy the data to an array allocated on device 2. | ||
``` | ||
|
||
As the error message suggests, you can always copy memory between devices using the | ||
`copyto!` function. In this case, CUDA.jl will fall back to staging the copy on the host | ||
when P2P access is not possible. | ||
|
||
**2. Using multiple streams** | ||
|
||
Streams are used to order operations executing on a single GPU. In CUDA.jl, every Julia task | ||
has its own stream, making it very easy to group independent operations together, and make | ||
it possible for the GPU to potentially overlap execution of these operations. | ||
|
||
Before CUDA.jl v5.4, users had to be careful about synchronizing data used in multiple | ||
tasks. It was recommended, for example, to end every data-producing task with an explicit | ||
call to `synchronize()`, or alternatively make sure to `device_synchronize()` at the start | ||
of a data-consuming task. Now that CUDA.jl keeps track of the stream used to last access | ||
memory, it can automatically synchronize streams when needed: | ||
|
||
```julia | ||
# Allocate some data | ||
a = CUDA.zeros(4096, 4096) | ||
b = CUDA.zeros(4096, 4096) | ||
#synchronize() # No longer needed | ||
|
||
# Perform work on a task | ||
t = @async begin | ||
a * b | ||
#synchronize() # No longer needed | ||
end | ||
|
||
# Fetch the results | ||
c = fetch(t) | ||
``` | ||
|
||
**3. Using capturing APIs** | ||
|
||
All of the above is implemented by piggybacking on the function that converts memory objects | ||
to pointers, in the assumption that this will be the final operation before the memory is | ||
used. This is generally true, with one important exception: APIs that capture memory. For | ||
example, when recording an operation using the CUDA graph APIs, a memory address may be | ||
captured and used later without CUDA.jl being aware of it. | ||
|
||
CUDA.jl accounts for this by detecting conversions during stream capture, however, some APIs | ||
may not covered yet. If you encounter issues with capturing APIs, let us know, and keep | ||
using additional synchronization calls to ensure correctness. | ||
|
||
|
||
## Unified memory iteration | ||
|
||
Unified memory is a feature of CUDA that allows memory to be accessed from both the CPU and | ||
the GPU. We have now greatly **improved the performance of using unified memory with CPU | ||
code that iterates over elements** of a `CuArray`. Although this is typically unwanted, | ||
triggering the dreaded "scalar indexing" error when accessing device memory in such a way, | ||
it can be useful when incrementaly porting code to the GPU. | ||
|
||
Concretely, accessing elements of a unified `CuArray` on the CPU is much faster now: | ||
|
||
```julia-repl | ||
julia> # Reference | ||
a = [1]; | ||
julia> @btime $a[]; | ||
1.959 ns (0 allocations: 0 bytes) | ||
julia> b = cu(a; unified=true); | ||
julia> # Before | ||
@btime $b[] | ||
2.617 μs (0 allocations: 0 bytes); | ||
julia> # After | ||
@btime $b[]; | ||
4.140 ns (0 allocations: 0 bytes) | ||
``` | ||
|
||
Notice the different unit! This has a massive impact on real-life performance, for example, | ||
as demonstrated by calling `foldl` which does not have a GPU-optimized implementation: | ||
|
||
```julia-repl | ||
julia> a = cu(rand(1024, 1024); unified=true); | ||
julia> # Before | ||
@b foldl(+, a) | ||
4.210 s (9 allocs: 208 bytes, without a warmup) | ||
julia> # After | ||
@b foldl(+, a) | ||
3.107 ms (9 allocs: 208 bytes) | ||
``` | ||
|
||
For completeness, doing this with regular device memory triggers a scalar indexing error: | ||
|
||
```julia-repl | ||
julia> a = cu(rand(1024, 1024)); | ||
julia> foldl(+, a) | ||
ERROR: Scalar indexing is disallowed. | ||
``` | ||
|
||
These changes should make it easier to port applications to the GPU by incrementally moving | ||
parts of the codebase to the GPU without having to worry about the performance of accessing | ||
memory from the CPU. The only requirement is to use unified memory, e.g., by calling `cu` | ||
with `unified=true`, or setting the CUDA.jl preference `default_memory` to use unified | ||
memory by default. However, as unified memory comes with a slight cost, and results in | ||
synchronous allocation behavior, it is still recommended to switch back to regular device | ||
memory when your application has been fully ported to the GPU. | ||
|
||
|
||
## Other changes | ||
|
||
To keep this post from becoming even longer, a quick rundown of other changes: | ||
|
||
* [@wsmoses](https://github.com/wsmoses) introduced initial support for automatic | ||
differentiation of heterogeneous host/device code using Enzyme.jl. Before, you would have | ||
to differentiate through host and device code separately, and manually set up rules for | ||
crossing the host/device boundary. Now, you can differentiate through entire applications | ||
with ease; | ||
* `CUDA.@profile` now [automatically detects external | ||
profilers](https://github.com/JuliaGPU/CUDA.jl/pull/2339), so it should not be required to | ||
specify `external=true` anymore when running under NSight; | ||
* [Exception output has been improved](https://github.com/JuliaGPU/CUDA.jl/pull/2342), only | ||
reporting a single error message instead of generating output on each thread, and better | ||
forwarding the exception type; | ||
* Cached handles from libraries [will now be | ||
freed](https://github.com/JuliaGPU/CUDA.jl/pull/2352) when under memory pressure; | ||
* Tegra devices [are now supported](https://github.com/JuliaGPU/CUDA.jl/pull/2374) by our | ||
artifacts, obviating the use of a local toolkit; | ||
* Support for [CUDA 12.5](https://github.com/JuliaGPU/CUDA.jl/pull/2392) has been added, as | ||
well as [initial support for Julia 1.12](https://github.com/JuliaGPU/CUDA.jl/pull/2390). |