CoreWeave supports the NVIDIA Collective Communication Library (NCCL) for multi-GPU and multi-node neural network training. NCCL underpins most distributed training frameworks, including DeepSpeed, PyTorch Distributed, and Horovod.
NCCL is supported across all CoreWeave NVIDIA GPUs over Ethernet. In addition, the specialized A100 HGX clusters are built to the design of NVIDIA DGX SuperPODs, including NVIDIA Quantum InfiniBand networking and in-network collectives using NVIDIA SHARP, to deliver the highest distributed training performance possible.
This repository includes Dockerfiles that can be used directly or as a template for your distributed training applications. The Dockerfiles include the following components:
- NVIDIA Mellanox OFED Driver userspace components. The kernel side is installed on our bare-metal nodes and does not need to be installed by users. The OFED drivers are necessary for optimized InfiniBand communication.
- NVIDIA HPC-X, which packages OpenMPI and UCX
- NVIDIA HPC-X OpenMPI compiled with external PMIx to enable SLURM integration
- NVIDIA GDRCopy libraries leverage GPUDirect RDMA for improved GPU to host memory copy performance in certain applications. The kernel support for GDRCopy exists on CoreWeave's bare-metal nodes. GDRCopy is only supported on A100 training clusters.
- NVIDIA NCCL SHARP Plugin for SHARP support in NCCL
- NVIDIA NCCL Tests for verification and benchmarking purposes
- NVIDIA DCGM for GPU tests and health checks
- NVIDIA bandwidthTest utility
- RDMA Perftest with GPUDirect
- OpenSSH server and related settings to enable images to easily be used as MPI Runners
CoreWeave also publishes images built from these Dockerfiles that can be used as a base for your own images.
Image Tag | Ubuntu | CUDA | NCCL | HPC-X |
---|---|---|---|---|
ghcr.io/coreweave/nccl-tests:12.6.1-cudnn-devel-ubuntu20.04-nccl2.23.4-1-2ff05b2 | 20.04 | 12.6.1 | 2.23.4 | 2.20.0 |
ghcr.io/coreweave/nccl-tests:12.4.1-cudnn-devel-ubuntu20.04-nccl2.23.4-1-2ff05b2 | 20.04 | 12.4.1 | 2.23.4 | 2.20.0 |
ghcr.io/coreweave/nccl-tests:12.2.2-cudnn8-devel-ubuntu20.04-nccl2.21.5-1-2ff05b2 | 20.04 | 12.2.2 | 2.21.5 | 2.20.0 |
ghcr.io/coreweave/nccl-tests:12.0.1-cudnn8-devel-ubuntu20.04-nccl2.19.3-1-2ff05b2 | 20.04 | 12.0.1 | 2.19.3 | 2.20.0 |
ghcr.io/coreweave/nccl-tests:11.8.0-cudnn8-devel-ubuntu20.04-nccl2.16.5-1-868dc3d | 20.04 | 11.8.0 | 2.16.5 | 2.14.0 |
ghcr.io/coreweave/nccl-tests:12.6.1-cudnn-devel-ubuntu22.04-nccl2.23.4-1-2ff05b2 | 22.04 | 12.6.1 | 2.23.4 | 2.20.0 |
ghcr.io/coreweave/nccl-tests:12.4.1-cudnn-devel-ubuntu22.04-nccl2.23.4-1-2ff05b2 | 22.04 | 12.4.1 | 2.23.4 | 2.20.0 |
ghcr.io/coreweave/nccl-tests:12.2.2-cudnn8-devel-ubuntu22.04-nccl2.23.4-1-2ff05b2 | 22.04 | 12.2.2 | 2.23.4 | 2.20.0 |
ghcr.io/coreweave/nccl-tests:12.0.1-cudnn8-devel-ubuntu22.04-nccl2.18.5-1-2ff05b2 | 22.04 | 12.0.1 | 2.18.5 | 2.20.0 |
This repo contains many sample jobs showing how to run distributed NCCL tests with two workload managers: the MPI Operator and Slurm.
CoreWeave provides a managed instance of the MPI Operator to allow running MPI Jobs in a container-native fashion. No installation is required by the user; simply apply an MPIJob manifest in your namespace.
Example manifests are provided in the `mpi-operator/` directory. There you'll find the following examples of 64 GPU (8 node) runs:
To start the NCCL test, apply the sample manifest into your namespace with `kubectl`:
$ kubectl apply -f nccl-test-distributed-h100-64-las1-sharp-mpijob.yaml
$ kubectl get pods
nccl-test-64-launcher-lnnrw 1/1 Running 0 14s
nccl-test-64-worker-0 1/1 Running 0 16s
nccl-test-64-worker-1 1/1 Running 0 16s
nccl-test-64-worker-10 1/1 Running 0 15s
...
$ kubectl logs -f -l=training.kubeflow.org/job-role=launcher
# nThread 1 nGpus 1 minBytes 4 maxBytes 2147483648 step: 2(factor) warmup iters: 50 iters: 50 validation: 1
#
...
# out-of-place in-place
# size count type redop root time algbw busbw #wrong time algbw busbw #wrong
# (B) (elements) (us) (GB/s) (GB/s) (us) (GB/s) (GB/s)
536870912 134217728 float sum -1 2984.6 179.88 356.01 0 2979.7 180.18 356.60 0
1073741824 268435456 float sum -1 5808.0 184.87 365.90 0 5882.2 182.54 361.28 0
2147483648 536870912 float sum -1 11163 192.37 380.73 0 11203 191.70 379.40 0
4294967296 1073741824 float sum -1 22181 193.63 383.23 0 22570 190.29 376.62 0
8589934592 2147483648 float sum -1 43980 195.31 386.56 0 44094 194.81 385.56 0
# Out of bounds values : 0 OK
# Avg bus bandwidth : 373.187
#
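As a sanity check on how to read the table above, the `algbw` column is the message size divided by the elapsed time. The sketch below recomputes it from the size and out-of-place time columns. The function name `recompute_algbw` is hypothetical, and it assumes result rows are the lines whose first field is a plain byte count. `busbw` applies a further collective-dependent correction that is not reproduced here.

```shell
#!/usr/bin/env bash
# Recompute algbw (GB/s) from the size (bytes) and out-of-place time (us)
# columns of nccl-tests result rows read from stdin.
# Assumption: result rows start with a byte count and have 13 fields;
# header and summary lines start with "#" and are skipped.
recompute_algbw() {
  awk '$1 ~ /^[0-9]+$/ && NF >= 13 {
    # bytes / microseconds = MB/s; divide by 1000 to get GB/s
    printf "%s %.2f\n", $1, $1 / $6 / 1000
  }'
}

# One row copied from the output above:
echo "536870912 134217728 float sum -1 2984.6 179.88 356.01 0 2979.7 180.18 356.60 0" \
  | recompute_algbw
# prints: 536870912 179.88
```

The result matches the `algbw` value reported in the same row, so the function can also be pointed at a full log, for example `kubectl logs ... | recompute_algbw`.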
Before running a new instance of a test, delete the old one with `kubectl delete mpijob <job name>` or `kubectl delete mpijob --all`. Please note that it is important to wait for all pods from an earlier job to finish terminating before starting a new job with the same name.
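The delete-then-wait sequence can be scripted. The following is a minimal sketch, not part of the repository: the helper name `delete_mpijob_and_wait` is hypothetical, and it assumes the job's pods carry the label `training.kubeflow.org/job-name=<job name>`, which you should verify with `kubectl get pods --show-labels` before relying on it.

```shell
#!/usr/bin/env bash
# Delete an MPIJob and block until its pods have finished terminating.
# Assumption: pods are labeled training.kubeflow.org/job-name=<job name>.
delete_mpijob_and_wait() {
  local job="$1"
  local timeout="${2:-300s}"
  kubectl delete mpijob "$job" --ignore-not-found
  kubectl wait --for=delete pod \
    -l "training.kubeflow.org/job-name=${job}" \
    --timeout="$timeout"
}

# Usage (job name inferred from the pod names shown earlier; adjust to your manifest):
# delete_mpijob_and_wait nccl-test-64
```

Depending on the `kubectl` version, `kubectl wait` may report an error when no matching pods remain; in that case the pods are already gone and it is safe to continue.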
CoreWeave provides a way to deploy a Slurm cluster on top of our managed Kubernetes cluster using a tool called `sunk`.
Example `SBATCH` scripts are provided in the `slurm/` directory. There you'll find the following examples of 64 GPU (8 node) runs:
- A100 without enroot
- A100 with enroot
- H100 without enroot
- H100 with enroot
- H100 with enroot and SHARP
To submit the jobs on a Slurm cluster, first copy the scripts onto the login node.
The scripts set various parameters, but make sure to specify the desired partition when submitting the job.
To start the NCCL test, submit the job via `sbatch`:
export PARTITION=<enter partition>
sbatch --partition="$PARTITION" nccl-test-distributed-a100-64.slurm
The logs will be written to `./nccl_test.out`.
Note: The jobs that don't use enroot rely on `nccl-tests` being installed at `/opt/nccl-tests`, which will be true of every `sunk` cluster.
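Once the job has finished, the headline number can be pulled out of the log. This is a small sketch, assuming the Slurm log ends with the same `Avg bus bandwidth` summary line shown in the MPI Operator output earlier; the helper name `avg_busbw` is hypothetical.

```shell
#!/usr/bin/env bash
# Print the average bus bandwidth reported in an nccl-tests log file.
# Assumption: the log contains a line of the form
#   "# Avg bus bandwidth : <value>"
avg_busbw() {
  local log="${1:-./nccl_test.out}"
  # Split on ":" and strip whitespace from the value field.
  awk -F: '/Avg bus bandwidth/ { gsub(/[ \t]/, "", $2); print $2 }' "$log"
}

# Usage:
# avg_busbw ./nccl_test.out
```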
Enroot is a tool that enables running unprivileged containers. In combination with pyxis, a Slurm container plugin, it lets you run Slurm jobs inside containers created from Docker images.
Pyxis enables additional parameters, but these example scripts use it only through `srun`'s `--container-image` parameter. This avoids having to install the script and its requirements on all compute nodes.
Note: You can also specify the container image at the `sbatch` level, but then all commands in the script run inside the container. Therefore, we recommend specifying the container image only in the subsequent `srun` calls.
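To illustrate the `--container-image` usage described above: enroot and pyxis write image references as `registry#repository:tag` rather than the Docker-style `registry/repository:tag`. The sketch below converts one of the published image tags. The helper name `to_enroot_image` is hypothetical, and it assumes the reference starts with a registry host.

```shell
#!/usr/bin/env bash
# Convert a Docker-style reference (registry/repo:tag) into the
# registry#repo:tag form used by enroot and pyxis.
# Assumption: the first path component is the registry host.
to_enroot_image() {
  local ref="$1"
  local registry="${ref%%/*}"   # text before the first "/"
  local rest="${ref#*/}"        # text after the first "/"
  printf '%s#%s\n' "$registry" "$rest"
}

IMAGE="$(to_enroot_image ghcr.io/coreweave/nccl-tests:12.4.1-cudnn-devel-ubuntu22.04-nccl2.23.4-1-2ff05b2)"
echo "$IMAGE"

# Inside an sbatch script, only this step would then run in the container:
# srun --container-image="$IMAGE" <command inside the image>
```

Keeping the image on the `srun` line means the rest of the batch script still runs on the host, which is the behavior recommended in the note above.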
Both workload managers can be used to run DeepSpeed-based distributed training jobs in the same way as the NCCL test jobs. Each creates the MPI hostfile for you, so DeepSpeed can be run with the same command you would use with a manually written hostfile.
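For reference, a hostfile lists one host per line with its slot count. The sketch below only illustrates what a manually written file looks like; the hostnames, the slot count of 8, and the helper name `make_hostfile` are all hypothetical, and in practice the workload manager generates the file.

```shell
#!/usr/bin/env bash
# Emit one "<host> slots=<n>" line per hostname given as an argument.
# Usage: make_hostfile <slots per host> <host> [<host> ...]
make_hostfile() {
  local slots="$1"
  shift
  local host
  for host in "$@"; do
    printf '%s slots=%s\n' "$host" "$slots"
  done
}

# Hypothetical two-node example with 8 GPUs per node:
make_hostfile 8 node-0 node-1
# prints:
# node-0 slots=8
# node-1 slots=8

# A manual launch would then look like (train.py is a placeholder):
# deepspeed --hostfile <path to hostfile> train.py
```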
GDRCopy can be enabled to improve CPU to GPU memory communication in certain use cases. GDRCopy is supported in NCCL using a hidden environment variable, `NCCL_GDRCOPY_ENABLE`. In our testing, performance improvements for regular NCCL allreduce workloads have not been measured. We do not recommend enabling GDRCopy for NCCL without performing adequate benchmarks to ensure that performance is improved. The GDRCopy documentation notes that performance in some cases is degraded instead of improved.
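One way to run such a benchmark is a simple A/B comparison. The sketch below runs a given command once with `NCCL_GDRCOPY_ENABLE=0` and once with `NCCL_GDRCOPY_ENABLE=1` and prints the average bus bandwidth of each run. The values `0` and `1`, the helper name `compare_gdrcopy`, and the assumption that the command prints an nccl-tests style summary line are all assumptions, not something this repository documents.

```shell
#!/usr/bin/env bash
# Run a benchmark command twice, toggling NCCL_GDRCOPY_ENABLE, and print
# the "Avg bus bandwidth" value reported by each run.
# Assumptions: 0/1 are valid values for the variable, and the command
# prints a line of the form "# Avg bus bandwidth : <value>".
compare_gdrcopy() {
  local setting
  for setting in 0 1; do
    printf 'NCCL_GDRCOPY_ENABLE=%s ' "$setting"
    NCCL_GDRCOPY_ENABLE="$setting" "$@" \
      | awk -F: '/Avg bus bandwidth/ { gsub(/[ \t]/, "", $2); print $2 }'
  done
}

# Usage (replace the placeholder with your own launch line):
# compare_gdrcopy <your nccl-tests launch command>
```

For multi-node runs the variable also has to reach every rank, for example by exporting it through the launcher, and several repetitions per setting give a more reliable comparison than a single pair of runs.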