diff --git a/.wordlist.txt b/.wordlist.txt index f37b79f188..a68e052a36 100644 --- a/.wordlist.txt +++ b/.wordlist.txt @@ -107,6 +107,7 @@ FFT FFTs FFmpeg FHS +FIXME FMA FP FX @@ -131,6 +132,7 @@ GiB GIM GL GLXT +Gloo GMI GPG GPR @@ -148,6 +150,7 @@ HCA HGX HIPCC HIPExtension +HIPification HIPIFY HPC HPCG @@ -243,6 +246,7 @@ MyEnvironment MyST NBIO NBIOs +NCCL NIC NICs NLI @@ -401,9 +405,14 @@ TensorFlow TensorParallel ToC TorchAudio +torchaudio +TorchElastic TorchMIGraphX +torchrec TorchScript TorchServe +torchserve +torchtext TorchVision TransferBench TrapStatus @@ -510,6 +519,9 @@ copyable cpp csn cuBLAS +cuda +cuDNN +cudnn cuFFT cuLIB cuRAND @@ -674,6 +686,7 @@ prebuilt precompiled preconditioner preconfigured +preemptible prefetch prefetchable prefill @@ -690,6 +703,7 @@ profilers protobuf pseudorandom py +recommender quantile quantizer quasirandom diff --git a/docs/compatibility/compatibility-matrix-historical-6.0.csv b/docs/compatibility/compatibility-matrix-historical-6.0.csv index ff13f3c290..b53168bc8b 100644 --- a/docs/compatibility/compatibility-matrix-historical-6.0.csv +++ b/docs/compatibility/compatibility-matrix-historical-6.0.csv @@ -22,7 +22,7 @@ ROCm Version,6.3.1,6.3.0,6.2.4,6.2.2,6.2.1,6.2.0, 6.1.2, 6.1.1, 6.1.0, 6.0.2, 6. ,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908,gfx908 ,,,,,,,,,,, FRAMEWORK SUPPORT,.. _framework-support-compatibility-matrix-past-60:,,,,,,,,,, - :doc:`PyTorch `,"2.4, 2.3, 2.2, 2.1, 2.0, 1.13","2.4, 2.3, 2.2, 2.1, 2.0, 1.13","2.3, 2.2, 2.1, 2.0, 1.13","2.3, 2.2, 2.1, 2.0, 1.13","2.3, 2.2, 2.1, 2.0, 1.13","2.3, 2.2, 2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13" + :doc:`PyTorch <../compatibility/pytorch-compatiblity>`,"2.4, 2.3, 2.2, 2.1, 2.0, 1.13","2.4, 2.3, 2.2, 2.1, 2.0, 1.13","2.3, 2.2, 2.1, 2.0, 1.13","2.3, 2.2, 2.1, 2.0, 1.13","2.3, 2.2, 2.1, 2.0, 1.13","2.3, 2.2, 2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13","2.1, 2.0, 1.13" :doc:`TensorFlow `,"2.17.0, 2.16.2, 2.15.1","2.17.0, 2.16.2, 2.15.1","2.16.1, 2.15.1, 2.14.1","2.16.1, 2.15.1, 2.14.1","2.16.1, 2.15.1, 2.14.1","2.16.1, 2.15.1, 2.14.1","2.15.0, 2.14.0, 2.13.1","2.15.0, 2.14.0, 2.13.1","2.15.0, 2.14.0, 2.13.1","2.14.0, 2.13.1, 2.12.1","2.14.0, 2.13.1, 2.12.1" :doc:`JAX `,0.4.35,0.4.35,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26,0.4.26 `ONNX Runtime `_,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.17.3,1.14.1,1.14.1 diff --git a/docs/compatibility/compatibility-matrix.rst b/docs/compatibility/compatibility-matrix.rst index ae157874f9..b58e8ecd46 100644 --- a/docs/compatibility/compatibility-matrix.rst +++ b/docs/compatibility/compatibility-matrix.rst @@ -47,7 +47,7 @@ compatibility and system requirements. ,gfx908,gfx908,gfx908 ,,, FRAMEWORK SUPPORT,.. _framework-support-compatibility-matrix:,, - :doc:`PyTorch `,"2.4, 2.3, 2.2, 1.13","2.4, 2.3, 2.2, 2.1, 2.0, 1.13","2.3, 2.2, 2.1, 2.0, 1.13" + :doc:`PyTorch <../compatibility/pytorch-compatiblity>`,"2.4, 2.3, 2.2, 1.13","2.4, 2.3, 2.2, 2.1, 2.0, 1.13","2.3, 2.2, 2.1, 2.0, 1.13" :doc:`TensorFlow `,"2.17.0, 2.16.2, 2.15.1","2.17.0, 2.16.2, 2.15.1","2.16.1, 2.15.1, 2.14.1" :doc:`JAX `,0.4.35,0.4.35,0.4.26 `ONNX Runtime `_,1.17.3,1.17.3,1.17.3 diff --git a/docs/compatibility/pytorch-compatiblity.rst b/docs/compatibility/pytorch-compatiblity.rst new file mode 100644 index 0000000000..a7da7fb37a --- /dev/null +++ b/docs/compatibility/pytorch-compatiblity.rst @@ -0,0 +1,951 @@ +.. 
meta:: + :description: PyTorch compatibility + :keywords: GPU, PyTorch compatibility + +******************************************************************************** +PyTorch compatibility +******************************************************************************** + +`PyTorch `_ is an open-source tensor library designed for +deep learning. PyTorch on ROCm provides mixed-precision and large-scale training +using `MIOpen `_ and +`RCCL `_ libraries. + +ROCm support for PyTorch is upstreamed into the official PyTorch repository. Due to independent +compatibility considerations, this results in two distinct release cycles for PyTorch on ROCm: + +- ROCm PyTorch release: + + - Provides the latest version of ROCm but doesn't immediately support the latest stable PyTorch + version. + + - Offers :ref:`Docker images ` with ROCm and PyTorch + pre-installed. + + - ROCm PyTorch repository: ``__ + + - See the :doc:`ROCm PyTorch installation guide ` to get started. + +- Official PyTorch release: + + - Provides the latest stable version of PyTorch but doesn't immediately support the latest ROCm version. + + - Official PyTorch repository: ``__ + + - See the `Nightly and latest stable version installation guide `_ + or `Previous versions `_ to get started. + +The upstream PyTorch includes an automatic HIPification solution that automatically generates HIP +source code from the CUDA backend. This approach allows PyTorch to support ROCm without requiring +manual code modifications. + +ROCm's development is aligned with the stable release of PyTorch while upstream PyTorch testing uses +the stable release of ROCm to maintain consistency. + +.. _pytorch-docker-compat: + +Docker image compatibility +================================================================================ + +AMD validates and publishes ready-made `PyTorch `_ +images with ROCm backends on Docker Hub. The following Docker image tags and +associated inventories are validated for `ROCm 6.3.0 `_. + +.. list-table:: PyTorch Docker image components + :header-rows: 1 + :class: docker-image-compatibility + + * - Docker + - PyTorch + - Ubuntu + - Python + - Apex + - torchvision + - TensorBoard + - MAGMA + - UCX + - OMPI + - OFED + + * - .. raw:: html + + + + - `2.4.0 `_ + - 24.04 + - `3.12 `_ + - `1.4.0 `_ + - `0.19.0 `_ + - `2.13.0 `_ + - `master `_ + - `1.10.0 `_ + - `4.0.7 `_ + - `5.3-1.0.5.0 `_ + + * - .. raw:: html + + + + - `2.4.0 `_ + - 22.04 + - `3.10 `_ + - `1.4.0 `_ + - `0.19.0 `_ + - `2.13.0 `_ + - `master `_ + - `1.10.0 `_ + - `4.0.7 `_ + - `5.3-1.0.5.0 `_ + + * - .. raw:: html + + + + - `2.4.0 `_ + - 22.04 + - `3.9 `_ + - `1.4.0 `_ + - `0.19.0 `_ + - `2.13.0 `_ + - `master `_ + - `1.10.0 `_ + - `4.0.7 `_ + - `5.3-1.0.5.0 `_ + + * - .. raw:: html + + + + - `2.3.0 `_ + - 22.04 + - `3.10 `_ + - `1.3.0 `_ + - `0.18.0 `_ + - `2.13.0 `_ + - `master `_ + - `1.14.1 `_ + - `4.1.5 `_ + - `5.3-1.0.5.0 `_ + + * - .. raw:: html + + + + - `2.2.1 `_ + - 22.04 + - `3.10 `_ + - `1.2.0 `_ + - `0.17.1 `_ + - `2.13.0 `_ + - `master `_ + - `1.14.1 `_ + - `4.1.5 `_ + - `5.3-1.0.5.0 `_ + + * - .. raw:: html + + + + - `2.2.1 `_ + - 20.04 + - `3.9 `_ + - `1.2.0 `_ + - `0.17.1 `_ + - `2.13.0 `_ + - `master `_ + - `1.10.0 `_ + - `4.0.3 `_ + - `5.3-1.0.5.0 `_ + + * - .. raw:: html + + + + - `1.13.1 `_ + - 22.04 + - `3.9 `_ + - `1.0.0 `_ + - `0.14.0 `_ + - `2.18.0 `_ + - `master `_ + - `1.14.1 `_ + - `4.1.5 `_ + - `5.3-1.0.5.0 `_ + + * - .. 
raw:: html + + + + - `1.13.1 `_ + - 20.04 + - `3.9 `_ + - `1.0.0 `_ + - `0.14.0 `_ + - `2.18.0 `_ + - `master `_ + - `1.10.0 `_ + - `4.0.3 `_ + - `5.3-1.0.5.0 `_ + +Critical ROCm libraries for PyTorch +================================================================================ + +The functionality of PyTorch with ROCm is shaped by its underlying library +dependencies. These critical ROCm components affect the capabilities, +performance, and feature set available to developers. + +.. list-table:: + :header-rows: 1 + + * - ROCm library + - Version + - Purpose + - Used in + * - `Composable Kernel `_ + - 1.1.0 + - Enables faster execution of core operations like matrix multiplication + (GEMM), convolutions and transformations. + - Speeds up ``torch.permute``, ``torch.view``, ``torch.matmul``, + ``torch.mm``, ``torch.bmm``, ``torch.nn.Conv2d``, ``torch.nn.Conv3d`` + and ``torch.nn.MultiheadAttention``. + * - `hipBLAS `_ + - 2.3.0 + - Provides GPU-accelerated Basic Linear Algebra Subprograms (BLAS) for + matrix and vector operations. + - Supports operations like matrix multiplication, matrix-vector products, + and tensor contractions. Utilized in both dense and batched linear + algebra operations. + * - `hipBLASLt `_ + - 0.10.0 + - hipBLASLt is an extension of the hipBLAS library, providing additional + features like epilogues fused into the matrix multiplication kernel or + use of integer tensor cores. + - It accelerates operations like ``torch.matmul``, ``torch.mm``, and the + matrix multiplications used in convolutional and linear layers. + * - `hipCUB `_ + - 3.3.0 + - Provides a C++ template library for parallel algorithms for reduction, + scan, sort and select. + - Supports operations like ``torch.sum``, ``torch.cumsum``, ``torch.sort`` + and ``torch.topk``. Operations on sparse tensors or tensors with + irregular shapes often involve scanning, sorting, and filtering, which + hipCUB handles efficiently. + * - `hipFFT `_ + - 1.0.17 + - Provides GPU-accelerated Fast Fourier Transform (FFT) operations. + - Used in functions like the ``torch.fft`` module. + * - `hipRAND `_ + - 2.11.0 + - Provides fast random number generation for GPUs. + - The ``torch.rand``, ``torch.randn`` and stochastic layers like + ``torch.nn.Dropout``. + * - `hipSOLVER `_ + - 2.3.0 + - Provides GPU-accelerated solvers for linear systems, eigenvalues, and + singular value decompositions (SVD). + - Supports functions like ``torch.linalg.solve``, + ``torch.linalg.eig``, and ``torch.linalg.svd``. + * - `hipSPARSE `_ + - 3.1.2 + - Accelerates operations on sparse matrices, such as sparse matrix-vector + or matrix-matrix products. + - Sparse tensor operations ``torch.sparse``. + * - `hipSPARSELt `_ + - 0.2.2 + - Accelerates operations on sparse matrices, such as sparse matrix-vector + or matrix-matrix products. + - Sparse tensor operations ``torch.sparse``. + * - `hipTensor `_ + - 1.4.0 + - Optimizes for high-performance tensor operations, such as contractions. + - Accelerates tensor algebra, especially in deep learning and scientific + computing. + * - `MIOpen `_ + - 3.3.0 + - Optimizes deep learning primitives such as convolutions, pooling, + normalization, and activation functions. + - Speeds up convolutional neural networks (CNNs), recurrent neural + networks (RNNs), and other layers. Used in operations like + ``torch.nn.Conv2d``, ``torch.nn.ReLU``, and ``torch.nn.LSTM``. 
+   * - `MIGraphX `_
+     - 2.11.0
+     - Adds graph-level optimizations and mixed-precision support, runs ONNX
+       models, and enables ahead-of-time (AOT) compilation.
+     - Speeds up inference models and executes ONNX models for
+       compatibility with other frameworks.
+   * - `MIVisionX `_
+     - 3.1.0
+     - Accelerates computer vision and AI workloads like preprocessing,
+       augmentation, and inferencing.
+     - Faster data preprocessing and augmentation pipelines for datasets like
+       ImageNet or COCO and easy to integrate into PyTorch's ``torch.utils.data``
+       and ``torchvision`` workflows.
+   * - `rocAL `_
+     - 2.1.0
+     - Accelerates the data pipeline by offloading intensive preprocessing and
+       augmentation tasks. rocAL is part of MIVisionX.
+     - Easy to integrate into PyTorch's ``torch.utils.data`` and
+       ``torchvision`` data load workloads.
+   * - `RCCL `_
+     - 2.21.5
+     - Optimizes multi-GPU communication for operations like AllReduce and
+       Broadcast.
+     - Distributed data parallel training (``torch.nn.parallel.DistributedDataParallel``).
+       Handles communication in multi-GPU setups.
+   * - `rocDecode `_
+     - 0.8.0
+     - Provides hardware-accelerated data decoding capabilities, particularly
+       for image, video, and other dataset formats.
+     - Can be integrated in ``torch.utils.data``, ``torchvision.transforms``
+       and ``torch.distributed``.
+   * - `rocJPEG `_
+     - 0.6.0
+     - Provides hardware-accelerated JPEG image decoding and encoding.
+     - GPU-accelerated ``torchvision.io.decode_jpeg`` and
+       ``torchvision.io.encode_jpeg``, and can be integrated in
+       ``torch.utils.data`` and ``torchvision``.
+   * - `RPP `_
+     - 1.9.1
+     - Speeds up data augmentation, transformation, and other preprocessing steps.
+     - Easy to integrate into PyTorch's ``torch.utils.data`` and
+       ``torchvision`` data load workloads.
+   * - `rocThrust `_
+     - 3.3.0
+     - Provides a C++ template library for parallel algorithms like sorting,
+       reduction, and scanning.
+     - Utilized in backend operations for tensor computations requiring
+       parallel processing.
+   * - `rocWMMA `_
+     - 1.6.0
+     - Accelerates warp-level matrix multiply-accumulate (MMA) operations to
+       speed up matrix multiplication (GEMM) and accumulation with
+       mixed-precision support.
+     - Linear layers (``torch.nn.Linear``), convolutional layers
+       (``torch.nn.Conv2d``), attention layers, general tensor operations that
+       involve matrix products, such as ``torch.matmul``, ``torch.bmm``, and
+       more.
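+
+These libraries are exercised through ordinary PyTorch calls rather than a
+separate API. The following minimal sketch (an illustration that assumes a
+working ROCm build of PyTorch and one visible accelerator) touches several of
+them:
+
+.. code-block:: python
+
+   import torch
+
+   # torch.version.hip is a HIP version string on ROCm builds (None on CUDA builds).
+   print(torch.version.hip, torch.cuda.is_available())
+
+   device = torch.device("cuda")  # ROCm GPUs are addressed through the CUDA device API
+
+   a = torch.randn(256, 256, device=device)      # values generated by hipRAND
+   b = torch.randn(256, 256, device=device)
+   c = a @ b                                     # GEMM dispatched to hipBLAS/hipBLASLt
+   f = torch.fft.fft(a)                          # FFT dispatched to hipFFT
+
+   conv = torch.nn.Conv2d(3, 16, kernel_size=3).to(device)
+   y = conv(torch.randn(1, 3, 64, 64, device=device))  # convolution via MIOpen
+   torch.cuda.synchronize()                      # wait for the queued kernels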
+
+Supported and unsupported features
+================================================================================
+
+The following section maps GPU-accelerated PyTorch features to their supported
+ROCm and PyTorch versions.
+
+torch
+--------------------------------------------------------------------------------
+
+`torch `_ is the central module of
+PyTorch, providing data structures for multi-dimensional tensors and
+implementing mathematical operations on them. It also includes utilities for
+efficient serialization of tensors and arbitrary data types, along with various
+other tools.
+
+Tensor data types
+^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^
+
+The data type of a tensor is specified using the ``dtype`` attribute or
+argument, and PyTorch supports a wide range of data types for different use
+cases.
+
+The following table lists `torch.Tensor `_'s single data types:
+
+.. list-table::
+   :header-rows: 1
+
+   * - Data type
+     - Description
+     - Since PyTorch
+     - Since ROCm
+   * - ``torch.float8_e4m3fn``
+     - 8-bit floating point, e4m3
+     - 2.3
+     - 5.5
+   * - ``torch.float8_e5m2``
+     - 8-bit floating point, e5m2
+     - 2.3
+     - 5.5
+   * - ``torch.float16`` or ``torch.half``
+     - 16-bit floating point
+     - 0.1.6
+     - 2.0
+   * - ``torch.bfloat16``
+     - 16-bit brain floating point
+     - 1.6
+     - 2.6
+   * - ``torch.float32`` or ``torch.float``
+     - 32-bit floating point
+     - 0.1.12_2
+     - 2.0
+   * - ``torch.float64`` or ``torch.double``
+     - 64-bit floating point
+     - 0.1.12_2
+     - 2.0
+   * - ``torch.complex32`` or ``torch.chalf``
+     - PyTorch provides native support for 32-bit complex numbers
+     - 1.6
+     - 2.0
+   * - ``torch.complex64`` or ``torch.cfloat``
+     - PyTorch provides native support for 64-bit complex numbers
+     - 1.6
+     - 2.0
+   * - ``torch.complex128`` or ``torch.cdouble``
+     - PyTorch provides native support for 128-bit complex numbers
+     - 1.6
+     - 2.0
+   * - ``torch.uint8``
+     - 8-bit integer (unsigned)
+     - 0.1.12_2
+     - 2.0
+   * - ``torch.uint16``
+     - 16-bit integer (unsigned)
+     - 2.3
+     - Not natively supported
+   * - ``torch.uint32``
+     - 32-bit integer (unsigned)
+     - 2.3
+     - Not natively supported
+   * - ``torch.uint64``
+     - 64-bit integer (unsigned)
+     - 2.3
+     - Not natively supported
+   * - ``torch.int8``
+     - 8-bit integer (signed)
+     - 1.12
+     - 5.0
+   * - ``torch.int16`` or ``torch.short``
+     - 16-bit integer (signed)
+     - 0.1.12_2
+     - 2.0
+   * - ``torch.int32`` or ``torch.int``
+     - 32-bit integer (signed)
+     - 0.1.12_2
+     - 2.0
+   * - ``torch.int64`` or ``torch.long``
+     - 64-bit integer (signed)
+     - 0.1.12_2
+     - 2.0
+   * - ``torch.bool``
+     - Boolean
+     - 1.2
+     - 2.0
+   * - ``torch.quint8``
+     - Quantized 8-bit integer (unsigned)
+     - 1.8
+     - 5.0
+   * - ``torch.qint8``
+     - Quantized 8-bit integer (signed)
+     - 1.8
+     - 5.0
+   * - ``torch.qint32``
+     - Quantized 32-bit integer (signed)
+     - 1.8
+     - 5.0
+   * - ``torch.quint4x2``
+     - Quantized 4-bit integer (unsigned)
+     - 1.8
+     - 5.0
+
+.. note::
+
+   Unsigned types aside from ``uint8`` currently have only limited support in
+   eager mode (they primarily exist to assist usage with ``torch.compile``).
+
+   The :doc:`ROCm precision support page `
+   collects the native hardware support for different data types.
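+
+As a quick illustration of the floating-point entries above (a minimal sketch
+that assumes a visible GPU; the FP8 formats additionally require PyTorch 2.3 or
+later per the table):
+
+.. code-block:: python
+
+   import torch
+
+   x = torch.ones(4, 4, dtype=torch.bfloat16, device="cuda")
+   y = torch.randn(4, 4, device="cuda").half()   # torch.float16
+   print((x + y.bfloat16()).dtype)               # torch.bfloat16
+
+   # The FP8 formats are primarily storage types: cast to a wider dtype
+   # before doing general math on them.
+   f8 = torch.randn(4, 4, device="cuda").to(torch.float8_e4m3fn)
+   print(f8.dtype, f8.float().sum())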
+ - 0.3.0 + - 1.9.2 + * - Running process lists of memory management + - Return a human-readable printout of the running processes and their GPU + memory use for a given device with functions like + ``torch.cuda.memory_stats()`` and ``torch.cuda.memory_summary()``. + - 1.8.0 + - 4.0 + * - Communication collectives + - A set of APIs that enable efficient communication between multiple GPUs, + allowing for distributed computing and data parallelism. + - 1.9.0 + - 5.0 + * - ``torch.cuda.CUDAGraph`` + - Graphs capture sequences of GPU operations to minimize kernel launch + overhead and improve performance. + - 1.10.0 + - 5.3 + * - TunableOp + - A mechanism that allows certain operations to be more flexible and + optimized for performance. It enables automatic tuning of kernel + configurations and other settings to achieve the best possible + performance based on the specific hardware (GPU) and workload. + - 2.0 + - 5.4 + * - NVIDIA Tools Extension (NVTX) + - Integration with NVTX for profiling and debugging GPU performance using + NVIDIA's Nsight tools. + - 1.8.0 + - ❌ + * - Lazy loading NVRTC + - Delays JIT compilation with NVRTC until the code is explicitly needed. + - 1.13.0 + - ❌ + * - Jiterator (beta) + - Jiterator allows asynchronous data streaming into computation streams + during training loops. + - 1.13.0 + - 5.2 + +.. Need to validate and extend. + +torch.backends.cuda +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +``torch.backends.cuda`` is a PyTorch module that provides configuration options +and flags to control the behavior of CUDA or ROCm operations. It is part of the +PyTorch backend configuration system, which allows users to fine-tune how +PyTorch interacts with the CUDA or ROCm environment. + +.. list-table:: + :header-rows: 1 + + * - Data type + - Description + - Since PyTorch + - Since ROCm + * - ``cufft_plan_cache`` + - Manages caching of GPU FFT plans to optimize repeated FFT computations. + - 1.7.0 + - 5.0 + * - ``matmul.allow_tf32`` + - Enables or disables the use of TensorFloat-32 (TF32) precision for + faster matrix multiplications on GPUs with Tensor Cores. + - 1.10.0 + - ❌ + * - ``matmul.allow_fp16_reduced_precision_reduction`` + - Reduced precision reductions (e.g., with fp16 accumulation type) are + allowed with fp16 GEMMs. + - 2.0 + - ❌ + * - ``matmul.allow_bf16_reduced_precision_reduction`` + - Reduced precision reductions are allowed with bf16 GEMMs. + - 2.0 + - ❌ + * - ``enable_cudnn_sdp`` + - Globally enables cuDNN SDPA's kernels within SDPA. + - 2.0 + - ❌ + * - ``enable_flash_sdp`` + - Globally enables or disables FlashAttention for SDPA. + - 2.1 + - ❌ + * - ``enable_mem_efficient_sdp`` + - Globally enables or disables Memory-Efficient Attention for SDPA. + - 2.1 + - ❌ + * - ``enable_math_sdp`` + - Globally enables or disables the PyTorch C++ implementation within SDPA. + - 2.1 + - ❌ + * - ``allow_fp16_bf16_reduction_math_sdp`` + - Globally enables FP16 and BF16 precision for reduction operations within + SDPA. + - 2.1 + - +.. + FIXME: + - Partial? + +.. Need to validate and extend. + +torch.backends.cudnn +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +Supported ``torch`` options: + +.. list-table:: + :header-rows: 1 + + * - Data type + - Description + - Since PyTorch + - Since ROCm + * - ``allow_tf32`` + - TensorFloat-32 tensor cores may be used in cuDNN convolutions on NVIDIA + Ampere or newer GPUs. 
+ - 1.12.0 + - ❌ + * - ``deterministic`` + - A bool that, if True, causes cuDNN to only use deterministic + convolution algorithms. + - 1.12.0 + - 6.0 + +Automatic mixed precision: torch.amp +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +PyTorch that automates the process of using both 16-bit (half-precision, +float16) and 32-bit (single-precision, float32) floating-point types in model +training and inference. + +.. list-table:: + :header-rows: 1 + + * - Data type + - Description + - Since PyTorch + - Since ROCm + * - Autocasting + - Instances of autocast serve as context managers or decorators that allow + regions of your script to run in mixed precision. + - 1.9 + - 2.5 + * - Gradient scaling + - To prevent underflow, “gradient scaling” multiplies the network’s + loss(es) by a scale factor and invokes a backward pass on the scaled + loss(es). Gradients flowing backward through the network are then + scaled by the same factor. In other words, gradient values have a + larger magnitude, so they don’t flush to zero. + - 1.9 + - 2.5 + * - CUDA op-specific behavior + - These ops always go through autocasting whether they are invoked as part + of a ``torch.nn.Module``, as a function, or as a ``torch.Tensor`` method. If + functions are exposed in multiple namespaces, they go through + autocasting regardless of the namespace. + - 1.9 + - 2.5 + +Distributed library features +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +The PyTorch distributed library includes a collective of parallelism modules, a +communications layer, and infrastructure for launching and debugging large +training jobs. See :ref:`rocm-for-ai-pytorch-distributed` for more information. + +The Distributed Library feature in PyTorch provides tools and APIs for building +and running distributed machine learning workflows. It allows training models +across multiple processes, GPUs, or nodes in a cluster, enabling efficient use +of computational resources and scalability for large-scale tasks. + +.. list-table:: + :header-rows: 1 + + * - Features + - Description + - Since PyTorch + - Since ROCm + * - TensorPipe + - TensorPipe is a point-to-point communication library integrated into + PyTorch for distributed training. It is designed to handle tensor data + transfers efficiently between different processes or devices, including + those on separate machines. + - 1.8 + - 5.4 + * - RPC Device Map Passing + - RPC Device Map Passing in PyTorch refers to a feature of the Remote + Procedure Call (RPC) framework that enables developers to control and + specify how tensors are transferred between devices during remote + operations. It allows fine-grained management of device placement when + sending tensors across nodes in distributed training or execution + scenarios. + - 1.9 + - + * - Gloo + - Gloo is designed for multi-machine and multi-GPU setups, enabling + efficient communication and synchronization between processes. Gloo is + one of the default backends for PyTorch's Distributed Data Parallel + (DDP) and RPC frameworks, alongside other backends like NCCL and MPI. + - 1.0 + - 2.0 + * - MPI + - MPI (Message Passing Interface) in PyTorch refers to the use of the MPI + backend for distributed communication in the ``torch.distributed`` module. + It enables inter-process communication, primarily in distributed + training settings, using the widely adopted MPI standard. 
+ - 1.9 + - + * - TorchElastic + - TorchElastic is a PyTorch library that enables fault-tolerant and + elastic training in distributed environments. It is designed to handle + dynamically changing resources, such as adding or removing nodes during + training, which is especially useful in cloud-based or preemptible + environments. + - 1.9 + - + +.. + FIXME: RPC Device Map Passing "Since ROCm version" + +torch.compiler +^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^^ + +.. list-table:: + :header-rows: 1 + + * - Features + - Description + - Since PyTorch + - Since ROCm + * - ``torch.compiler`` (AOT Autograd) + - Autograd captures not only the user-level code, but also backpropagation, + which results in capturing the backwards pass “ahead-of-time”. This + enables acceleration of both forwards and backwards pass using + ``TorchInductor``. + - 2.0 + - 5.3 + * - ``torch.compiler`` (TorchInductor) + - The default ``torch.compile`` deep learning compiler that generates fast + code for multiple accelerators and backends. You need to use a backend + compiler to make speedups through ``torch.compile`` possible. For AMD, + NVIDIA, and Intel GPUs, it leverages OpenAI Triton as the key building block. + - 2.0 + - 5.3 + +torchaudio +-------------------------------------------------------------------------------- + +The `torchaudio `_ library provides +utilities for processing audio data in PyTorch, such as audio loading, +transformations, and feature extraction. + +To ensure GPU-acceleration with ``torchaudio.transforms``, you need to move audio +data (waveform tensor) explicitly to GPU using ``.to('cuda')``. + +The following ``torchaudio`` features are GPU-accelerated. + +.. list-table:: + :header-rows: 1 + + * - Features + - Description + - Since torchaudio version + - Since ROCm + * - ``torchaudio.transforms.Spectrogram`` + - Generate spectrogram of an input waveform using STFT. + - 0.6.0 + - 4.5 + * - ``torchaudio.transforms.MelSpectrogram`` + - Generate the mel-scale spectrogram of raw audio signals. + - 0.9.0 + - 4.5 + * - ``torchaudio.transforms.MFCC`` + - Extract of MFCC features. + - 0.9.0 + - 4.5 + * - ``torchaudio.transforms.Resample`` + - Resample a signal from one frequency to another + - 0.9.0 + - 4.5 + +torchvision +-------------------------------------------------------------------------------- + +The `torchvision `_ library +provide datasets, model architectures, and common image transformations for +computer vision. + +The following ``torchvision`` features are GPU-accelerated. + +.. list-table:: + :header-rows: 1 + + * - Features + - Description + - Since torchvision version + - Since ROCm + * - ``torchvision.transforms.functional`` + - Provides GPU-compatible transformations for image preprocessing like + resize, normalize, rotate and crop. + - 0.2.0 + - 4.0 + * - ``torchvision.ops`` + - GPU-accelerated operations for object detection and segmentation tasks. + ``torchvision.ops.roi_align``, ``torchvision.ops.nms`` and + ``box_convert``. + - 0.6.0 + - 3.3 + * - ``torchvision.models`` with ``.to('cuda')`` + - ``torchvision`` provides several pre-trained models (ResNet, Faster + R-CNN, Mask R-CNN, ...) that can run on CUDA for faster inference and + training. + - 0.1.6 + - 2.x + * - ``torchvision.io`` + - Video decoding and frame extraction using GPU acceleration with NVIDIA’s + NVDEC and nvJPEG (rocJPEG) on CUDA-enabled GPUs. 
+ - 0.4.0 + - 6.3 + +torchtext +-------------------------------------------------------------------------------- + +The `torchtext `_ library provides +utilities for processing and working with text data in PyTorch, including +tokenization, vocabulary management, and text embeddings. torchtext supports +preprocessing pipelines and integration with PyTorch models, simplifying the +implementation of natural language processing (NLP) tasks. + +To leverage GPU acceleration in torchtext, you need to move tensors +explicitly to the GPU using ``.to('cuda')``. + +* torchtext does not implement its own kernels. ROCm support is enabled by linking against ROCm libraries. + +* Only official release exists. + +torchtune +-------------------------------------------------------------------------------- + +The `torchtune `_ library for +authoring, fine-tuning and experimenting with LLMs. + +* Usage: It works out-of-the-box, enabling developers to fine-tune ROCm PyTorch solutions. + +* Only official release exists. + +torchserve +-------------------------------------------------------------------------------- + +The `torchserve `_ is a PyTorch domain library +for common sparsity and parallelism primitives needed for large-scale recommender +systems. + +* torchtext does not implement its own kernels. ROCm support is enabled by linking against ROCm libraries. + +* Only official release exists. + +torchrec +-------------------------------------------------------------------------------- + +The `torchrec `_ is a PyTorch domain library for +common sparsity and parallelism primitives needed for large-scale recommender +systems. + +* torchrec does not implement its own kernels. ROCm support is enabled by linking against ROCm libraries. + +* Only official release exists. + +Unsupported PyTorch features +---------------------------- + +The following are GPU-accelerated PyTorch features not currently supported by ROCm. + +.. list-table:: + :widths: 30, 60, 10 + :header-rows: 1 + + * - Data type + - Description + - Since PyTorch + * - APEX batch norm + - Use APEX batch norm instead of PyTorch batch norm. + - 1.6.0 + * - ``torch.backends.cuda`` / ``matmul.allow_tf32`` + - A bool that controls whether TensorFloat-32 tensor cores may be used in + matrix multiplications. + - 1.7 + * - ``torch.cuda`` / NVIDIA Tools Extension (NVTX) + - Integration with NVTX for profiling and debugging GPU performance using + NVIDIA's Nsight tools. + - 1.7.0 + * - ``torch.cuda`` / Lazy loading NVRTC + - Delays JIT compilation with NVRTC until the code is explicitly needed. + - 1.8.0 + * - ``torch-tensorrt`` + - Integrate TensorRT library for optimizing and deploying PyTorch models. + ROCm does not have equialent library for TensorRT. + - 1.9.0 + * - ``torch.backends`` / ``cudnn.allow_tf32`` + - TensorFloat-32 tensor cores may be used in cuDNN convolutions. + - 1.10.0 + * - ``torch.backends.cuda`` / ``matmul.allow_fp16_reduced_precision_reduction`` + - Reduced precision reductions with fp16 accumulation type are + allowed with fp16 GEMMs. + - 2.0 + * - ``torch.backends.cuda`` / ``matmul.allow_bf16_reduced_precision_reduction`` + - Reduced precision reductions are allowed with bf16 GEMMs. + - 2.0 + * - ``torch.nn.functional`` / ``scaled_dot_product_attention`` + - Flash attention backend for SDPA to accelerate attention computation in + transformer-based models. + - 2.0 + * - ``torch.backends.cuda`` / ``enable_cudnn_sdp`` + - Globally enables cuDNN SDPA's kernels within SDPA. 
+ - 2.0 + * - ``torch.backends.cuda`` / ``enable_flash_sdp`` + - Globally enables or disables FlashAttention for SDPA. + - 2.1 + * - ``torch.backends.cuda`` / ``enable_mem_efficient_sdp`` + - Globally enables or disables Memory-Efficient Attention for SDPA. + - 2.1 + * - ``torch.backends.cuda`` / ``enable_math_sdp`` + - Globally enables or disables the PyTorch C++ implementation within SDPA. + - 2.1 + * - Dynamic parallelism + - PyTorch itself does not directly expose dynamic parallelism as a core + feature. Dynamic parallelism allow GPU threads to launch additional + threads which can be reached using custom operations via the + ``torch.utils.cpp_extension`` module. + - Not a core feature + * - Unified memory support in PyTorch + - Unified Memory is not directly exposed in PyTorch's core API, it can be + utilized effectively through custom CUDA extensions or advanced + workflows. + - Not a core feature + +Use cases and recommendations +================================================================================ + +* :doc:`Using ROCm for AI: training a model ` provides + guidance on how to leverage the ROCm platform for training AI models. It covers the steps, tools, and best practices + for optimizing training workflows on AMD GPUs using PyTorch features. + +* :doc:`Single-GPU fine-tuning and inference ` + describes and demonstrates how to use the ROCm platform for the fine-tuning and inference of + machine learning models, particularly large language models (LLMs), on systems with a single AMD + Instinct MI300X accelerator. This page provides a detailed guide for setting up, optimizing, and + executing fine-tuning and inference workflows in such environments. + +* :doc:`Multi-GPU fine-tuning and inference optimization ` + describes and demonstrates the fine-tuning and inference of machine learning models on systems + with multi MI300X accelerators. + +* The :doc:`Instinct MI300X workload optimization guide ` provides detailed + guidance on optimizing workloads for the AMD Instinct MI300X accelerator using ROCm. This guide is aimed at helping + users achieve optimal performance for deep learning and other high-performance computing tasks on the MI300X + accelerator. + +* The :doc:`Inception with PyTorch documentation ` + describes how PyTorch integrates with ROCm for AI workloads It outlines the use of PyTorch on the ROCm platform and + focuses on how to efficiently leverage AMD GPU hardware for training and inference tasks in AI applications. + +For more use cases and recommendations, see `ROCm PyTorch blog posts `_ diff --git a/docs/how-to/deep-learning-rocm.rst b/docs/how-to/deep-learning-rocm.rst index 82df7419a9..60944e066a 100644 --- a/docs/how-to/deep-learning-rocm.rst +++ b/docs/how-to/deep-learning-rocm.rst @@ -11,11 +11,9 @@ ROCm provides a comprehensive ecosystem for deep learning development, including deep learning frameworks and libraries such as PyTorch, TensorFlow, and JAX. ROCm works closely with these frameworks to ensure that framework-specific optimizations take advantage of AMD accelerator and GPU architectures. -The following guides cover installation processes for ROCm-aware deep learning frameworks. +The following guides provide information on compatibility and supported features for ROCm-enabled deep learning frameworks. 
+
+torchtext
+--------------------------------------------------------------------------------
+
+The `torchtext `_ library provides
+utilities for processing and working with text data in PyTorch, including
+tokenization, vocabulary management, and text embeddings. torchtext supports
+preprocessing pipelines and integration with PyTorch models, simplifying the
+implementation of natural language processing (NLP) tasks.
+
+To leverage GPU acceleration in torchtext, you need to move tensors
+explicitly to the GPU using ``.to('cuda')``.
+
+* torchtext does not implement its own kernels. ROCm support is enabled by linking against ROCm libraries.
+
+* Only the official release exists.
+
+torchtune
+--------------------------------------------------------------------------------
+
+`torchtune `_ is a library for authoring, fine-tuning,
+and experimenting with LLMs.
+
+* Usage: It works out-of-the-box, enabling developers to fine-tune ROCm PyTorch solutions.
+
+* Only the official release exists.
+
+torchserve
+--------------------------------------------------------------------------------
+
+`torchserve `_ is a tool for serving and scaling
+PyTorch models in production environments.
+
+* torchserve does not implement its own kernels. ROCm support comes through the underlying ROCm PyTorch installation.
+
+* Only the official release exists.
+
+torchrec
+--------------------------------------------------------------------------------
+
+`torchrec `_ is a PyTorch domain library for
+common sparsity and parallelism primitives needed for large-scale recommender
+systems.
+
+* torchrec does not implement its own kernels. ROCm support is enabled by linking against ROCm libraries.
+
+* Only the official release exists.
+
+Unsupported PyTorch features
+--------------------------------------------------------------------------------
+
+The following are GPU-accelerated PyTorch features not currently supported by ROCm.
+
+.. list-table::
+   :widths: 30, 60, 10
+   :header-rows: 1
+
+   * - Features
+     - Description
+     - Since PyTorch
+   * - APEX batch norm
+     - APEX batch norm used in place of the native PyTorch batch norm.
+     - 1.6.0
+   * - ``torch.backends.cuda`` / ``matmul.allow_tf32``
+     - A bool that controls whether TensorFloat-32 tensor cores may be used in
+       matrix multiplications.
+     - 1.7
+   * - ``torch.cuda`` / NVIDIA Tools Extension (NVTX)
+     - Integration with NVTX for profiling and debugging GPU performance using
+       NVIDIA's Nsight tools.
+     - 1.7.0
+   * - ``torch.cuda`` / Lazy loading NVRTC
+     - Delays JIT compilation with NVRTC until the code is explicitly needed.
+     - 1.8.0
+   * - ``torch-tensorrt``
+     - Integrates the TensorRT library for optimizing and deploying PyTorch
+       models. ROCm does not have an equivalent library for TensorRT.
+     - 1.9.0
+   * - ``torch.backends`` / ``cudnn.allow_tf32``
+     - TensorFloat-32 tensor cores may be used in cuDNN convolutions.
+     - 1.10.0
+   * - ``torch.backends.cuda`` / ``matmul.allow_fp16_reduced_precision_reduction``
+     - Reduced precision reductions with fp16 accumulation type are
+       allowed with fp16 GEMMs.
+     - 2.0
+   * - ``torch.backends.cuda`` / ``matmul.allow_bf16_reduced_precision_reduction``
+     - Reduced precision reductions are allowed with bf16 GEMMs.
+     - 2.0
+   * - ``torch.nn.functional`` / ``scaled_dot_product_attention``
+     - Flash attention backend for SDPA to accelerate attention computation in
+       transformer-based models.
+     - 2.0
+   * - ``torch.backends.cuda`` / ``enable_cudnn_sdp``
+     - Globally enables cuDNN SDPA's kernels within SDPA.
+     - 2.0
+   * - ``torch.backends.cuda`` / ``enable_flash_sdp``
+     - Globally enables or disables FlashAttention for SDPA.
+     - 2.1
+   * - ``torch.backends.cuda`` / ``enable_mem_efficient_sdp``
+     - Globally enables or disables Memory-Efficient Attention for SDPA.
+     - 2.1
+   * - ``torch.backends.cuda`` / ``enable_math_sdp``
+     - Globally enables or disables the PyTorch C++ implementation within SDPA.
+     - 2.1
+   * - Dynamic parallelism
+     - PyTorch itself does not directly expose dynamic parallelism as a core
+       feature. Dynamic parallelism allows GPU threads to launch additional
+       threads, which can be achieved through custom operations via the
+       ``torch.utils.cpp_extension`` module.
+     - Not a core feature
+   * - Unified memory support in PyTorch
+     - Unified Memory is not directly exposed in PyTorch's core API, but it can
+       be utilized effectively through custom CUDA extensions or advanced
+       workflows.
+     - Not a core feature
+
+Use cases and recommendations
+================================================================================
+
+* :doc:`Using ROCm for AI: training a model ` provides
+  guidance on how to leverage the ROCm platform for training AI models. It covers the steps, tools, and best practices
+  for optimizing training workflows on AMD GPUs using PyTorch features.
+
+* :doc:`Single-GPU fine-tuning and inference `
+  describes and demonstrates how to use the ROCm platform for the fine-tuning and inference of
+  machine learning models, particularly large language models (LLMs), on systems with a single AMD
+  Instinct MI300X accelerator. This page provides a detailed guide for setting up, optimizing, and
+  executing fine-tuning and inference workflows in such environments.
+
+* :doc:`Multi-GPU fine-tuning and inference optimization `
+  describes and demonstrates the fine-tuning and inference of machine learning models on systems
+  with multiple MI300X accelerators.
+
+* The :doc:`Instinct MI300X workload optimization guide ` provides detailed
+  guidance on optimizing workloads for the AMD Instinct MI300X accelerator using ROCm. This guide is aimed at helping
+  users achieve optimal performance for deep learning and other high-performance computing tasks on the MI300X
+  accelerator.
+
+* The :doc:`Inception with PyTorch documentation `
+  describes how PyTorch integrates with ROCm for AI workloads. It outlines the use of PyTorch on the ROCm platform and
+  focuses on how to efficiently leverage AMD GPU hardware for training and inference tasks in AI applications.
+
+For more use cases and recommendations, see `ROCm PyTorch blog posts `_.
diff --git a/docs/how-to/deep-learning-rocm.rst b/docs/how-to/deep-learning-rocm.rst
index 82df7419a9..60944e066a 100644
--- a/docs/how-to/deep-learning-rocm.rst
+++ b/docs/how-to/deep-learning-rocm.rst
@@ -11,11 +11,9 @@ ROCm provides a comprehensive ecosystem for deep learning development, includin
 deep learning frameworks and libraries such as PyTorch, TensorFlow, and JAX.
 ROCm works closely with these frameworks to ensure that framework-specific
 optimizations take advantage of AMD accelerator and GPU architectures.
-The following guides cover installation processes for ROCm-aware deep learning frameworks.
+The following guides provide information on compatibility and supported features for ROCm-enabled deep learning frameworks.
-* :doc:`PyTorch for ROCm `
-* :doc:`TensorFlow for ROCm `
-* :doc:`JAX for ROCm `
+* :doc:`PyTorch compatibility <../compatibility/pytorch-compatiblity>`
 
 The following chart steps through typical installation workflows for installing deep learning frameworks for ROCm.
 
@@ -23,8 +21,11 @@ The following chart steps through typical installation workflows for installing
    :alt: Flowchart for installing ROCm-aware machine learning frameworks
    :align: center
 
-Find information on version compatibility and framework release notes in :doc:`Third-party support matrix
-`.
+See the installation instructions to get started.
+
+* :doc:`PyTorch for ROCm `
+* :doc:`TensorFlow for ROCm `
+* :doc:`JAX for ROCm `
 
 .. note::