Releases: oneapi-src/oneDNN
v2.6.1
This is a patch release containing the following changes to v2.6:
- Extended depthwise convolution post-op with support for arbitrary filter size, stride, and padding (79b019b)
- Improved GEMM performance with threadpool threading on systems with Intel AVX2 instruction set (2be0060)
- Fixed runtime error in GPU reduction primitive for specific tensor sizes (efbf9b5)
- Improved convolution performance on GPUs with Xe-HPG IP (f8de0c9, c1fb8ac)
- Updated ITT API to 3.22.5 (9b18676)
- Fixed correctness issues in reorder implementation for non-x64 systems (9961b86, 1020631, 8b960df, ef1d9fa, 8edd859, 39edcf6, 3e0a0d9, 1dff625, 8661958)
- Fixed handling of inf and -inf values in eltwise log algorithm (732cbdd, 3fd0f2e)
- Improved depthwise convolution performance on GPUs with Xe-HPG IP (7a6fe1d)
- Addressed failures in test_isa_hints gtest on GPUs (78c1c68)
- Fixed issues with bfloat16 GEMM producing NaNs in certain cases on GPUs with Xe-HPC IP (5d65970)
- Changed default layout to blocked for depthwise convolutions to avoid spurious reorders (78f231b)
- Addressed issue with incorrect values in padded areas for convolution with post-ops on GPUs (2e4ad3a)
- Fixed build issues with -Werror=odr option (27668dd)
- Addressed issues detected by clang USAN in BRGEMM kernel (2bbaa30, 9b3826f, b59b027)
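For context on the eltwise log fix listed above, here is a minimal standalone sketch of how a scalar reference logarithm treats special inputs under IEEE 754 rules. It does not use oneDNN. The helper names `ref_log` and `check_log_special_values` are hypothetical. The release note does not say this is the exact behavior the fix targets; that is an assumption.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Scalar reference logarithm built on the C++ standard library.
// Hypothetical helper for illustration only; this is not oneDNN code.
inline float ref_log(float x) {
    return std::log(x);
}

// Checks the special-value cases named in the release note, assuming
// the intended results match IEEE 754 / std::log semantics.
inline void check_log_special_values() {
    const float inf = std::numeric_limits<float>::infinity();
    assert(std::isinf(ref_log(inf)) && ref_log(inf) > 0.0f); // log(+inf) is +inf
    assert(std::isnan(ref_log(-inf)));                        // log(-inf) is NaN
    assert(ref_log(1.0f) == 0.0f);                            // log(1) is 0
}
```

Calling `check_log_special_values()` from a test is one way to compare a vectorized implementation against these reference cases.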
graph-v0.5.1
This is a patch release containing the following changes to graph-v0.5:
- Fixed the layout propagation of Reshape and Transpose operators in oneDNN backend (3b681d4, 09863f9)
- Enabled scalar Divide + MatMul fusion in oneDNN backend (d4c7dc6)
- Enabled Convolution + LeakyReLU fusion in oneDNN backend (b0f4dbb, c8fb4c1, e15979e)
- Improved the documentation of fusion patterns (b9a5238)
- Fixed operands swapping for binary operators (a07bfda, d2567d7)
- Worked around a false positive build issue in GCC11 for compiler backend (17a40d0)
graph-v0.4.3
This is a patch release containing the following changes to graph-v0.4.2:
graph-v0.5
This is the Alpha release of the oneDNN Graph API, based on the oneDNN v2.6 release.
Functionality
- Introduced FP32 and BF16 training support on CPU.
- Introduced multilayer perceptron (MLP) fusion supported by oneDNN Graph compiler with optimized code generation (experimental).
- Updated API to comply with oneDNN Graph API specification v1.0-alpha.
Known Issues and Limitations
- The weight’s opaque layout can be queried only from a compiled partition, which requires input tensor shapes to be known at compilation time.
- MHA and MLP fusion are not activated on machines without AVX-512 support, as oneDNN Graph compiler generates AVX-512 and newer instructions.
Thanks to the Contributors
This release contains contributions from the project core teams as well as Jiong Gong, Chunyuan Wu, Sanchit Jain, Yiqiang Li, Yunfei Mao, Kiefer Kuah and others.
v2.6
Performance Optimizations
- Intel Architecture Processors
  - Improved performance for future Intel Xeon® Scalable processors (code name Sapphire Rapids). The functionality requires Linux kernel 5.16 or later.
  - Improved performance of matmul primitive for processors with Intel AVX-512 support.
- Intel Graphics Products
  - Improved performance for future Xe Architecture graphics (code name Ponte Vecchio).
  - Improved performance for future Intel Arc graphics (code name Alchemist and DG2).
- AArch64-based Processors
  - Improved binary primitive performance with Arm Compute Library (ACL).
  - Improved shuffle primitive performance for processors with SVE 512 support.
Functionality
- Introduced bfloat16 destination support for int8 convolution, matmul, and inner product primitives for processors with Intel AVX-512 support and for future Intel Xeon® Scalable processors (code name Sapphire Rapids).
- Extended RNN primitive with support for AUGRU cell.
- Added support for non-zero negative slope in ReLU post-op for batch normalization primitive.
- Introduced support for mixed source and destination data types in softmax primitive.
- Introduced persistent cache API. This functionality allows serializing and reusing JIT kernels.
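To illustrate what a non-zero negative slope means for the ReLU post-op on batch normalization, here is a minimal scalar sketch. It is standalone C++ and does not call oneDNN. The function names and parameters (`relu_with_slope`, `bnorm_relu`, `alpha`, `scale`, `shift`) are hypothetical and only model the math, assuming inference-style normalization with known mean and variance.

```cpp
#include <cassert>
#include <cmath>

// ReLU with a configurable negative slope (often called leaky ReLU).
// alpha == 0 reduces to the plain ReLU.
inline float relu_with_slope(float x, float alpha) {
    return x > 0.0f ? x : alpha * x;
}

// Normalizes one value with the given statistics, applies scale and shift,
// then applies ReLU with negative slope alpha.
// Hypothetical helper for illustration; not part of the oneDNN API.
inline float bnorm_relu(float x, float mean, float variance, float eps,
        float scale, float shift, float alpha) {
    const float normalized = (x - mean) / std::sqrt(variance + eps);
    return relu_with_slope(scale * normalized + shift, alpha);
}
```

With `alpha` set to zero, negative normalized values are clamped to zero. With a non-zero `alpha`, they are scaled by `alpha` instead.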
Usability
- Added build-time options to manage the set of supported instruction set architectures on Intel Graphics Products. See ONEDNN_ENABLE_PRIMITIVE_GPU_ISA for more details. This feature further reduces the binary footprint.
- Extended build-time options ONEDNN_ENABLE_PRIMITIVE and ONEDNN_ENABLE_WORKLOAD to GPU implementations. This feature further reduces the binary footprint.
- Reduced stack consumption in GEMM implementation.
- Added command line help to benchdnn.
Deprecated Functionality
- Support for SYCL 1.2.1 (aka SYCL 2017 standard) is deprecated and will be removed in future releases.
Breaking Changes
- Removed performance optimizations for Intel Xeon Phi processors. oneDNN will continue to be functional on these processors using the Intel AVX2 code path.
Thanks to the Contributors
This release contains contributions from the project core team as well as Arthur Mitrano @aaraujom, Aslan @aslanxie, Attila T. Áfra @atafra, Damian Szwichtenberg @dszwicht, Diana Bite @diaena, Joel Dippold @jedippold, Jonathan Deakin @jondea, Jonathan Louis Kaplan @JLouisKaplan-Arm, Kentaro Kawakami @kawakami-k, Luke Ireland @LukeIreland1, Mesut Meterelliyoz @mmeterel, Nathan John Sircombe @nSircombe, Peter Caday @petercad, Tengfei Han @Tengfei09, and Thiago Macieira @thiagomacieira. We would also like to thank everyone who asked questions and reported issues.
v2.5.4
This is a patch release containing the following changes to v2.5.3:
- Improved performance for batch normalization for tbb/threadpool (421a2ce, 7b7b763)
- Fixed implicit conversion from double to float in examples (866b9ac)
- Fixed issue in int8 matmul primitive for specific shapes (035c2d4, 9a1bf19)
- Fixed performance regression for matmul primitive with binary post-op and broadcast (dcd61ef, 31dec32)
- Fixed performance regression in binary primitive when using NHWC layout (228493c)
v2.6-rc
This is a release candidate for oneDNN v2.6. Please provide feedback and submit defect reports via GitHub issues.
Performance Optimizations
- Intel Architecture Processors
  - Improved performance for future Intel Xeon® Scalable processors (code name Sapphire Rapids). The functionality requires Linux kernel 5.16 or later.
  - Improved performance of matmul primitive for processors with Intel AVX-512 support.
- Intel Graphics Products
  - Improved performance for future Xe Architecture graphics (code name Ponte Vecchio).
  - Improved performance for future Intel Arc graphics (code name Alchemist and DG2).
- AArch64-based Processors
  - Improved binary primitive performance with Arm Compute Library (ACL).
  - Improved shuffle primitive performance for processors with SVE 512 support.
Functionality
- Extended RNN primitive with support for AUGRU cell.
- Introduced support for mixed source and destination data types in softmax primitive.
- Introduced persistent cache API. This functionality allows serializing and reusing JIT kernels.
Usability
- Added build-time options to manage the set of supported instruction set architectures on Intel Graphics Products. See ONEDNN_ENABLE_PRIMITIVE_GPU_ISA for more details. This feature further reduces the binary footprint.
- Extended build-time options ONEDNN_ENABLE_PRIMITIVE and ONEDNN_ENABLE_WORKLOAD to GPU implementations. This feature further reduces the binary footprint.
- Reduced stack consumption in GEMM implementation.
- Added command line help to benchdnn.
Deprecated Functionality
- Support for SYCL 1.2.1 (aka SYCL 2017 standard) is deprecated and will be removed in future releases.
Breaking Changes
- Removed performance optimizations for Intel Xeon Phi processors. oneDNN will continue to be functional on these processors using the Intel AVX2 code path.
Thanks to the Contributors
This release contains contributions from the project core team as well as Arthur Mitrano @aaraujom, Aslan @aslanxie, Attila T. Áfra @atafra, Damian Szwichtenberg @dszwicht, Diana Bite @diaena, Joel Dippold @jedippold, Jonathan Deakin @jondea, Jonathan Louis Kaplan @JLouisKaplan-Arm, Kentaro Kawakami @kawakami-k, Luke Ireland @LukeIreland1, Mesut Meterelliyoz @mmeterel, Nathan John Sircombe @nSircombe, Peter Caday @petercad, Tengfei Han @Tengfei09, and Thiago Macieira @thiagomacieira. We would also like to thank everyone who asked questions and reported issues.
graph-v0.4.2
This is a patch release containing the following changes to graph-v0.4.1:
- Fixed compiled partition cache by checking CPU threading number (68f262a, 343246e)
- Enabled binary add and multiply patterns (71a0cfe)
- Fixed the MHA (multi-head attention) patterns in compiler backend and benchdnn graph (45bbcb3, caaf841)
- Fixed the build issues for semi-compiler backend (62dd2ca, 738276a, 347f1a9, 2123326)
v2.5.3
This is a patch release containing the following changes to v2.5.2:
- Fixed accuracy issue in GELU post-op (3ff2c3d)
- Added ability to enable code only on non-x64 systems (ff7ae00)
- Fixed issue in reorder primitive on non-x64 systems (5917860)
- Fixed build issue on OSX11 and older cmake (d9c8bbe)
- Fixed assert in reorder primitive (79090bc)
- Documentation fixes (d290758, ee7eacb, 543b8f8)
- Fixed potential division by zero in example for binary primitive (2fffd96)
- Fixed SIGFPE issue in reorder primitive (8c291fc)
- Fixed potential size overflow in inner product primitive (c10f74a)
- Added logic to reduce the number of threads (tasks spawned for threadpool) for small shapes (8f885e7, 4053989, 49ec406, 2977360)
- Fixed SEGFAULT issue in matmul primitive (62c1170, a993d52)
- Added bf16 support for sum post-op (3d2c37e)
- Added fp:precise compiler flag for Intel Compiler identified as IntelLLVM (1558a4b)
- Fixed issue in bf16 convolution primitive when fused with binary (b379fd9)
- Fixed issue in backward depthwise convolution (d5e4122, f5cac23, eeaa19c)
- Fixed SEGFAULT in int8 convolution with eltwise post-op (32a629f)
- Fixed NaN issue in bf16 backward inner product (0c5e492)
- Fixed performance regression for binary with broadcast (f79b030, 58ce3c1)