diff --git a/docs/build/eps.md b/docs/build/eps.md index 12fc4d3235bb3..40bf99be46bff 100644 --- a/docs/build/eps.md +++ b/docs/build/eps.md @@ -260,13 +260,13 @@ See more information on the OpenVINO™ Execution Provider [here](../execution-p ### Prerequisites {: .no_toc } -1. Install the OpenVINO™ offline/online installer from Intel® Distribution of OpenVINO™TM Toolkit **Release 2024.1** for the appropriate OS and target hardware: - * [Windows - CPU, GPU, NPU](https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/download.html?VERSION=v_2023_1_0&OP_SYSTEM=WINDOWS&DISTRIBUTION=ARCHIVE). - * [Linux - CPU, GPU](https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/download.html?VERSION=v_2023_1_0&OP_SYSTEM=LINUX&DISTRIBUTION=ARCHIVE) +1. Install the OpenVINO™ offline/online installer from Intel® Distribution of OpenVINO™TM Toolkit **Release 2024.3** for the appropriate OS and target hardware: + * [Windows - CPU, GPU, NPU](https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/download.html?PACKAGE=OPENVINO_BASE&VERSION=v_2024_3_0&OP_SYSTEM=WINDOWS&DISTRIBUTION=ARCHIVE). + * [Linux - CPU, GPU](https://www.intel.com/content/www/us/en/developer/tools/openvino-toolkit/download.html?PACKAGE=OPENVINO_BASE&VERSION=v_2024_3_0&OP_SYSTEM=LINUX&DISTRIBUTION=ARCHIVE) Follow [documentation](https://docs.openvino.ai/2024/home.html) for detailed instructions. - *2024.1 is the current recommended OpenVINO™ version. [OpenVINO™ 2023.1](https://docs.openvino.ai/archive/2023.1/home.html) is minimal OpenVINO™ version requirement.* + *2024.3 is the current recommended OpenVINO™ version. [OpenVINO™ 2023.3](https://docs.openvino.ai/2023.3/home.html) is minimal OpenVINO™ version requirement.* 2. Configure the target hardware with specific follow on instructions: * To configure Intel® Processor Graphics(GPU) please follow these instructions: [Windows](https://docs.openvino.ai/latest/openvino_docs_install_guides_configurations_for_intel_gpu.html#gpu-guide-windows), [Linux](https://docs.openvino.ai/latest/openvino_docs_install_guides_configurations_for_intel_gpu.html#linux) @@ -396,75 +396,24 @@ The DirectML execution provider supports building for both x64 and x86 architect --- -## ARM Compute Library +## Arm Compute Library See more information on the ACL Execution Provider [here](../execution-providers/community-maintained/ACL-ExecutionProvider.md). -### Prerequisites -{: .no_toc } - -* Supported backend: i.MX8QM Armv8 CPUs -* Supported BSP: i.MX8QM BSP - * Install i.MX8QM BSP: `source fsl-imx-xwayland-glibc-x86_64-fsl-image-qt5-aarch64-toolchain-4*.sh` -* Set up the build environment -``` -source /opt/fsl-imx-xwayland/4.*/environment-setup-aarch64-poky-linux -alias cmake="/usr/bin/cmake -DCMAKE_TOOLCHAIN_FILE=$OECORE_NATIVE_SYSROOT/usr/share/cmake/OEToolchainConfig.cmake" -``` -* See [Build ARM](inferencing.md#arm) below for information on building for ARM devices - ### Build Instructions {: .no_toc } -1. 
Configure ONNX Runtime with ACL support: -``` -cmake ../onnxruntime-arm-upstream/cmake -DONNX_CUSTOM_PROTOC_EXECUTABLE=/usr/bin/protoc -Donnxruntime_RUN_ONNX_TESTS=OFF -Donnxruntime_GENERATE_TEST_REPORTS=ON -Donnxruntime_DEV_MODE=ON -DPYTHON_EXECUTABLE=/usr/bin/python3 -Donnxruntime_USE_CUDA=OFF -Donnxruntime_USE_NSYNC=OFF -Donnxruntime_CUDNN_HOME= -Donnxruntime_USE_JEMALLOC=OFF -Donnxruntime_ENABLE_PYTHON=OFF -Donnxruntime_BUILD_CSHARP=OFF -Donnxruntime_BUILD_SHARED_LIB=ON -Donnxruntime_USE_EIGEN_FOR_BLAS=ON -Donnxruntime_USE_OPENBLAS=OFF -Donnxruntime_USE_ACL=ON -Donnxruntime_USE_DNNL=OFF -Donnxruntime_USE_MKLML=OFF -Donnxruntime_USE_OPENMP=ON -Donnxruntime_USE_TVM=OFF -Donnxruntime_USE_LLVM=OFF -Donnxruntime_ENABLE_MICROSOFT_INTERNAL=OFF -Donnxruntime_USE_BRAINSLICE=OFF -Donnxruntime_USE_EIGEN_THREADPOOL=OFF -Donnxruntime_BUILD_UNIT_TESTS=ON -DCMAKE_BUILD_TYPE=RelWithDebInfo -``` -The ```-Donnxruntime_USE_ACL=ON``` option will use, by default, the 19.05 version of the Arm Compute Library. To set the right version you can use: -```-Donnxruntime_USE_ACL_1902=ON```, ```-Donnxruntime_USE_ACL_1905=ON```, ```-Donnxruntime_USE_ACL_1908=ON``` or ```-Donnxruntime_USE_ACL_2002=ON```; - -To use a library outside the normal environment you can set a custom path by using ```-Donnxruntime_ACL_HOME``` and ```-Donnxruntime_ACL_LIBS``` tags that defines the path to the ComputeLibrary directory and the build directory respectively. +You must first build Arm Compute Library 24.07 for your platform as described in the [documentation](https://github.com/ARM-software/ComputeLibrary). +See [here](inferencing.md#arm) for information on building for Arm®-based devices. -```-Donnxruntime_ACL_HOME=/path/to/ComputeLibrary```, ```-Donnxruntime_ACL_LIBS=/path/to/build``` +Add the following options to `build.sh` to enable the ACL Execution Provider: - -2. Build ONNX Runtime library, test and performance application: -``` -make -j 6 -``` - -3. Deploy ONNX runtime on the i.MX 8QM board ``` -libonnxruntime.so.0.5.0 -onnxruntime_perf_test -onnxruntime_test_all +--use_acl --acl_home=/path/to/ComputeLibrary --acl_libs=/path/to/ComputeLibrary/build ``` -### Native Build Instructions -{: .no_toc } - -*Validated on Jetson Nano and Jetson Xavier* - -1. Build ACL Library (skip if already built) - - ```bash - cd ~ - git clone -b v20.02 https://github.com/Arm-software/ComputeLibrary.git - cd ComputeLibrary - sudo apt-get install -y scons g++-arm-linux-gnueabihf - scons -j8 arch=arm64-v8a Werror=1 debug=0 asserts=0 neon=1 opencl=1 examples=1 build=native - ``` - -1. Cmake is needed to build ONNX Runtime. Because the minimum required version is 3.13, - it is necessary to build CMake from source. Download Unix/Linux sources from https://cmake.org/download/ - and follow https://cmake.org/install/ to build from source. Version 3.17.5 and 3.18.4 have been tested on Jetson. - -1. Build onnxruntime with --use_acl flag with one of the supported ACL version flags. (ACL_1902 | ACL_1905 | ACL_1908 | ACL_2002) - ---- - -## ArmNN +## Arm NN -See more information on the ArmNN Execution Provider [here](../execution-providers/community-maintained/ArmNN-ExecutionProvider.md). +See more information on the Arm NN Execution Provider [here](../execution-providers/community-maintained/ArmNN-ExecutionProvider.md). 
### Prerequisites {: .no_toc } @@ -480,7 +429,7 @@ source /opt/fsl-imx-xwayland/4.*/environment-setup-aarch64-poky-linux alias cmake="/usr/bin/cmake -DCMAKE_TOOLCHAIN_FILE=$OECORE_NATIVE_SYSROOT/usr/share/cmake/OEToolchainConfig.cmake" ``` -* See [Build ARM](inferencing.md#arm) below for information on building for ARM devices +* See [here](inferencing.md#arm) for information on building for Arm-based devices ### Build Instructions {: .no_toc } @@ -490,20 +439,20 @@ alias cmake="/usr/bin/cmake -DCMAKE_TOOLCHAIN_FILE=$OECORE_NATIVE_SYSROOT/usr/sh ./build.sh --use_armnn ``` -The Relu operator is set by default to use the CPU execution provider for better performance. To use the ArmNN implementation build with --armnn_relu flag +The Relu operator is set by default to use the CPU execution provider for better performance. To use the Arm NN implementation build with --armnn_relu flag ```bash ./build.sh --use_armnn --armnn_relu ``` -The Batch Normalization operator is set by default to use the CPU execution provider. To use the ArmNN implementation build with --armnn_bn flag +The Batch Normalization operator is set by default to use the CPU execution provider. To use the Arm NN implementation build with --armnn_bn flag ```bash ./build.sh --use_armnn --armnn_bn ``` -To use a library outside the normal environment you can set a custom path by providing the --armnn_home and --armnn_libs parameters to define the path to the ArmNN home directory and build directory respectively. -The ARM Compute Library home directory and build directory must also be available, and can be specified if needed using --acl_home and --acl_libs respectively. +To use a library outside the normal environment you can set a custom path by providing the --armnn_home and --armnn_libs parameters to define the path to the Arm NN home directory and build directory respectively. +The Arm Compute Library home directory and build directory must also be available, and can be specified if needed using --acl_home and --acl_libs respectively. ```bash ./build.sh --use_armnn --armnn_home /path/to/armnn --armnn_libs /path/to/armnn/build --acl_home /path/to/ComputeLibrary --acl_libs /path/to/acl/build @@ -519,7 +468,7 @@ See more information on the RKNPU Execution Provider [here](../execution-provide * Supported platform: RK1808 Linux -* See [Build ARM](inferencing.md#arm) below for information on building for ARM devices +* See [here](inferencing.md#arm) for information on building for Arm-based devices * Use gcc-linaro-6.3.1-2017.05-x86_64_aarch64-linux-gnu instead of gcc-linaro-6.3.1-2017.05-x86_64_arm-linux-gnueabihf, and modify CMAKE_CXX_COMPILER & CMAKE_C_COMPILER in tool.cmake: ``` diff --git a/docs/build/inferencing.md b/docs/build/inferencing.md index 4f9886913d078..125623ef28399 100644 --- a/docs/build/inferencing.md +++ b/docs/build/inferencing.md @@ -88,7 +88,8 @@ If you would like to use [Xcode](https://developer.apple.com/xcode/) to build th Without this flag, the cmake build generator will be Unix makefile by default. -Today, Mac computers are either Intel-Based or Apple silicon(aka. ARM) based. By default, ONNX Runtime's build script only generate bits for the CPU ARCH that the build machine has. If you want to do cross-compiling: generate ARM binaries on a Intel-Based Mac computer, or generate x86 binaries on a Mac ARM computer, you can set the "CMAKE_OSX_ARCHITECTURES" cmake variable. For example: +Today, Mac computers are either Intel-Based or Apple silicon-based. 
By default, ONNX Runtime's build script only generate bits for the CPU ARCH that the build machine has. If you want to do cross-compiling: generate arm64 binaries on a Intel-Based Mac computer, or generate x86 binaries on a Mac +system with Apple silicon, you can set the "CMAKE_OSX_ARCHITECTURES" cmake variable. For example: Build for Intel CPUs: ```bash @@ -107,6 +108,61 @@ The last command will generate a fat-binary for both CPU architectures. Note: unit tests will be skipped due to the incompatible CPU instruction set when doing cross-compiling. +#### AIX +In AIX, you can build ONNX Runtime for 64bit using + +* IBM Open XL compiler tool chain. + Minimum required AIX OS version is 7.2. You need to have 17.1.2 compiler PTF5 (17.1.2.5) version. +* GNU GCC compiler tool chain. + Minimum required AIX OS version is 7.3. GCC version 10.3+ is required. + +For IBM Open XL, export below environment settings. +```bash +ulimit -m unlimited +ulimit -d unlimited +ulimit -n 2000 +ulimit -f unlimited +export OBJECT_MODE=64 +export BUILD_TYPE="Release" +export CC="/opt/IBM/openxlC/17.1.2/bin/ibm-clang" +export CXX="/opt/IBM/openxlC/17.1.2/bin/ibm-clang++_r" +export CFLAGS="-pthread -m64 -D_ALL_SOURCE -mcmodel=large -Wno-deprecate-lax-vec-conv-all -Wno-unused-but-set-variable -Wno-unused-command-line-argument -maltivec -mvsx -Wno-unused-variable -Wno-unused-parameter -Wno-sign-compare" +export CXXFLAGS="-pthread -m64 -D_ALL_SOURCE -mcmodel=large -Wno-deprecate-lax-vec-conv-all -Wno-unused-but-set-variable -Wno-unused-command-line-argument -maltivec -mvsx -Wno-unused-variable -Wno-unused-parameter -Wno-sign-compare" +export LDFLAGS="-L$PWD/build/Linux/$BUILD_TYPE/ -lpthread" +export LIBPATH="$PWD/build/Linux/$BUILD_TYPE/" +``` +For GCC, export below environment settings. +```bash +ulimit -m unlimited +ulimit -d unlimited +ulimit -n 2000 +ulimit -f unlimited +export OBJECT_MODE=64 +export BUILD_TYPE="Release" +export CC="gcc" +export CXX="g++" +export CFLAGS="-maix64 -pthread -DFLATBUFFERS_LOCALE_INDEPENDENT=0 -maltivec -mvsx -Wno-unused-function -Wno-unused-variable -Wno-unused-parameter -Wno-sign-compare -fno-extern-tls-init -Wl,-berok " +export CXXFLAGS="-maix64 -pthread -DFLATBUFFERS_LOCALE_INDEPENDENT=0 -maltivec -mvsx -Wno-unused-function -Wno-unused-variable -Wno-unused-parameter -Wno-sign-compare -fno-extern-tls-init -Wl,-berok " +export LDFLAGS="-L$PWD/build/Linux/$BUILD_TYPE/ -Wl,-bbigtoc -lpython3.9" +export LIBPATH="$PWD/build/Linux/$BUILD_TYPE" +``` +To initiate build, run the below command +```bash +./build.sh \ +--config $BUILD_TYPE\ + --build_shared_lib \ + --skip_submodule_sync \ + --cmake_extra_defines CMAKE_INSTALL_PREFIX=$PWD/install \ + --parallel +``` + +* If you want to install the package in a custom directory, then mention the directory location as value of CMAKE_INSTALL_PREFIX. +* In case of IBM Open XL compiler tool chain, It is possible that in AIX 7.2 some of the runtime libraries like libunwind.a needed for onnxruntime, will be missing. To fix this, you can install the relevant file-sets. +* --parallel option in build option. + As name suggest, this option is for parallel building and resource intensive option. So, if your system is not having good amount of memory for each CPU core, then this option can be skipped. +* --allow_running_as_root is needed if root user is triggering the build. + + #### Notes * Please note that these instructions build the debug build, which may have performance tradeoffs. 
The "--config" parameter has four valid values: Debug, Release, RelWithDebInfo and MinSizeRel. Compared to "Release", "RelWithDebInfo" not only has debug info, it also disables some inlines to make the binary easier to debug. Thus RelWithDebInfo is slower than Release. @@ -131,13 +187,14 @@ Note: unit tests will be skipped due to the incompatible CPU instruction set whe ### Architectures {: .no_toc } -| | x86_32 | x86_64 | ARM32v7 | ARM64 | PPC64LE | RISCV64 | -|-----------|:------------:|:------------:|:------------:|:------------:|:-------:|:-------:| -|Windows | YES | YES | YES | YES | NO | NO | -|Linux | YES | YES | YES | YES | YES | YES | -|macOS | NO | YES | NO | NO | NO | NO | -|Android | NO | NO | YES | YES | NO | NO | -|iOS | NO | NO | NO | YES | NO | NO | +| | x86_32 | x86_64 | ARM32v7 | ARM64 | PPC64LE | RISCV64 | PPC64BE | +|-----------|:------------:|:------------:|:------------:|:------------:|:-------:|:-------:| :------:| +|Windows | YES | YES | YES | YES | NO | NO | NO | +|Linux | YES | YES | YES | YES | YES | YES | NO | +|macOS | NO | YES | NO | NO | NO | NO | NO | +|Android | NO | NO | YES | YES | NO | NO | NO | +|iOS | NO | NO | NO | YES | NO | NO | NO | +|AIX | NO | NO | NO | NO | NO | NO | YES | ### Build Environments(Host) {: .no_toc } @@ -311,21 +368,21 @@ ORT_DEBUG_NODE_IO_DUMP_DATA_TO_FILES=1 ``` -### ARM +### Arm -There are a few options for building ONNX Runtime for ARM. +There are a few options for building ONNX Runtime for Arm®-based devices. -First, you may do it on a real ARM device, or on a x86_64 device with an emulator(like qemu), or on a x86_64 device with a docker container with an emulator(you can run an ARM container on a x86_64 PC). Then the build instructions are essentially the same as the instructions for Linux x86_64. However, it wouldn't work if your the CPU you are targeting is not 64-bit since the build process needs more than 2GB memory. +First, you may do it on a real Arm-based device, or on a x86_64 device with an emulator(like qemu), or on a x86_64 device with a docker container with an emulator(you can run an Arm-based container on a x86_64 PC). Then the build instructions are essentially the same as the instructions for Linux x86_64. However, it wouldn't work if your the CPU you are targeting is not 64-bit since the build process needs more than 2GB memory. -* [Cross compiling for ARM with simulation (Linux/Windows)](#cross-compiling-for-arm-with-simulation-linuxwindows) - **Recommended**; Easy, slow, ARM64 only(no support for ARM32) +* [Cross compiling for Arm-based devices with simulation (Linux/Windows)](#cross-compiling-for-arm-based-devices-with-simulation-linuxwindows) - **Recommended**; Easy, slow, ARM64 only(no support for ARM32) * [Cross compiling on Linux](#cross-compiling-on-linux) - Difficult, fast * [Cross compiling on Windows](#cross-compiling-on-windows) -#### Cross compiling for ARM with simulation (Linux/Windows) +#### Cross compiling for Arm-based devices with simulation (Linux/Windows) *EASY, SLOW, RECOMMENDED* -This method relies on qemu user mode emulation. It allows you to compile using a desktop or cloud VM through instruction level simulation. You'll run the build on x86 CPU and translate every ARM instruction to x86. This is much faster than compiling natively on a low-end ARM device. The resulting ONNX Runtime Python wheel (.whl) file is then deployed to an ARM device where it can be invoked in Python 3 scripts. The build process can take hours, and may run of memory if the target CPU is 32-bit. 
+This method relies on qemu user mode emulation. It allows you to compile using a desktop or cloud VM through instruction level simulation. You'll run the build on x86 CPU and translate every Arm architecture instruction to x86. This is potentially much faster than compiling natively on a low-end device. The resulting ONNX Runtime Python wheel (.whl) file is then deployed to an Arm-based device where it can be invoked in Python 3 scripts. The build process can take hours, and may run of memory if the target CPU is 32-bit. #### Cross compiling on Linux @@ -364,12 +421,12 @@ This option is very fast and allows the package to be built in minutes, but is c You must also know what kind of flags your target hardware need, which can differ greatly. For example, if you just get the normal ARMv7 compiler and use it for Raspberry Pi V1 directly, it won't work because Raspberry Pi only has ARMv6. Generally every hardware vendor will provide a toolchain; check how that one was built. - A target env is identifed by: + A target env is identified by: * Arch: x86_32, x86_64, armv6,armv7,arvm7l,aarch64,... * OS: bare-metal or linux. * Libc: gnu libc/ulibc/musl/... - * ABI: ARM has mutilple ABIs like eabi, eabihf... + * ABI: Arm has multiple ABIs like eabi, eabihf... You can get all these information from the previous output, please be sure they are all correct. @@ -528,8 +585,8 @@ This option is very fast and allows the package to be built in minutes, but is c **Using Visual C++ compilers** -1. Download and install Visual C++ compilers and libraries for ARM(64). - If you have Visual Studio installed, please use the Visual Studio Installer (look under the section `Individual components` after choosing to `modify` Visual Studio) to download and install the corresponding ARM(64) compilers and libraries. +1. Download and install Visual C++ compilers and libraries for Arm(64). + If you have Visual Studio installed, please use the Visual Studio Installer (look under the section `Individual components` after choosing to `modify` Visual Studio) to download and install the corresponding Arm(64) compilers and libraries. 2. Use `.\build.bat` and specify `--arm` or `--arm64` as the build option to start building. Preferably use `Developer Command Prompt for VS` or make sure all the installed cross-compilers are findable from the command prompt being used to build using the PATH environmant variable. diff --git a/docs/execution-providers/CUDA-ExecutionProvider.md b/docs/execution-providers/CUDA-ExecutionProvider.md index 97374ff6e096d..81c0c4d270de3 100644 --- a/docs/execution-providers/CUDA-ExecutionProvider.md +++ b/docs/execution-providers/CUDA-ExecutionProvider.md @@ -35,12 +35,13 @@ Because of [Nvidia CUDA Minor Version Compatibility](https://docs.nvidia.com/dep ONNX Runtime built with cuDNN 8.x is not compatible with cuDNN 9.x, and vice versa. You can choose the package based on CUDA and cuDNN major versions that match your runtime environment (For example, PyTorch 2.3 uses cuDNN 8.x, while PyTorch 2.4 or later used cuDNN 9.x). -### CUDA 12.x +Note: starting ORT 1.19, **CUDA 12.x** becomes default version when distributing ONNX Runtime GPU packages in pypi. -To install CUDA 12 package, please look at [Install ORT](../install). +### CUDA 12.x | ONNX Runtime | CUDA | cuDNN | Notes | |---------------|--------|-------|----------------------------------------------------------------------| +| 1.19.x | 12.x | 9.x | Avaiable in pypi. Compatible with PyTorch >= 2.4.0 for cuda 12.x. 
| | 1.18.1 | 12.x | 9.x | cuDNN 9 is required. No Java package. | | 1.18.0 | 12.x | 8.x | Java package is added. | | 1.17.x | 12.x | 8.x | Only C++/C# Nuget and Python packages are released. No Java package. | @@ -49,7 +50,8 @@ To install CUDA 12 package, please look at [Install ORT](../install). | ONNX Runtime | CUDA | cuDNN | Notes | |----------------------|--------|-----------------------------------------|---------------------------------------------------------------------------------------------------------------------------------------------| -| 1.18.x | 11.8 | 8.x | | +| 1.19.x | 11.8 | 8.x | Not available in pypi. See [Install ORT](../install) for detail. Compatible with PyTorch <= 2.3.1 for CUDA 11.8. | +| 1.18.x | 11.8 | 8.x | Available in pypi | | 1.17
1.16
1.15 | 11.8 | 8.2.4 (Linux)
8.5.0.96 (Windows) | Tested with CUDA versions from 11.6 up to 11.8, and cuDNN from 8.2 up to 8.9 | | 1.14
1.13 | 11.6 | 8.2.4 (Linux)
8.5.0.96 (Windows) | libcudart 11.4.43
libcufft 10.5.2.100
libcurand 10.2.5.120
libcublasLt 11.6.5.2
libcublas 11.6.5.2
libcudnn 8.2.4 | | 1.12
1.11 | 11.4 | 8.2.4 (Linux)
8.2.2.26 (Windows) | libcudart 11.4.43
libcufft 10.5.2.100
libcurand 10.2.5.120
libcublasLt 11.6.5.2
libcublas 11.6.5.2
libcudnn 8.2.4 | diff --git a/docs/execution-providers/CoreML-ExecutionProvider.md b/docs/execution-providers/CoreML-ExecutionProvider.md index af752b1a85e7e..6ffa77edc60b5 100644 --- a/docs/execution-providers/CoreML-ExecutionProvider.md +++ b/docs/execution-providers/CoreML-ExecutionProvider.md @@ -128,10 +128,12 @@ Operators that are supported by the CoreML Execution Provider when a NeuralNetwo |ai.onnx.ReduceSum|| |ai.onnx:Relu|| |ai.onnx:Reshape|| -|ai.onnx:Resize|| +|ai.onnx:Resize|4D input.
`coordinate_transformation_mode` == `asymmetric`.
`mode` == `linear` or `nearest`.
`nearest_mode` == `floor`.
`exclude_outside` == false.
`scales` or `sizes` must be constant.| |ai.onnx:Shape|Attribute `start` with non-default value is not supported.
Attribute `end` is not supported.| |ai.onnx:Sigmoid|| |ai.onnx:Slice|Inputs `starts`, `ends`, `axes`, and `steps` should be constant. Empty slice is not supported.| +|ai.onnx:Softmax|| +|ai.onnx:Split|If provided, `splits` must be constant.| |ai.onnx:Squeeze|| |ai.onnx:Sqrt|| |ai.onnx:Sub|| @@ -147,15 +149,26 @@ Operators that are supported by the CoreML Execution Provider when a MLProgram m |ai.onnx:Add|| |ai.onnx:AveragePool|Only 2D Pool is supported currently. 3D and 5D support can be added if needed.| |ai.onnx:Clip|| +|ai.onnx:Concat|| |ai.onnx:Conv|Only 1D/2D Conv is supported.
Bias if provided must be constant.| +|ai.onnx:ConvTranspose|Weight and bias must be constant.
padding_type of SAME_UPPER/SAME_LOWER is not supported.
kernel_shape must have default values.
output_shape is not supported.
output_padding must have default values.| +|ai.onnx.DepthToSpace|If 'mode' is 'CRD' the input must have a fixed shape.| |ai.onnx:Div|| |ai.onnx:Gemm|Input B must be constant.| |ai.onnx:GlobalAveragePool|Only 2D Pool is supported currently. 3D and 5D support can be added if needed.| |ai.onnx:GlobalMaxPool|Only 2D Pool is supported currently. 3D and 5D support can be added if needed.| +|ai.onnx:GridSample|4D input.
'mode' of 'linear' or 'zeros'.
(mode==linear && padding_mode==reflection && align_corners==0) is not supported.| +|ai.onnx.LeakyRelu|| |ai.onnx:MatMul|Only support for transA == 0, alpha == 1.0 and beta == 1.0 is currently implemented.| |ai.onnx:MaxPool|Only 2D Pool is supported currently. 3D and 5D support can be added if needed.| |ai.onnx:Mul|| |ai.onnx:Pow|Only supports cases when both inputs are fp32.| |ai.onnx:Relu|| |ai.onnx:Reshape|| +|ai.onnx:Resize|See [resize_op_builder.cc](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/core/providers/coreml/builders/impl/resize_op_builder.cc) implementation. There are too many permutations to describe the valid combinations.| +|ai.onnx.Slice|starts/ends/axes/steps must be constant initializers.| +|ai.onnx.Split|If provided, `splits` must be constant.| |ai.onnx:Sub|| +|ai.onnx:Sigmoid|| +|ai.onnx:Tanh|| +|ai.onnx:Transpose|| diff --git a/docs/execution-providers/EP-Context-Design.md b/docs/execution-providers/EP-Context-Design.md new file mode 100644 index 0000000000000..8e5ffcbb962dd --- /dev/null +++ b/docs/execution-providers/EP-Context-Design.md @@ -0,0 +1,82 @@ +--- +title: EP context design +description: ONNX Runtime EP context cache feature design +parent: Execution Providers +nav_order: 16 +redirect_from: /docs/reference/execution-providers/EP-Context-Design +--- + +# OnnxRuntime EP context cache feature design +{: .no_toc } + +## Contents +{: .no_toc } + +* TOC placeholder +{:toc} + +## Background + +OnnxRuntime Execution Providers enable users to inference Onnx model on different kinds of hardware accelerators empowered by backend SDKs (like QNN, OpenVINO, Vitis AI, etc). The Execution Providers converts the Onnx model into graph format required by the backend SDK, and compiles it into the format required by the hardware. Specific to NPU world, the converting and compiling process takes a long time to complete, especially for LLM models. The session creation time costs tens of minutes for some cases which impacts the user experience badly. +To avoid the converting and compiling cost, most of the backend SDKs provide the feature to dump the pre-compiled model into binary file. The pre-compiled model can be loaded by backend SDK directly and executed on the target device. It improves the session creation time greatly by using this way. In order to achieve this, OnnxRuntime defined a contribute Op called EPContext in MS domain. + +## EPContext Op Schema + +Op domain: com.microsoft +Node inputs & outputs: variadic +Domain: com.microsoft +Atrribures: + +|Attributes |Data type|Description | +|---------------------|---------|----------------------------------------------------------------------------------------------------------| +|main_context |int64 |1 (default): This node points to an EP context content that contains the graph referred to by this node.
0: The node does not point to any EP context content. The graph is expected to come from another node whose main_context field is 1.
Some EPs support a single context that contains multiple graphs. The EPContext node with main_context=1 refers to the real context, and that context contains the graphs referred to by the other nodes with main_context=0.|
+|ep_cache_context |string |Payload of the EP context if embed_mode=1, or path to the context file if embed_mode=0.
The path is a relative path to the Onnx model file. It can be a file name, or subfolder/filename| +|embed_mode |int64 |1(default): ep_cache_context contains the payload of context content.
0: ep_cache_context is the context binary file path.| +|ep_sdk_version |string |Optional. SDK version that used to generate the node. | +|onnx_model_filename |string |Optional. Original Onnx model file name. | +|hardware_architecture|string |Optional. Hardware architecture.| +|partition_name |string |Optional. OnnxRuntime partitioned graph name.| +|source |string |Optional. The source used to generate the node. Should be a key identified by the EP so that OnnxRuntime can support multiple EPContext nodes run with different EPs. For example, QNN EP only accepts nodes with source=QNN or QnnExecutionProvider, OpenVINO EP only accepts nodes with source=OpenVINOExecutionProvider.| +|notes |string |Optional. Additional information required by specific EP. | + +

+*Figure: EP Context node example*
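+
+As a concrete illustration of the schema above, the following is a minimal sketch that builds a single-node EPContext wrapper model with the ONNX Python helpers. The tensor names, shapes, opset versions, and the `model_ctx.bin` / `model_ctx.onnx` file names are placeholders, not values mandated by the schema.
+
+```python
+import onnx
+from onnx import TensorProto, helper
+
+# Hypothetical graph inputs/outputs; real shapes and types come from the compiled model.
+inp = helper.make_tensor_value_info("input_0", TensorProto.FLOAT, [1, 3, 224, 224])
+out = helper.make_tensor_value_info("output_0", TensorProto.FLOAT, [1, 1000])
+
+# EPContext node that references an external context binary (embed_mode=0).
+ep_context_node = helper.make_node(
+    "EPContext",
+    inputs=["input_0"],
+    outputs=["output_0"],
+    name="EPContext_0",
+    domain="com.microsoft",
+    main_context=1,
+    embed_mode=0,
+    ep_cache_context="model_ctx.bin",  # path relative to the Onnx model file
+    source="QnnExecutionProvider",
+)
+
+graph = helper.make_graph([ep_context_node], "ep_context_graph", [inp], [out])
+model = helper.make_model(
+    graph,
+    opset_imports=[helper.make_opsetid("", 17), helper.make_opsetid("com.microsoft", 1)],
+)
+onnx.save(model, "model_ctx.onnx")
+```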

+ +## OnnxRuntime Session options related to EP context cache generation and inference + +|Session option |Description | +|---------------------------|----------------------------------------------------------------------------------------------------------| +|ep.context_enable |Used for context model generation only.
1: Enable OnnxRuntime to dump the context cache model.
0 (default): disable.| +|ep.context_file_path |Specify the file path for the dump model.
Defaults to original_file_name.onnx_ctx.onnx for context model generation.
For model inference, if the user loads the model from a memory buffer and the EP context binary is outside the Onnx model, the user needs to set this option. The OnnxRuntime EP uses this path to get the folder path, and combines it with the ep_cache_context value (which points to the context binary path) to get the absolute path of the context binary file.|
+|ep.context_embed_mode |Used for context model generation only.
1 (default): dump the EP context content into the Onnx model, inside ep_cache_context node attribute.
0: dump the EP context content into a separate file, keep the file name in the Onnx model. File path tracked in ep_cache_context node attribute.| +|ep.context_node_name_prefix|Used for context model generation only.
Specify the EPContext node name (also the partition_name attribute and the internal graph name) prefix to make it unique across nodes, in case the user glues multiple EPContext nodes into one model, to avoid conflicts.|
+
+## EP Context cache model generation workflow
+
+OnnxRuntime EPs should follow these rules when creating the EP context cache model so that the user interface stays unified.
+1. ep.context_enable
+   OnnxRuntime creates the EP context cache model if ep.context_enable = 1. Otherwise (ep.context_enable = 0, the default), it just runs the normal workflow.
+2. ep.context_file_path
+   OnnxRuntime appends "_ctx.onnx" to the input file name to form the output file name if no ep.context_file_path is provided. Otherwise it uses the file path provided by the user.
+   ep.context_file_path is required if the user loads the model from a memory buffer, since there is no way for OnnxRuntime to get the input file path in this scenario.
+3. ep.context_embed_mode
+   1 (default): dump the EP context content into the Onnx model.
+   0: dump the EP context content as a separate file. The EP decides the file name and tracks it in the EPContext node attribute ep_cache_context. The separate file should always be at the same location as the dumped Onnx model file, and the file path tracked in the EPContext node is a relative path to the Onnx model file. Note: a subfolder is allowed.
+4. ep.context_node_name_prefix
+   In case the user wants to add a special tag inside the EPContext node name (also the partition_name attribute and graph name), the EP should provide this capability when it creates the EPContext nodes.
+   This is useful if the user wants to glue EPContext nodes from multiple models into one model and there is a risk that node names (graph names) conflict across models. This depends on the EP implementation. QNN EP supports multiple EPContext nodes, so the user can merge and re-connect EPContext nodes from different models.
+
+## Inference from EP Context cache model workflow
+
+OnnxRuntime EPs which support loading Onnx models with EPContext nodes should follow this workflow for model inference.
+1. The EP should be able to identify a model that has EPContext nodes.
+   a. The EP follows its normal workflow if there are no EPContext nodes inside the model.
+   b. If the Onnx model has EPContext nodes:
+      i. The EP should check the source attribute of all EPContext nodes to make sure there is at least one EPContext node for this EP (the source attribute matches the key required by the EP).
+      ii. The EP only partitions in the EPContext nodes whose source attribute matches the key required by the EP.
+      iii. The EP loads from the cached context inside the EPContext node.
+2. If the context cache Onnx model is dumped with embed_mode = 0, there is a separate context binary file beside the Onnx model in the same folder.
+   a. The OnnxRuntime EP gets the context binary file's relative path from the EPContext ep_cache_context node attribute.
+   b. If the user loads the model from an Onnx model file path, the EP should get the input model folder path and combine it with the relative path from step a) to form the full path of the context binary file.
+   c. If the user loads the model from a memory buffer, the user needs to provide the session option ep.context_file_path. The EP gets the folder path from ep.context_file_path and combines it with the relative path from step a) to form the full path of the context binary file.
+
+
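+
+As an illustration of the generation workflow above, a minimal Python sketch follows. The input/output model paths and the choice of QNN EP are placeholders (any EP that supports EPContext can be used), and EP-specific provider options are omitted for brevity.
+
+```python
+import onnxruntime as ort
+
+so = ort.SessionOptions()
+# Dump the EP context cache model while creating the session.
+so.add_session_config_entry("ep.context_enable", "1")
+# Keep the compiled context in a separate binary file next to the dumped model.
+so.add_session_config_entry("ep.context_embed_mode", "0")
+# Optional output path; otherwise "_ctx.onnx" is appended to the input file name.
+so.add_session_config_entry("ep.context_file_path", "./model_ctx.onnx")
+
+# Session creation compiles the model and writes the context model and binary.
+ort.InferenceSession("model.onnx", sess_options=so, providers=["QNNExecutionProvider"])
+```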

+*Figure: EP Context nodes with different EPs*
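+
+For the memory-buffer case described in the inference workflow above, a hedged sketch follows; the file names are placeholders and EP-specific provider options are omitted.
+
+```python
+import onnxruntime as ort
+
+with open("./model_ctx.onnx", "rb") as f:
+    model_bytes = f.read()
+
+so = ort.SessionOptions()
+# When loading from a buffer, point ORT at the dumped model's location so the
+# relative ep_cache_context path can be resolved to the context binary file.
+so.add_session_config_entry("ep.context_file_path", "./model_ctx.onnx")
+
+session = ort.InferenceSession(model_bytes, sess_options=so,
+                               providers=["QNNExecutionProvider"])
+```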

diff --git a/docs/execution-providers/OpenVINO-ExecutionProvider.md b/docs/execution-providers/OpenVINO-ExecutionProvider.md index 39ec668bc0bf9..fa71f70b0c277 100644 --- a/docs/execution-providers/OpenVINO-ExecutionProvider.md +++ b/docs/execution-providers/OpenVINO-ExecutionProvider.md @@ -20,7 +20,7 @@ Accelerate ONNX models on Intel CPUs, GPUs, NPU with Intel OpenVINO™ Execution ## Install Pre-built packages and Docker images are published for OpenVINO™ Execution Provider for ONNX Runtime by Intel for each release. -* OpenVINO™ Execution Provider for ONNX Runtime Release page: [Latest v5.2 Release](https://github.com/intel/onnxruntime/releases) +* OpenVINO™ Execution Provider for ONNX Runtime Release page: [Latest v5.4 Release](https://github.com/intel/onnxruntime/releases) * Python wheels Ubuntu/Windows: [onnxruntime-openvino](https://pypi.org/project/onnxruntime-openvino/) * Docker image: [openvino/onnxruntime_ep_ubuntu20](https://hub.docker.com/r/openvino/onnxruntime_ep_ubuntu20) @@ -30,10 +30,9 @@ ONNX Runtime OpenVINO™ Execution Provider is compatible with three lastest rel |ONNX Runtime|OpenVINO™|Notes| |---|---|---| +|1.19.0|2024.3|[Details](https://github.com/intel/onnxruntime/releases/tag/v5.4)| +|1.18.0|2024.1|[Details](https://github.com/intel/onnxruntime/releases/tag/v5.3)| |1.17.1|2023.3|[Details](https://github.com/intel/onnxruntime/releases/tag/v5.2)| -|1.16.0|2023.1|[Details](https://github.com/intel/onnxruntime/releases/tag/v5.1)| -|1.15.0|2023.0|[Details](https://github.com/intel/onnxruntime/releases/tag/v5.0.0)| -|1.14.0|2022.3|[Details](https://github.com/intel/onnxruntime/releases/tag/v4.3)| ## Build @@ -200,8 +199,30 @@ For more information on Multi-Device plugin of OpenVINO™, please refer to the [Intel OpenVINO™ Multi Device Plugin](https://docs.openvino.ai/latest/openvino_docs_OV_UG_Running_on_multiple_devices.html). ### Export OpenVINO Compiled Blob -Export the OpenVINO compiled blob as an ONNX model. Using this ONNX model for subsequent inferences avoids model recompilation and could have a positive impact on Session creation time. The exported model is saved to the same directory as the source model with the suffix -ov_{device}_blob.onnx where device can be one of the supported like CPU or NPU. This feature is currently enabled for fully supported models only. -Refer to [Configuration Options](#configuration-options) for more information about using these runtime options. +Export the OpenVINO compiled blob as an ONNX model. Using this ONNX model for subsequent inferences avoids model recompilation and could have a positive impact on Session creation time. This feature is currently enabled for fully supported models only. It complies with the ORT session config keys +``` + Ort::SessionOptions session_options; + + // Enable EP context feature to dump the partitioned graph which includes the EP context into Onnx file. + // "0": disable. (default) + // "1": enable. + + session_options.AddConfigEntry(kOrtSessionOptionEpContextEnable, "1"); + + // Flag to specify whether to dump the EP context into single Onnx model or pass bin path. + // "0": dump the EP context into separate file, keep the file name in the Onnx model. + // "1": dump the EP context into the Onnx model. (default). + + session_options.AddConfigEntry(kOrtSessionOptionEpContextEmbedMode, "1"); + + // Specify the file path for the Onnx model which has EP context. 
+ // Defaults to /original_file_name_ctx.onnx if not specified + + session_options.AddConfigEntry(kOrtSessionOptionEpContextFilePath, ".\ov_compiled_epctx.onnx"); + + sess = onnxruntime.InferenceSession(, session_options) +``` +Refer to [Session Options](https://github.com/microsoft/onnxruntime/blob/main/include/onnxruntime/core/session/onnxruntime_session_options_config_keys.h) for more information about session options. ### Enable QDQ Optimizations Passes Optimizes ORT quantized models for the NPU device to only keep QDQs for supported ops and optimize for performance and accuracy.Generally this feature will give better performance/accuracy with ORT Optimizations disabled. @@ -239,8 +260,7 @@ The session configuration options are passed to SessionOptionsAppendExecutionPro ``` OrtOpenVINOProviderOptions options; -options.device_type = "GPU"; -options.precision = "FP32"; +options.device_type = "GPU_FP32"; options.num_of_threads = 8; options.cache_dir = ""; options.context = 0x123456ff; @@ -277,7 +297,6 @@ The following table lists all the available configuration options for API 2.0 an | context | string | OpenCL Context | void* | This option is only available when OpenVINO EP is built with OpenCL flags enabled. It takes in the remote context i.e the cl_context address as a void pointer.| | enable_opencl_throttling | string | True/False | boolean | This option enables OpenCL queue throttling for GPU devices (reduces CPU utilization when using GPU). | | enable_qdq_optimizer | string | True/False | boolean | This option enables QDQ Optimization to improve model performance and accuracy on NPU. | -| export_ep_ctx_blob | string | True/False | boolean | This options enables exporting the OpenVINO Compiled Blob as an ONNX Operator EPContext. | Valid Hetero or Multi or Auto Device combinations: diff --git a/docs/execution-providers/QNN-ExecutionProvider.md b/docs/execution-providers/QNN-ExecutionProvider.md index 7558ea51582e1..1cf50ecadc517 100644 --- a/docs/execution-providers/QNN-ExecutionProvider.md +++ b/docs/execution-providers/QNN-ExecutionProvider.md @@ -431,6 +431,51 @@ g_ort->AddSessionConfigEntry(session_options, kOrtSessionOptionEpContextEmbedMod options.add_session_config_entry("ep.context_embed_mode", "0") ``` +## QNN EP weight sharing + +### Weight sharing in Onnx domain +Weight sharing in Onnx means multiple Onnx models with external weights point to the same external weight file. The Onnx models share same tensor names so that they reference to the same tensor data. +

+*Figure: Weight sharing across Onnx models*

+
+### Weight sharing in QNN domain
+QNN weight sharing is enabled with a pre-generated QNN context binary. It requires users to generate the context binary offline on a Linux x86_64 or Windows x86_64 machine (Windows support since QNN 2.26). The QNN context binary contains multiple graphs that share the same tensors.
+

+*Figure: Weight sharing in QNN context binary*

+
+### Converting Onnx models with weight sharing to a QNN context binary
+OnnxRuntime converts Onnx models with weight sharing into a QNN context binary with weight sharing as follows:
+1. Create a QNN context with the weight sharing configuration enabled.
+2. Convert and compile model1.onnx into the QNN context (get Qnn graph1).
+3. Convert and compile model2.onnx into the QNN context (get Qnn graph2).
+4. Repeat step 2 for any additional models.
+5. Generate the QNN context binary file and the wrapper Onnx models with EPContext nodes.
+OnnxRuntime QNN EP provides the [OnnxRuntime_qnn_ctx_gen](https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/test/qnn_ctx_gen) tool to complete these steps.
+Example command line:
+```
+./onnxruntime_qnn_ctx_gen -i "soc_model|60 htp_graph_finalization_optimization_mode|3" ./model1.onnx,./model2.onnx
+```
+It creates 2 Onnx models (model1.onnx_ctx.onnx, model2.onnx_ctx.onnx) and a QNN context binary file (model2.onnx_ctx.onnx_xxx.bin).
+

+*Figure: Weight sharing from Onnx to QNN*
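+
+A sketch of consuming the two generated context models with resource sharing enabled is shown below (the full workflow is described in the next section). The file names and the `backend_path` value are placeholders.
+
+```python
+import onnxruntime as ort
+
+so = ort.SessionOptions()
+# Share EP resources (the deserialized QNN graphs) across sessions.
+so.add_session_config_entry("ep.share_ep_contexts", "1")
+
+qnn = ("QNNExecutionProvider", {"backend_path": "QnnHtp.dll"})
+
+# Session 1 loads the shared context binary and keeps the unused graphs in the shared place.
+session1 = ort.InferenceSession("model1.onnx_ctx.onnx", sess_options=so, providers=[qnn])
+# Session 2 picks up its graph from the shared place instead of reloading the binary.
+session2 = ort.InferenceSession("model2.onnx_ctx.onnx", sess_options=so, providers=[qnn])
+
+# Destroy the second session before the first one when shutting down.
+del session2
+del session1
+```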

+If user creates the QNN context binary .bin file weight sharing from QNN toolchain (qnn-context-binary-generator). The context binary .bin file looks the same. User needs to create model1.onnx and model2.onnx with EPContext node which points to this .bin file. Each EPContext node should refer (node name and partition_name) to different Qnn graph names from the QNN context. Here’s an example script for reference [gen_qnn_ctx_onnx_model.py](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/qnn/gen_qnn_ctx_onnx_model.py) which wraps one single QNN graph into EPContext node. + +### Inference with QNN resource sharing workflow +OnnxRuntime inference session need to have resource sharing enabled (set session option ep.share_ep_contexts to 1) to use the dumped Qnn context model with weight sharing enabled. +1. Create OnnxRuuntime inference session with ep.share_ep_contexts=1, loads the model1.onnx_ctx.onnx model. + 1.1 The session loads the model1.onnx_ctx.onnx model. + 1.2 The shared place is empty. + 1.3 EPContext node1 in model1.onnx_ctx.onnx specifies that it uses Qnn_graph1 + 1.4 QNN EP loads the qnn_ctx.bin and deserialize the binary to get Qnn graphs (Qnn_graph1, Qnn_graph2). + 1.5 Uses Qnn_graph1 for this OnnxRuntime session. + 1.6 Put the Qnn_graph2 into the shared place. +2. Create OnnxRuuntime inference session with ep.share_ep_contexts=1, loads the model2.onnx_ctx.onnx model. + 2.1 The session loads the model2.onnx_ctx.onnx model. + 2.2 The EPContext node2 in model2.onnx_ctx.onnx specifies that it uses Qnn_graph2. + 2.3 The shared place has Qnn_graph2. + 2.4 QNN EP skips loading qnn_ctx.bin since it gets what it wants from the shared place. + 2.5 Uses Qnn_graph2 from the shared place for this session. +3. To avoid issues while existing execution, user needs to destroy the 2nd session first, then the 1st session. + +[Code example](https://github.com/microsoft/onnxruntime/blob/291a5352b27ded5714e5748b381f2efb88f28fb9/onnxruntime/test/providers/qnn/qnn_ep_context_test.cc#L979-L992). + ## Usage ### C++ C API details are [here](../get-started/with-c.md). diff --git a/docs/execution-providers/TensorRT-ExecutionProvider.md b/docs/execution-providers/TensorRT-ExecutionProvider.md index 3671f418c5078..ded86899eee6e 100644 --- a/docs/execution-providers/TensorRT-ExecutionProvider.md +++ b/docs/execution-providers/TensorRT-ExecutionProvider.md @@ -27,21 +27,24 @@ See [Build instructions](../build/eps.md#tensorrt). ## Requirements -| ONNX Runtime | TensorRT | CUDA | -| :----------- | :------- | :--------- | -| 1.18-main | 10.0 | 11.8, 12.2 | -| 1.17 | 8.6 | 11.8, 12.2 | -| 1.16 | 8.6 | 11.8 | -| 1.15 | 8.6 | 11.8 | -| 1.14 | 8.5 | 11.6 | -| 1.12-1.13 | 8.4 | 11.4 | -| 1.11 | 8.2 | 11.4 | -| 1.10 | 8.0 | 11.4 | -| 1.9 | 8.0 | 11.4 | -| 1.7-1.8 | 7.2 | 11.0.3 | -| 1.5-1.6 | 7.1 | 10.2 | -| 1.2-1.4 | 7.0 | 10.1 | -| 1.0-1.1 | 6.0 | 10.0 | +Note: starting ORT 1.19, **CUDA 12** becomes default version when distributing ONNX Runtime GPU packages. 
+ +| ONNX Runtime | TensorRT | CUDA | +| :----------- | :------- | :------------- | +| 1.19-main | 10.2 | **12.x**, 11.8 | +| 1.18 | 10.0 | 11.8, 12.x | +| 1.17 | 8.6 | 11.8, 12.x | +| 1.16 | 8.6 | 11.8 | +| 1.15 | 8.6 | 11.8 | +| 1.14 | 8.5 | 11.6 | +| 1.12-1.13 | 8.4 | 11.4 | +| 1.11 | 8.2 | 11.4 | +| 1.10 | 8.0 | 11.4 | +| 1.9 | 8.0 | 11.4 | +| 1.7-1.8 | 7.2 | 11.0.3 | +| 1.5-1.6 | 7.1 | 10.2 | +| 1.2-1.4 | 7.0 | 10.1 | +| 1.0-1.1 | 6.0 | 10.0 | For more details on CUDA/cuDNN versions, please see [CUDA EP requirements](./CUDA-ExecutionProvider.md#requirements). @@ -565,7 +568,7 @@ export ORT_TENSORRT_CONTEXT_MEMORY_SHARING_ENABLE=1 ## TensorRT EP Caches -There are three major TRT EP cahces: +There are three major TRT EP caches: * TRT timing cache * TRT engine cache * Embedded engine model / EPContext model diff --git a/docs/execution-providers/Vitis-AI-ExecutionProvider.md b/docs/execution-providers/Vitis-AI-ExecutionProvider.md index 655b563bcaff4..6e95434e2b7c5 100644 --- a/docs/execution-providers/Vitis-AI-ExecutionProvider.md +++ b/docs/execution-providers/Vitis-AI-ExecutionProvider.md @@ -27,9 +27,9 @@ The following table lists AMD targets that are supported by the Vitis AI ONNX Ru | **Architecture** | **Family** | **Supported Targets** | **Supported OS** | |---------------------------------------------------|------------------------------------------------------------|------------------------------------------------------------|------------------------------------------------------------| | AMD64 | Ryzen AI | AMD Ryzen 7040U, 7040HS | Windows | -| ARM64 Cortex-A53 | Zynq UltraScale+ MPSoC | ZCU102, ZCU104, KV260 | Linux | -| ARM64 Cortex-A72 | Versal AI Core / Premium | VCK190 | Linux | -| ARM64 Cortex-A72 | Versal AI Edge | VEK280 | Linux | +| Arm® Cortex®-A53 | Zynq UltraScale+ MPSoC | ZCU102, ZCU104, KV260 | Linux | +| Arm® Cortex®-A72 | Versal AI Core / Premium | VCK190 | Linux | +| Arm® Cortex®-A72 | Versal AI Edge | VEK280 | Linux | AMD Adaptable SoC developers can also leverage the Vitis AI ONNX Runtime Execution Provider to support custom (chip-down) designs. diff --git a/docs/execution-providers/Xnnpack-ExecutionProvider.md b/docs/execution-providers/Xnnpack-ExecutionProvider.md index c1900aa841860..f58929a0d6c1a 100644 --- a/docs/execution-providers/Xnnpack-ExecutionProvider.md +++ b/docs/execution-providers/Xnnpack-ExecutionProvider.md @@ -8,7 +8,7 @@ nav_order: 9 # XNNPACK Execution Provider -Accelerate ONNX models on Android/iOS devices and WebAssembly with ONNX Runtime and the XNNPACK execution provider. [XNNPACK](https://github.com/google/XNNPACK) is a highly optimized library of floating-point neural network inference operators for ARM, WebAssembly, and x86 platforms. +Accelerate ONNX models on Android/iOS devices and WebAssembly with ONNX Runtime and the XNNPACK execution provider. [XNNPACK](https://github.com/google/XNNPACK) is a highly optimized library of floating-point neural network inference operators for Arm®-based, WebAssembly, and x86 platforms. 
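+
+For example, a minimal Python sketch that opts a model into XNNPACK on a build that includes the EP (the model path is a placeholder):
+
+```python
+import onnxruntime as ort
+
+# XNNPACK handles the operators it supports; the default CPU EP covers the rest.
+session = ort.InferenceSession(
+    "model.onnx",
+    providers=["XnnpackExecutionProvider", "CPUExecutionProvider"],
+)
+```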
## Contents {: .no_toc } diff --git a/docs/execution-providers/community-maintained/ACL-ExecutionProvider.md b/docs/execution-providers/community-maintained/ACL-ExecutionProvider.md index f894dcc86f1a1..02a0edf4e743d 100644 --- a/docs/execution-providers/community-maintained/ACL-ExecutionProvider.md +++ b/docs/execution-providers/community-maintained/ACL-ExecutionProvider.md @@ -10,14 +10,7 @@ redirect_from: /docs/reference/execution-providers/ACL-ExecutionProvider # ACL Execution Provider {: .no_toc } -The integration of ACL as an execution provider (EP) into ONNX Runtime accelerates performance of ONNX model workloads across Armv8 cores. [Arm Compute Library](https://github.com/ARM-software/ComputeLibrary){:target="_blank"} is an open source inference engine maintained by Arm and Linaro companies. - - -## Contents -{: .no_toc } - -* TOC placeholder -{:toc} +The ACL Execution Provider enables accelerated performance on Arm®-based CPUs through [Arm Compute Library](https://github.com/ARM-software/ComputeLibrary){:target="_blank"}. ## Build @@ -30,10 +23,44 @@ For build instructions, please see the [build page](../../build/eps.md#arm-compu ``` Ort::Env env = Ort::Env{ORT_LOGGING_LEVEL_ERROR, "Default"}; Ort::SessionOptions sf; -bool enable_cpu_mem_arena = true; -Ort::ThrowOnError(OrtSessionOptionsAppendExecutionProvider_ACL(sf, enable_cpu_mem_arena)); +bool enable_fast_math = true; +Ort::ThrowOnError(OrtSessionOptionsAppendExecutionProvider_ACL(sf, enable_fast_math)); ``` The C API details are [here](../../get-started/with-c.html). +### Python +{: .no_toc } + +``` +import onnxruntime + +providers = [("ACLExecutionProvider", {"enable_fast_math": "true"})] +sess = onnxruntime.InferenceSession("model.onnx", providers=providers) +``` + ## Performance Tuning -When/if using [onnxruntime_perf_test](https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/test/perftest){:target="_blank"}, use the flag -e acl +Arm Compute Library has a fast math mode that can increase performance with some potential decrease in accuracy for MatMul and Conv operators. It is disabled by default. + +When using [onnxruntime_perf_test](https://github.com/microsoft/onnxruntime/tree/main/onnxruntime/test/perftest){:target="_blank"}, use the flag `-e acl` to enable the ACL Execution Provider. You can additionally use `-i 'enable_fast_math|true'` to enable fast math. + +Arm Compute Library uses the ONNX Runtime intra-operator thread pool when running via the execution provider. You can control the size of this thread pool using the `-x` option. 
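+
+For example, a Python sketch that enables fast math and sizes the intra-operator thread pool (the model path and thread count are placeholders):
+
+```python
+import onnxruntime as ort
+
+so = ort.SessionOptions()
+so.intra_op_num_threads = 4  # Arm Compute Library work runs on this thread pool
+
+providers = [("ACLExecutionProvider", {"enable_fast_math": "true"})]
+session = ort.InferenceSession("model.onnx", sess_options=so, providers=providers)
+```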
+ +## Supported Operators + +|Operator|Supported types| +|---|---| +|AveragePool|float| +|BatchNormalization|float| +|Concat|float| +|Conv|float, float16| +|FusedConv|float| +|FusedMatMul|float, float16| +|Gemm|float| +|GlobalAveragePool|float| +|GlobalMaxPool|float| +|MatMul|float, float16| +|MatMulIntegerToFloat|uint8, int8, uint8+int8| +|MaxPool|float| +|NhwcConv|float| +|Relu|float| +|QLinearConv|uint8, int8, uint8+int8| diff --git a/docs/execution-providers/community-maintained/ArmNN-ExecutionProvider.md b/docs/execution-providers/community-maintained/ArmNN-ExecutionProvider.md index 57d07af02bc3a..e38a0a75ef92d 100644 --- a/docs/execution-providers/community-maintained/ArmNN-ExecutionProvider.md +++ b/docs/execution-providers/community-maintained/ArmNN-ExecutionProvider.md @@ -7,7 +7,7 @@ nav_order: 2 redirect_from: /docs/reference/execution-providers/ArmNN-ExecutionProvider --- -# ArmNN Execution Provider +# Arm NN Execution Provider {: .no_toc} ## Contents @@ -16,14 +16,14 @@ redirect_from: /docs/reference/execution-providers/ArmNN-ExecutionProvider * TOC placeholder {:toc} -Accelerate performance of ONNX model workloads across Armv8 cores with the ArmNN execution provider. [ArmNN](https://github.com/ARM-software/armnn) is an open source inference engine maintained by Arm and Linaro companies. +Accelerate performance of ONNX model workloads across Arm®-based devices with the Arm NN execution provider. [Arm NN](https://github.com/ARM-software/armnn) is an open source inference engine maintained by Arm and Linaro companies. ## Build -For build instructions, please see the [BUILD page](../../build/eps.md#armnn). +For build instructions, please see the [BUILD page](../../build/eps.md#arm-nn). ## Usage ### C/C++ -To use ArmNN as execution provider for inferencing, please register it as below. +To use Arm NN as execution provider for inferencing, please register it as below. ``` Ort::Env env = Ort::Env{ORT_LOGGING_LEVEL_ERROR, "Default"}; Ort::SessionOptions so; diff --git a/docs/execution-providers/index.md b/docs/execution-providers/index.md index 1e2c13abcf67f..52687f6f48d2c 100644 --- a/docs/execution-providers/index.md +++ b/docs/execution-providers/index.md @@ -24,9 +24,9 @@ ONNX Runtime supports many different execution providers today. 
Some of the EPs |CPU|GPU|IoT/Edge/Mobile|Other| ---|---|---|--- |Default CPU|[NVIDIA CUDA](../execution-providers/CUDA-ExecutionProvider.md)|[Intel OpenVINO](../execution-providers/OpenVINO-ExecutionProvider.md)|[Rockchip NPU](../execution-providers/community-maintained/RKNPU-ExecutionProvider.md) (*preview*)| -|[Intel DNNL](../execution-providers/oneDNN-ExecutionProvider.md)|[NVIDIA TensorRT](../execution-providers/TensorRT-ExecutionProvider.md)|[ARM Compute Library](../execution-providers/community-maintained/ACL-ExecutionProvider.md) (*preview*)|[Xilinx Vitis-AI](../execution-providers/Vitis-AI-ExecutionProvider.md) (*preview*)| +|[Intel DNNL](../execution-providers/oneDNN-ExecutionProvider.md)|[NVIDIA TensorRT](../execution-providers/TensorRT-ExecutionProvider.md)|[Arm Compute Library](../execution-providers/community-maintained/ACL-ExecutionProvider.md) (*preview*)|[Xilinx Vitis-AI](../execution-providers/Vitis-AI-ExecutionProvider.md) (*preview*)| |[TVM](../execution-providers/community-maintained/TVM-ExecutionProvider.md) (*preview*)|[DirectML](../execution-providers/DirectML-ExecutionProvider.md)|[Android Neural Networks API](../execution-providers/NNAPI-ExecutionProvider.md)|[Huawei CANN](../execution-providers/community-maintained/CANN-ExecutionProvider.md) (*preview*)| -|[Intel OpenVINO](../execution-providers/OpenVINO-ExecutionProvider.md)|[AMD MIGraphX](../execution-providers/MIGraphX-ExecutionProvider.md)|[ARM-NN](../execution-providers/community-maintained/ArmNN-ExecutionProvider.md) (*preview*)|[AZURE](../execution-providers/Azure-ExecutionProvider.md) (*preview*)| +|[Intel OpenVINO](../execution-providers/OpenVINO-ExecutionProvider.md)|[AMD MIGraphX](../execution-providers/MIGraphX-ExecutionProvider.md)|[Arm NN](../execution-providers/community-maintained/ArmNN-ExecutionProvider.md) (*preview*)|[AZURE](../execution-providers/Azure-ExecutionProvider.md) (*preview*)| |[XNNPACK](../execution-providers/Xnnpack-ExecutionProvider.md)|[Intel OpenVINO](../execution-providers/OpenVINO-ExecutionProvider.md)|[CoreML](../execution-providers/CoreML-ExecutionProvider.md) (*preview*)| ||[AMD ROCm](../execution-providers/ROCm-ExecutionProvider.md)|[TVM](../execution-providers/community-maintained/TVM-ExecutionProvider.md) (*preview*)| ||[TVM](../execution-providers/community-maintained/TVM-ExecutionProvider.md) (*preview*)|[Qualcomm QNN](../execution-providers/QNN-ExecutionProvider.md)| diff --git a/docs/genai/howto/build-from-source.md b/docs/genai/howto/build-from-source.md index 012d8ea2fd048..1fbcab494e3fa 100644 --- a/docs/genai/howto/build-from-source.md +++ b/docs/genai/howto/build-from-source.md @@ -16,7 +16,7 @@ nav_order: 2 ## Pre-requisites - `cmake` -- `.Net v6` (if building C#) +- `.NET6` (if building C#) ## Clone the onnxruntime-genai repo @@ -25,11 +25,10 @@ git clone https://github.com/microsoft/onnxruntime-genai cd onnxruntime-genai ``` -## Install ONNX Runtime +## Download ONNX Runtime binaries -By default, the onnxruntime-genai build expects to find the ONNX Runtime include and binaries in a folder called `ort` in the root directory of onnxruntime-genai. You can put the ONNX Runtime files in a different location and specify this location to the onnxruntime-genai build via the --ort_home command line argument. +By default, the onnxruntime-genai build expects to find the ONNX Runtime include and binaries in a folder called `ort` in the root directory of onnxruntime-genai. 
You can put the ONNX Runtime files in a different location and specify this location to the onnxruntime-genai build via the `--ort_home` command line argument. -### Option 1: Install from release These instructions assume you are in the `onnxruntime-genai` folder. @@ -38,9 +37,9 @@ These instructions assume you are in the `onnxruntime-genai` folder. These instruction use `win-x64`. Replace this if you are using a different architecture. ```bash -curl -L https://github.com/microsoft/onnxruntime/releases/download/v1.18.0/onnxruntime-win-x64-1.18.0.zip -o onnxruntime-win-x64-1.18.0.zip -tar xvf onnxruntime-win-x64-1.18.0.zip -move onnxruntime-win-x64-1.18.0 ort +curl -L https://github.com/microsoft/onnxruntime/releases/download/v1.19.2/onnxruntime-win-x64-1.19.2.zip -o onnxruntime-win-x64-1.19.2.zip +tar xvf onnxruntime-win-x64-1.19.2.zip +move onnxruntime-win-x64-1.19.2 ort ``` #### Linux and Mac @@ -48,151 +47,86 @@ move onnxruntime-win-x64-1.18.0 ort These instruction use `linux-x64-gpu`. Replace this if you are using a different architecture. ```bash -curl -L https://github.com/microsoft/onnxruntime/releases/download/v1.18.0/onnxruntime-linux-x64-gpu-1.18.0.tgz -o onnxruntime-linux-x64-gpu-1.18.0.tgz -tar xvzf onnxruntime-linux-x64-gpu-1.18.0.tgz -mv onnxruntime-linux-x64-gpu-1.18.0 ort +curl -L https://github.com/microsoft/onnxruntime/releases/download/v1.19.2/onnxruntime-linux-x64-gpu-1.19.2.tgz -o onnxruntime-linux-x64-gpu-1.19.2.tgz +tar xvzf onnxruntime-linux-x64-gpu-1.19.2.tgz +mv onnxruntime-linux-x64-gpu-1.19.2 ort ``` -### Option 2: Install from nightly +#### Android -Download the nightly nuget package `Microsoft.ML.OnnxRuntime` from: https://aiinfra.visualstudio.com/PublicPackages/_artifacts/feed/ORT-Nightly. - -Extract the nuget package. - -```bash -tar xvf Microsoft.ML.OnnxRuntime.1.18.0-dev-20240322-0323-ca825cb6e6.nupkg -``` - -Copy the include and lib files into `ort`. - -On Windows - -Example is given for `win-x64`. Change this to your architecture if different. - -```cmd -copy build\native\include\onnxruntime_c_api.h ort\include -copy runtimes\win-x64\native\*.dll ort\lib -``` - -On Linux - -Example is given for `linux-x64`. Change this to your architecture if different. - -```cmd -cp build/native/include/onnxruntime_c_api.h ort/include -cp build/linux-x64/native/libonnxruntime*.so* ort/lib -``` - -### Option 3: Build from source - -#### Clone the onnxruntime repo +If you do not already have an `ort` folder, create one. ```bash -cd .. 
-git clone https://github.com/microsoft/onnxruntime.git -cd onnxruntime +mkdir ort ``` -#### Build ONNX Runtime for CPU on Windows - ```bash -build.bat --build_shared_lib --skip_tests --parallel --config Release -copy include\onnxruntime\core\session\onnxruntime_c_api.h ..\onnxruntime-genai\ort\include -copy build\Windows\Release\Release\*.dll ..\onnxruntime-genai\ort\lib -copy build\Windows\Release\Release\onnxruntime.lib ..\onnxruntime-genai\ort\lib -``` - -#### Build ONNX Runtime for DirectML on Windows - -```bash -build.bat --build_shared_lib --skip_tests --parallel --use_dml --config Release -copy include\onnxruntime\core\session\onnxruntime_c_api.h ..\onnxruntime-genai\ort\include -copy include\onnxruntime\core\providers\dml\dml_provider_factory.h ..\onnxruntime-genai\ort\include -copy build\Windows\Release\Release\*.dll ..\onnxruntime-genai\ort\lib -copy build\Windows\Release\Release\onnxruntime.lib ..\onnxruntime-genai\ort\lib +curl -L https://repo1.maven.org/maven2/com/microsoft/onnxruntime/onnxruntime-android/1.19.2/onnxruntime-android-1.19.2.aar -o ort/onnxruntime-android-1.19.2.aar +cd ort +tar xvf onnxruntime-android-1.19.2.aar +cd .. ``` +## Build the generate() API -#### Build ONNX Runtime for CUDA on Windows - -```bash -build.bat --build_shared_lib --skip_tests --parallel --use_cuda --config Release -copy include\onnxruntime\core\session\onnxruntime_c_api.h ..\onnxruntime-genai\ort\include -copy include\onnxruntime\core\providers\cuda\*.h ..\onnxruntime-genai\ort\include -copy build\Windows\Release\Release\*.dll ..\onnxruntime-genai\ort\lib -copy build\Windows\Release\Release\onnxruntime.lib ..\onnxruntime-genai\ort\lib -``` +This step assumes that you are in the root of the onnxruntime-genai repo, and you have followed the previous steps to copy the onnxruntime headers and binaries into the folder specified by , which defaults to `onnxruntime-genai/ort`. -#### Build ONNX Runtime on Linux +All of the build commands below have a `--config` argument, which takes the following options: +- `Release` builds release binaries +- `Debug` build binaries with debug symbols +- `RelWithDebInfo` builds release binaries with debug info -```bash -./build.sh --build_shared_lib --skip_tests --parallel [--use_cuda] --config Release -cp include/onnxruntime/core/session/onnxruntime_c_api.h ../onnxruntime-genai/ort/include -cp build/Linux/Release/libonnxruntime*.so* ../onnxruntime-genai/ort/lib -``` +### Build Python API -You may need to provide extra command line options for building with CUDA on Linux. An example full command is as follows. +#### Windows CPU build ```bash -./build.sh --parallel --build_shared_lib --use_cuda --cuda_version 11.8 --cuda_home /usr/local/cuda-11.8 --cudnn_home /usr/lib/x86_64-linux-gnu/ --config Release --build_wheel --skip_tests --cmake_extra_defines CMAKE_CUDA_ARCHITECTURES="80" --cmake_extra_defines CMAKE_CUDA_COMPILER=/usr/local/cuda-11.8/bin/nvcc +python build.py --config Release ``` -Replace the values given above for different versions and locations of CUDA. 
- -#### Build ONNX Runtime on Mac +#### Windows DirectML build ```bash -./build.sh --build_shared_lib --skip_tests --parallel --config Release -cp include/onnxruntime/core/session/onnxruntime_c_api.h ../onnxruntime-genai/ort/include -cp build/MacOS/Release/libonnxruntime*.dylib* ../onnxruntime-genai/ort/lib +python build.py --use_dml --config Release ``` -## Build the generate() API - -This step assumes that you are in the root of the onnxruntime-genai repo, and you have followed the previos steps to copy the onnxruntime headers and binaries into the folder specified by , which defaults to `onnxruntime-genai/ort`. +#### Linux build ```bash -cd ../onnxruntime-genai +python build.py --config Release ``` -### Build Python API - -#### Build for Windows CPU +#### Linux CUDA build ```bash -python build.py +python build.py --use_cuda --config Release ``` -#### Build for Windows DirectML +#### Mac build ```bash -python build.py --use_dml +python build.py --config Release ``` -#### Build on Linux +### Build Java API ```bash -python build.py +python build.py --build_java --config Release ``` -#### Build on Linux with CUDA - -```bash -python build.py --use_cuda -``` +### Build for Android -#### Build on Mac +If building on Windows, install `ninja`. ```bash -python build.py +pip install ninja ``` -### Build Java API +Run the build script. ```bash -python build.py --build_java --config Release +python build.py --build_java --android --android_home --android_ndk_path --android_abi [armeabi-v7a|arm64-v8a|x86|x86_64] --config Release ``` -Change config to Debug for debug builds. ## Install the library into your application @@ -203,12 +137,28 @@ cd build/wheel pip install *.whl ``` -### Install .jar +### Install NuGet + +_Coming soon_ + +### Install JAR Copy `build/Windows/Release/src/java/build/libs/*.jar` into your application. -### Install Nuget package +### Install AAR + +Copy `build/Android/Release/src/java/build/android/outputs/aar/onnxruntime-genai-release.aar` into your application. + ### Install C/C++ header file and library -_Coming soon_ +#### Windows + +Use the header in `src\ort_genai.h` and the libraries in `build\Windows\Release` + +#### Linux + +Use the header in `src/ort_genai.h` and the libraries in `build/Linux/Release` + + + diff --git a/docs/genai/howto/install.md b/docs/genai/howto/install.md index 86f969c8ccf32..3d5e8f6c90944 100644 --- a/docs/genai/howto/install.md +++ b/docs/genai/howto/install.md @@ -21,14 +21,12 @@ Note: only one of these sets of packages (CPU, DirectML, CUDA) should be install ### CPU ```bash -pip install numpy pip install onnxruntime-genai ``` ### DirectML ```bash -pip install numpy pip install onnxruntime-genai-directml ``` @@ -43,15 +41,13 @@ Ensure that the `CUDA_PATH` environment variable is set to the location of your #### CUDA 11 ```bash -pip install numpy -pip install onnxruntime-genai-cuda --index-url=https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-genai/pypi/simple/ +pip install onnxruntime-genai-cuda --index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-cuda-11/pypi/simple/ ``` #### CUDA 12 ```bash -pip install numpy -pip install onnxruntime-genai-cuda --index-url=https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-cuda-12/pypi/simple/ +pip install onnxruntime-genai-cuda ``` @@ -65,16 +61,10 @@ Note: install only one of these packages (CPU, DirectML, CUDA) in your project. 
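A quick way to confirm that whichever Python package you chose above installed correctly is to import it and load a model. The following is only a minimal sketch under stated assumptions: the model folder path is a placeholder and should point at any ONNX model prepared for the generate() API (for example, one of the Phi-3 ONNX models referenced later in this document).

```python
# Minimal sanity check for an onnxruntime-genai Python install.
# "path/to/model_folder" is a placeholder for a generate()-compatible ONNX model folder.
import onnxruntime_genai as og

model = og.Model("path/to/model_folder")      # load the model folder
tokenizer = og.Tokenizer(model)               # build the matching tokenizer
tokens = tokenizer.encode("Hello from ONNX Runtime generate()!")
print("Loaded model and encoded", len(tokens), "tokens")
```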
ONNX Runtime generate() versions 0.3.0 and earlier came bundled with the core ONNX Runtime binaries. From version 0.4.0 onwards, the packages are separated to allow a more flexible developer experience.

-Version 0.4.0-rc1 depends on the ONNX Runtime version 1.19.0 RC. To install 0.4.0-rc1, add the following nuget source *before* installing the ONNX Runtime generate() nuget package.
-
-```
-dotnet nuget add source https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/nuget/v3/index.json --name ORT-Nightly
-```
-
### CPU

```bash
-dotnet add package Microsoft.ML.OnnxRuntimeGenAI --prerelease
+dotnet add package Microsoft.ML.OnnxRuntimeGenAI
```

### CUDA

@@ -82,13 +72,13 @@ dotnet add package Microsoft.ML.OnnxRuntimeGenAI --prerelease

Note: only CUDA 11 is supported for versions 0.3.0 and earlier, and only CUDA 12 is supported for versions 0.4.0 and later.

```bash
-dotnet add package Microsoft.ML.OnnxRuntimeGenAI.Cuda --prerelease
+dotnet add package Microsoft.ML.OnnxRuntimeGenAI.Cuda
```

### DirectML

```bash
-dotnet add package Microsoft.ML.OnnxRuntimeGenAI.DirectML --prerelease
+dotnet add package Microsoft.ML.OnnxRuntimeGenAI.DirectML
```

diff --git a/docs/genai/howto/troubleshoot.md b/docs/genai/howto/troubleshoot.md
index 9f0fe8c389338..fc055754bccff 100644
--- a/docs/genai/howto/troubleshoot.md
+++ b/docs/genai/howto/troubleshoot.md
@@ -31,4 +31,21 @@ The onnxruntime-genai Python package should run without error after this extra s

### Windows CUDA import error

-After CUDA toolkit installation completed on windows, ensure that the `CUDA_PATH` system environment variable has been set to the path where the toolkit was installed. This variable will be used when importing the onnxruntime_genai python module on Windows. Unset or incorrectly set `CUDA_PATH` variable may lead to a `DLL load failed while importing onnxruntime_genai`. \ No newline at end of file
+```
+DLL load failed while importing onnxruntime_genai
+```
+
+After the CUDA toolkit installation has completed on Windows, ensure that the `CUDA_PATH` system environment variable is set to the path where the toolkit was installed. This variable is used when importing the onnxruntime_genai Python module on Windows. An unset or incorrectly set `CUDA_PATH` variable may lead to a `DLL load failed while importing onnxruntime_genai` error.
+
+### Transformers / Tokenizers incompatibility with ONNX Runtime generate()
+
+```
+RuntimeError: [json.exception.type_error.302] type must be string, but is array
+```
+
+This error occurs when generating models with the Model Builder.
+
+A change in HuggingFace transformers version 4.45.0 caused an incompatibility with onnxruntime-genai versions 0.4.0 and earlier, which was resolved in 0.5.0. There are two alternative workarounds you can employ to fix this issue:
+
+- Option 1: downgrade your transformers version to one lower than v4.45.0 (the version in which the above change was introduced)
+- Option 2: build onnxruntime-genai from source, using the instructions at https://onnxruntime.ai/docs/genai/howto/build-from-source.html
diff --git a/docs/genai/tutorials/phi3-python.md b/docs/genai/tutorials/phi3-python.md
index 563cd5d3967f0..ed6af9d98f1ab 100644
--- a/docs/genai/tutorials/phi3-python.md
+++ b/docs/genai/tutorials/phi3-python.md
@@ -13,7 +13,7 @@ nav_order: 2

## Introduction
{: .no_toc }

-Phi-3 ONNX models are hosted on HuggingFace and you can run them with the ONNX Runtime generate() API. 
+Phi-3 and Phi-3.5 ONNX models are hosted on HuggingFace and you can run them with the ONNX Runtime generate() API.

The mini (3.3B) and medium (14B) versions are available now. Both mini and medium have a short (4k) context version and a long (128k) context version. The long context version can accept much longer prompts and produce longer output text, but it does consume more memory.

@@ -28,6 +28,9 @@ Available models are:
* [https://huggingface.co/microsoft/Phi-3-medium-128k-instruct-onnx-cpu](https://huggingface.co/microsoft/Phi-3-medium-128k-instruct-onnx-cpu)
* [https://huggingface.co/microsoft/Phi-3-medium-128k-instruct-onnx-cuda](https://huggingface.co/microsoft/Phi-3-medium-128k-instruct-onnx-cuda)
* [https://huggingface.co/microsoft/Phi-3-medium-128k-instruct-onnx-directml](https://huggingface.co/microsoft/Phi-3-medium-128k-instruct-onnx-directml/)
+* [https://huggingface.co/microsoft/Phi-3.5-mini-instruct-onnx](https://huggingface.co/microsoft/Phi-3.5-mini-instruct-onnx)
+
+This tutorial demonstrates how to download and run the short context (4k) mini (3B) variant of the Phi-3 model. See the [model reference](#phi-3-onnx-model-reference) for download commands for the other variants.

This tutorial downloads and runs the short context (4k) mini (3B) model variant. See the [model reference](#phi-3-onnx-model-reference) for download commands for the other variants.

@@ -264,3 +267,16 @@ python phi3-qa.py -m Phi-3-medium-128k-instruct-onnx-cuda/cuda-int4-rtn-block-32

git clone https://huggingface.co/microsoft/Phi-3-medium-128k-instruct-onnx-directml
python phi3-qa.py -m Phi-3-medium-128k-instruct-onnx-directml/directml-int4-awq-block-128
```
+
+### Phi-3.5 mini 128k context CUDA
+```bash
+huggingface-cli download microsoft/Phi-3.5-mini-instruct-onnx --include cuda/cuda-int4-awq-block-128/* --local-dir .
+python phi3-qa.py -m cuda/cuda-int4-awq-block-128
+```
+
+### Phi-3.5 mini 128k context CPU
+
+```bash
+huggingface-cli download microsoft/Phi-3.5-mini-instruct-onnx --include cpu_and_mobile/cpu-int4-awq-block-128-acc-level-4/* --local-dir .
+python phi3-qa.py -m cpu_and_mobile/cpu-int4-awq-block-128-acc-level-4
+```
diff --git a/docs/genai/tutorials/phi3-v.md b/docs/genai/tutorials/phi3-v.md
index ee4c70038cd01..e4aa4f75dca6e 100644
--- a/docs/genai/tutorials/phi3-v.md
+++ b/docs/genai/tutorials/phi3-v.md
@@ -13,14 +13,14 @@ image: /images/coffee.png

The Phi-3 vision model is a small, but powerful multi modal model that allows you to use both image and text to output text. It is used in scenarios such as describing the content of images in detail.

-The Phi-3 vision model is supported by versions of onnxruntime-genai 0.3.0-rc2 and later.
+The Phi-3 vision model is supported by versions of onnxruntime-genai 0.3.0 and later.

You can download the models here:

* [https://huggingface.co/microsoft/Phi-3-vision-128k-instruct-onnx-cpu](https://huggingface.co/microsoft/Phi-3-vision-128k-instruct-onnx-cpu)
+* [https://huggingface.co/microsoft/Phi-3-vision-128k-instruct-onnx-directml](https://huggingface.co/microsoft/Phi-3-vision-128k-instruct-onnx-directml)
* [https://huggingface.co/microsoft/Phi-3-vision-128k-instruct-onnx-cuda](https://huggingface.co/microsoft/Phi-3-vision-128k-instruct-onnx-cuda)

-Support for DirectML is coming soon!

* TOC placeholder
{:toc}

@@ -46,13 +46,10 @@ Support for DirectML is coming soon!

## Choose your platform

If you have an NVIDIA GPU, that will give the best performance right now.
-
-The models will also run on CPU, but they will be slower. 
- -Support for Windows machines with GPUs other than NVIDIA is coming soon! **Note: Only one package and model is required based on your hardware. That is, only execute the steps for one of the following sections** + ## Run with NVIDIA CUDA 1. Download the model @@ -60,6 +57,7 @@ Support for Windows machines with GPUs other than NVIDIA is coming soon! ```bash huggingface-cli download microsoft/Phi-3-vision-128k-instruct-onnx-cuda --include cuda-int4-rtn-block-32/* --local-dir . ``` + This command downloads the model into a folder called `cuda-int4-rtn-block-32`. 2. Setup your CUDA environment @@ -74,15 +72,13 @@ Support for Windows machines with GPUs other than NVIDIA is coming soon! * CUDA 11 ```bash - pip install numpy - pip install --pre onnxruntime-genai-cuda --index-url=https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-genai/pypi/simple/ + pip install onnxruntime-genai-cuda --index-url=https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-cuda-11/pypi/simple/ ``` * CUDA 12 ```bash - pip install numpy - pip install onnxruntime-genai-cuda --pre --index-url=https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-cuda-12/pypi/simple/ + pip install onnxruntime-genai-cuda ``` 4. Run the model @@ -91,6 +87,7 @@ Support for Windows machines with GPUs other than NVIDIA is coming soon! ```bash curl https://raw.githubusercontent.com/microsoft/onnxruntime-genai/main/examples/python/phi3v.py -o phi3v.py + pip install pyreadline3 python phi3v.py -m cuda-int4-rtn-block-32 ``` @@ -117,9 +114,8 @@ Support for Windows machines with GPUs other than NVIDIA is coming soon! 2. Install the generate() API for CPU - ``` - pip install numpy - pip install --pre onnxruntime-genai + ```bash + pip install onnxruntime-genai ``` 3. Run the model @@ -128,6 +124,7 @@ Support for Windows machines with GPUs other than NVIDIA is coming soon! ```bash curl https://raw.githubusercontent.com/microsoft/onnxruntime-genai/main/examples/python/phi3v.py -o phi3v.py + pip install pyreadline3 python phi3v.py -m cpu-int4-rtn-block-32-acc-level-4 ``` @@ -152,3 +149,42 @@ Support for Windows machines with GPUs other than NVIDIA is coming soon! The products include Chocolade, Gummibarchen, Scottish Longbreads, Sir Rodney's Scones, Tarte au sucre, and Chocolate Biscuits. The Grand Total column sums up the sales for each product across the two quarters. ``` + +## Run with DirectML + +1. Download the model + + ```bash + huggingface-cli download microsoft/Phi-3-vision-128k-instruct-onnx-directml --include directml-int4-rtn-block-32/* --local-dir . + ``` + + This command downloads the model into a folder called `directml-int4-rtn-block-32`. + +2. Install the generate() API + + ```bash + pip install onnxruntime-genai-directml + ``` + +3. Run the model + + Run the model with [phi3v.py](https://github.com/microsoft/onnxruntime-genai/blob/main/examples/python/phi3v.py). + + ```bash + curl https://raw.githubusercontent.com/microsoft/onnxruntime-genai/main/examples/python/phi3v.py -o phi3v.py + pip install pyreadline3 + python phi3v.py -m directml-int4-rtn-block-32 + ``` + + Enter the path to an image file and a prompt. The model uses the image and prompt to give you an answer. + + For example: `What does the sign say?` + + ![coffee](../../../images/nashville.jpg) + + ``` + The sign says 'DO NOT ENTER'. 
+ ``` + + + diff --git a/docs/get-started/with-python.md b/docs/get-started/with-python.md index c89d92e4ad432..7ff3d1048c58d 100644 --- a/docs/get-started/with-python.md +++ b/docs/get-started/with-python.md @@ -22,26 +22,26 @@ There are two Python packages for ONNX Runtime. Only one of these packages shoul ### Install ONNX Runtime CPU -Use the CPU package if you are running on Arm CPUs and/or macOS. +Use the CPU package if you are running on Arm®-based CPUs and/or macOS. ```bash pip install onnxruntime ``` -### Install ONNX Runtime GPU (CUDA 11.x) +### Install ONNX Runtime GPU (CUDA 12.x) -The default CUDA version for ORT is 11.8. +The default CUDA version for ORT is 12.x. ```bash pip install onnxruntime-gpu ``` -### Install ONNX Runtime GPU (CUDA 12.x) +### Install ONNX Runtime GPU (CUDA 11.8) -For Cuda 12.x, please use the following instructions to install from [ORT Azure Devops Feed](https://aiinfra.visualstudio.com/PublicPackages/_artifacts/feed/onnxruntime-cuda-12/PyPI/onnxruntime-gpu/overview) +For Cuda 11.8, please use the following instructions to install from [ORT Azure Devops Feed](https://aiinfra.visualstudio.com/PublicPackages/_artifacts/feed/onnxruntime-cuda-11/PyPI/onnxruntime-gpu/overview) ```bash -pip install onnxruntime-gpu --extra-index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-cuda-12/pypi/simple/ +pip install onnxruntime-gpu --extra-index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-cuda-11/pypi/simple/ ``` ## Install ONNX for model export @@ -260,8 +260,8 @@ If using pip, run `pip install --upgrade pip` prior to downloading. |[onnxruntime](https://pypi.org/project/onnxruntime)|CPU (Release)| Windows (x64), Linux (x64, ARM64), Mac (X64), | |[ort-nightly](https://aiinfra.visualstudio.com/PublicPackages/_artifacts/feed/ORT-Nightly/PyPI/ort-nightly)|CPU (Dev) | Same as above | |[onnxruntime-gpu](https://pypi.org/project/onnxruntime-gpu)|GPU (Release)| Windows (x64), Linux (x64, ARM64) | -|[ort-nightly-gpu for CUDA 11.*](https://aiinfra.visualstudio.com/PublicPackages/_artifacts/feed/ORT-Nightly/PyPI/ort-nightly-gpu) |GPU (Dev) | Windows (x64), Linux (x64, ARM64) | -|[ort-nightly-gpu for CUDA 12.*](https://aiinfra.visualstudio.com/PublicPackages/_artifacts/feed/ort-cuda-12-nightly/PyPI/ort-nightly-gpu) |GPU (Dev) | Windows (x64), Linux (x64, ARM64) | +|[ort-nightly-gpu for CUDA 11.*](https://aiinfra.visualstudio.com/PublicPackages/_artifacts/feed/ort-cuda-11-nightly/PyPI/ort-nightly-gpu) |GPU (Dev) | Windows (x64), Linux (x64, ARM64) | +|[ort-nightly-gpu for CUDA 12.*](https://aiinfra.visualstudio.com/PublicPackages/_artifacts/feed/ORT-Nightly/PyPI/ort-nightly-gpu) |GPU (Dev) | Windows (x64), Linux (x64, ARM64) | Before installing nightly package, you will need install dependencies first. 
``` @@ -270,12 +270,12 @@ python -m pip install coloredlogs flatbuffers numpy packaging protobuf sympy Example to install ort-nightly-gpu for CUDA 11.*: ``` -python -m pip install ort-nightly-gpu --index-url=https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple/ +python -m pip install ort-nightly-gpu --index-url=https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ort-cuda-11-nightly/pypi/simple/ ``` Example to install ort-nightly-gpu for CUDA 12.*: ``` -python -m pip install ort-nightly-gpu --index-url=https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ort-cuda-12-nightly/pypi/simple/ +python -m pip install ort-nightly-gpu --index-url=https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/ORT-Nightly/pypi/simple/ ``` For Python compiler version notes, see [this page](https://github.com/microsoft/onnxruntime/tree/main/docs/Python_Dev_Notes.md) diff --git a/docs/install/index.md b/docs/install/index.md index d9e14b1609697..60057a88215bb 100644 --- a/docs/install/index.md +++ b/docs/install/index.md @@ -46,25 +46,29 @@ For ONNX Runtime GPU package, it is required to install [CUDA](https://developer pip install onnxruntime ``` -#### Install ONNX Runtime GPU (CUDA 11.x) -The default CUDA version for ORT is 11.8. +#### Install ONNX Runtime GPU (CUDA 12.x) +The default CUDA version for [onnxruntime-gpu in pypi](https://pypi.org/project/onnxruntime-gpu) is 12.x since 1.19.0. ```bash pip install onnxruntime-gpu ``` -#### Install ONNX Runtime GPU (CUDA 12.x) -For Cuda 12.x, please use the following instructions to install from [ORT Azure Devops Feed](https://aiinfra.visualstudio.com/PublicPackages/_artifacts/feed/onnxruntime-cuda-12/PyPI/onnxruntime-gpu/overview) +For previous versions, you can download here: [1.18.1](https://aiinfra.visualstudio.com/PublicPackages/_artifacts/feed/onnxruntime-cuda-12/PyPI/onnxruntime-gpu/overview/1.18.1), [1.18.0](https://aiinfra.visualstudio.com/PublicPackages/_artifacts/feed/onnxruntime-cuda-12/PyPI/onnxruntime-gpu/overview/1.18.0) + + +#### Install ONNX Runtime GPU (CUDA 11.x) +For Cuda 11.x, please use the following instructions to install from [ORT Azure Devops Feed](https://aiinfra.visualstudio.com/PublicPackages/_artifacts/feed/onnxruntime-cuda-11/PyPI/onnxruntime-gpu/overview) for 1.19.2 or later. ```bash -pip install onnxruntime-gpu --extra-index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-cuda-12/pypi/simple/ +pip install onnxruntime-gpu --extra-index-url https://aiinfra.pkgs.visualstudio.com/PublicPackages/_packaging/onnxruntime-cuda-11/pypi/simple/ ``` -#### Install ONNX Runtime GPU (ROCm) -For ROCm, please follow instructions to install it at the [AMD ROCm install docs](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.0.0/). The ROCm execution provider for ONNX Runtime is built and tested with ROCm 6.0.0 +For previous versions, you can download here: [1.18.1](https://pypi.org/project/onnxruntime-gpu/1.18.1/), [1.18.0](https://pypi.org/project/onnxruntime-gpu/1.18.0/) -To build from source on Linux, follow the instructions [here](https://onnxruntime.ai/docs/build/eps.html#amd-rocm). Alternatively, each major ORT release has a corresponding C/C++ ROCm package, found [here](https://github.com/microsoft/onnxruntime/releases/). +#### Install ONNX Runtime GPU (ROCm) +For ROCm, please follow instructions to install it at the [AMD ROCm install docs](https://rocm.docs.amd.com/projects/install-on-linux/en/docs-6.0.0/). 
The ROCm execution provider for ONNX Runtime is built and tested with ROCm 6.0.0. +To build from source on Linux, follow the instructions [here](https://onnxruntime.ai/docs/build/eps.html#amd-rocm). ### Install ONNX to export the model @@ -94,16 +98,16 @@ pip install skl2onnx dotnet add package Microsoft.ML.OnnxRuntime ``` -#### Install ONNX Runtime GPU (CUDA 11.x) +#### Install ONNX Runtime GPU (CUDA 12.x) -The default CUDA version for ORT is 11.8 +The default CUDA version for ORT is 12.x ```bash # GPU dotnet add package Microsoft.ML.OnnxRuntime.Gpu ``` -#### Install ONNX Runtime GPU (CUDA 12.x) +#### Install ONNX Runtime GPU (CUDA 11.8) 1. Project Setup @@ -116,8 +120,8 @@ a nuget.config file to your project in the same directory as your .csproj file. - + ``` @@ -405,8 +409,8 @@ below: |--------------|---------------------------------------------------------------------------------------------------------------------------------------------------|-----------------------------------------------------------------------------------------------------------------------------------------------|------------------------------------------------------------------------------------------------| | Python | If using pip, run `pip install --upgrade pip` prior to downloading. | | | | | CPU: [**onnxruntime**](https://pypi.org/project/onnxruntime) | [ort-nightly (dev)](https://aiinfra.visualstudio.com/PublicPackages/_artifacts/feed/ORT-Nightly/PyPI/ort-nightly/overview) | | -| | GPU (CUDA/TensorRT) for CUDA 11.x: [**onnxruntime-gpu**](https://pypi.org/project/onnxruntime-gpu) | [ort-nightly-gpu (dev)](https://aiinfra.visualstudio.com/PublicPackages/_artifacts/feed/ORT-Nightly/PyPI/ort-nightly-gpu/overview/) | [View](../execution-providers/CUDA-ExecutionProvider.md#requirements) | -| | GPU (CUDA/TensorRT) for CUDA 12.x: [**onnxruntime-gpu**](https://aiinfra.visualstudio.com/PublicPackages/_artifacts/feed/onnxruntime-cuda-12/PyPI/onnxruntime-gpu/overview/) | [ort-nightly-gpu (dev)](https://aiinfra.visualstudio.com/PublicPackages/_artifacts/feed/ort-cuda-12-nightly/PyPI/ort-nightly-gpu/overview/) | [View](../execution-providers/CUDA-ExecutionProvider.md#requirements) | +| | GPU (CUDA/TensorRT) for CUDA 12.x: [**onnxruntime-gpu**](https://pypi.org/project/onnxruntime-gpu) | [ort-nightly-gpu (dev)](https://aiinfra.visualstudio.com/PublicPackages/_artifacts/feed/ORT-Nightly/PyPI/ort-nightly-gpu/overview/) | [View](../execution-providers/CUDA-ExecutionProvider.md#requirements) | +| | GPU (CUDA/TensorRT) for CUDA 11.x: [**onnxruntime-gpu**](https://aiinfra.visualstudio.com/PublicPackages/_artifacts/feed/onnxruntime-cuda-11/PyPI/onnxruntime-gpu/overview/) | [ort-nightly-gpu (dev)](https://aiinfra.visualstudio.com/PublicPackages/_artifacts/feed/ort-cuda-11-nightly/PyPI/ort-nightly-gpu/overview/) | [View](../execution-providers/CUDA-ExecutionProvider.md#requirements) | | | GPU (DirectML): [**onnxruntime-directml**](https://pypi.org/project/onnxruntime-directml/) | [ort-nightly-directml (dev)](https://aiinfra.visualstudio.com/PublicPackages/_artifacts/feed/ORT-Nightly/PyPI/ort-nightly-directml/overview/) | [View](../execution-providers/DirectML-ExecutionProvider.md#requirements) | | | OpenVINO: [**intel/onnxruntime**](https://github.com/intel/onnxruntime/releases/latest) - *Intel managed* | | [View](../build/eps.md#openvino) | | | TensorRT (Jetson): [**Jetson Zoo**](https://elinux.org/Jetson_Zoo#ONNX_Runtime) - *NVIDIA managed* | | | diff --git a/docs/performance/model-optimizations/float16.md 
b/docs/performance/model-optimizations/float16.md index 972f5fe516f6b..a0335ccbac70f 100644 --- a/docs/performance/model-optimizations/float16.md +++ b/docs/performance/model-optimizations/float16.md @@ -62,7 +62,9 @@ from onnxconverter_common import auto_mixed_precision import onnx model = onnx.load("path/to/model.onnx") -model_fp16 = auto_convert_mixed_precision(model, test_data, rtol=0.01, atol=0.001, keep_io_types=True) +# Assuming x is the input to the model +feed_dict = {'input': x.numpy()} +model_fp16 = auto_convert_mixed_precision(model, feed_dict, rtol=0.01, atol=0.001, keep_io_types=True) onnx.save(model_fp16, "path/to/model_fp16.onnx") ``` @@ -73,6 +75,7 @@ auto_convert_mixed_precision(model, feed_dict, validate_fn=None, rtol=None, atol ``` - `model`: The ONNX model to convert. +- `feed_dict`: Test data used to measure the accuracy of the model during conversion. Format is similar to InferenceSession.run (map of input names to values) - `validate_fn`: A function accepting two lists of numpy arrays (the outputs of the float32 model and the mixed-precision model, respectively) that returns `True` if the results are sufficiently close and `False` otherwise. Can be used instead of or in addition to `rtol` and `atol`. - `rtol`, `atol`: Absolute and relative tolerances used for validation. See [numpy.allclose](https://numpy.org/doc/stable/reference/generated/numpy.allclose.html) for more information. - `keep_io_types`: Whether model inputs/outputs should be left as float32. diff --git a/docs/performance/model-optimizations/quantization.md b/docs/performance/model-optimizations/quantization.md index c769b0889fa23..ae49e591d94ca 100644 --- a/docs/performance/model-optimizations/quantization.md +++ b/docs/performance/model-optimizations/quantization.md @@ -202,7 +202,7 @@ ONNX Runtime quantization on GPU only supports S8S8. On x86-64 machines with AVX2 and AVX512 extensions, ONNX Runtime uses the VPMADDUBSW instruction for U8S8 for performance. This instruction might suffer from saturation issues: it can happen that the output does not fit into a 16-bit integer and has to be clamped (saturated) to fit. Generally, this is not a big issue for the final result. However, if you do encounter a large accuracy drop, it may be caused by saturation. In this case, you can either try [reduce_range](https://github.com/microsoft/onnxruntime/blob/main/onnxruntime/python/tools/quantization/quantize.py) or the U8U8 format which doesn't have saturation issues. -There is no such issue on other CPU architectures (x64 with VNNI and ARM). +There is no such issue on other CPU architectures (x64 with VNNI and Arm®). ### List of Supported Quantized Ops {: .no_toc} @@ -231,13 +231,66 @@ ONNX Runtime leverages the TensorRT Execution Provider for quantization on GPU n We provide two end-to end examples: [Yolo V3](https://github.com/microsoft/onnxruntime-inference-examples/tree/main/quantization/object_detection/trt/yolov3) and [resnet50](https://github.com/microsoft/onnxruntime-inference-examples/tree/main/quantization/image_classification/trt/resnet50). +## Quantize to Int4/UInt4 + +ONNX Runtime can quantize certain operators in a model to 4 bit integer types. Block-wise weight-only quantizaiton is applied to the operators. The supported op types are: +- [MatMul](https://github.com/onnx/onnx/blob/main/docs/Operators.md#matmul): + - The node is quantized only if the input `B` is constant + - support QOperator or QDQ format. 
+ - If QOperator is selected, the node is converted to a [MatMulNBits](https://github.com/microsoft/onnxruntime/blob/main/docs/ContribOperators.md#commicrosoftmatmulnbits) node. Weight `B` is blockwise quantized and saved in the new node. [HQQ](https://arxiv.org/pdf/2309.15531.pdf), [GPTQ](https://huggingface.co/docs/transformers/main/en/quantization/gptq) and RTN (default) algorithms are supported. + - If QDQ is selected, the MatMul node is replaced by a DequantizeLinear -> MatMul pair. Weight `B` is blockwise quantized and saved in the DequantizeLinear node as an initializer. +- [Gather](https://github.com/onnx/onnx/blob/main/docs/Operators.md#Gather): + - The node is quantized only if the input `data` is constant. + - support QOperator + - Gather is quantized to a [GatherBlockQuantized](https://github.com/microsoft/onnxruntime/blob/main/docs/ContribOperators.md#commicrosoftgatherblockquantized) node. Input `data` is blockwise quantized and saved in the new node. Only support RTN algorithm. + +Since Int4/UInt4 types are introduced in [onnx opset 21](https://github.com/onnx/onnx/releases/tag/v1.16.0), if the model's onnx domain version is < 21, it is force upgraded to opset 21. Please make sure the operators in the model are compatible with onnx opset 21. + +To run a model that has GatherBlockQuantized nodes, ONNX Runtime 1.20 is needed. + +Code Examples: + +```python +from onnxruntime.quantization import ( + matmul_4bits_quantizer, + quant_utils, + quantize +) +from pathlib import Path + +model_fp32_path="path/to/orignal/model.onnx" +model_int4_path="path/to/save/quantized/model.onnx" + +quant_config = matmul_4bits_quantizer.DefaultWeightOnlyQuantConfig( + block_size=128, # 2's exponential and >= 16 + is_symmetric=True, # if true, quantize to Int4. otherwsie, quantize to uint4. + accuracy_level=4, # used by MatMulNbits, see https://github.com/microsoft/onnxruntime/blob/main/docs/ContribOperators.md#attributes-35 + quant_format=quant_utils.QuantFormat.QOperator, + op_types_to_quantize=("MatMul","Gather"), # specify which op types to quantize + quant_axes=(("MatMul", 0), ("Gather", 1),) # specify which axis to quantize for an op type. + +model = quant_utils.load_model_with_shape_infer(Path(model_fp32_path)) +quant = matmul_4bits_quantizer.MatMul4BitsQuantizer( + model, + nodes_to_exclude=None, # specify a list of nodes to exclude from quantizaiton + nodes_to_include=None, # specify a list of nodes to force include from quantization + algo_config=quant_config,) +quant.process() +quant.model.save_model_to_file( + model_int4_path, + True) # save data to external file + +``` + +For AWQ and GTPQ quantization usage, please refer to [Gen-AI model builder](https://github.com/microsoft/onnxruntime-genai/tree/main/src/python/py/models#quantized-pytorch-model). + ## FAQ ### Why am I not seeing performance improvements? {: .no_toc } The performance improvement depends on your model and hardware. The performance gain from quantization has two aspects: compute and memory. Old hardware has none or few of the instructions needed to perform efficient inference in int8. And quantization has overhead (from quantizing and dequantizing), so it is not rare to get worse performance on old devices. -x86-64 with VNNI, GPU with Tensor Core int8 support and ARM with dot-product instructions can get better performance in general. +x86-64 with VNNI, GPU with Tensor Core int8 support and Arm®-based processors with dot-product instructions can get better performance in general. 
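A quick way to see what quantization buys you on your own hardware is to time the float32 and quantized models side by side. The sketch below is illustrative only: the model file names, the input name `input`, and the input shape are placeholders for your own model.

```python
# Rough latency comparison between an FP32 model and its quantized counterpart.
# Paths, the input name, and the input shape are placeholders; adjust for your model.
import time
import numpy as np
import onnxruntime as ort

def avg_latency_ms(model_path, feed, runs=50):
    sess = ort.InferenceSession(model_path)
    for _ in range(5):                      # warm-up runs
        sess.run(None, feed)
    start = time.perf_counter()
    for _ in range(runs):
        sess.run(None, feed)
    return (time.perf_counter() - start) / runs * 1000

x = np.random.rand(1, 3, 224, 224).astype(np.float32)   # assumed input shape
feed = {"input": x}                                      # assumed input name
print("fp32 :", avg_latency_ms("model_fp32.onnx", feed), "ms")
print("int8 :", avg_latency_ms("model_int8.onnx", feed), "ms")
```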
### Which quantization method should I choose, dynamic or static? {: .no_toc} diff --git a/docs/tutorials/csharp/stable-diffusion-csharp.md b/docs/tutorials/csharp/stable-diffusion-csharp.md index 588fb18e70436..5ba5ec6ea6bfe 100644 --- a/docs/tutorials/csharp/stable-diffusion-csharp.md +++ b/docs/tutorials/csharp/stable-diffusion-csharp.md @@ -52,8 +52,6 @@ To run in the cloud with Azure Machine Learning: The Hugging Face site has a great library of open source models. We will leverage and download the [ONNX Stable Diffusion models from Hugging Face](https://huggingface.co/models?sort=downloads&search=Stable+Diffusion). - [Stable Diffusion Models v1.4](https://huggingface.co/CompVis/stable-diffusion-v1-4/tree/onnx) - - [Stable Diffusion Models v1.5](https://huggingface.co/runwayml/stable-diffusion-v1-5/tree/onnx) - Once you have selected a model version repo, click `Files and Versions`, then select the `ONNX` branch. If there isn't an ONNX model branch available, use the `main` branch and convert it to ONNX. See the [ONNX conversion tutorial for PyTorch](https://learn.microsoft.com/windows/ai/windows-ml/tutorials/pytorch-convert-model) for more information. diff --git a/docs/tutorials/mobile/pose-detection.md b/docs/tutorials/mobile/pose-detection.md index ad4296aa64603..248d06889550a 100644 --- a/docs/tutorials/mobile/pose-detection.md +++ b/docs/tutorials/mobile/pose-detection.md @@ -19,7 +19,7 @@ Learn how to build and run ONNX models on mobile with built-in pre and post proc ## Object detection with YOLOv8 -You can find the full source code for the [Android](https://github.com/microsoft/ app in the ONNX Runtime inference examples repository. +You can find the full source code for the [Android](https://github.com/microsoft/) app in the ONNX Runtime inference examples repository. ### Build the ONNX model with built-in pre and post processing diff --git a/docs/tutorials/on-device-training/android-app.md b/docs/tutorials/on-device-training/android-app.md index b9b0ae49c7bec..ab528a5a1c1ad 100644 --- a/docs/tutorials/on-device-training/android-app.md +++ b/docs/tutorials/on-device-training/android-app.md @@ -7,15 +7,15 @@ nav_order: 1 --- # On-Device Training: Building an Android Application - +{: .no_toc } In this tutorial, we will explore how to build an Android application that incorporates ONNX Runtime's On-Device Training solution. On-device training refers to the process of training a machine learning model directly on an edge device without relying on cloud services or external servers. Here is what the application will look like at the end of this tutorial: - +an image classification app with Tom Cruise in the middle. ## Introduction - +{: .no_toc } We will guide you through the steps to create an Android app that can train a simple image classification model using on-device training techniques. This tutorial showcases the `transfer learning` technique where knowledge gained from training a model on one task is leveraged to improve the performance of a model on a different but related task. Instead of starting the learning process from scratch, transfer learning allows us to transfer the knowledge or features learned by a pre-trained model to a new task. For this tutorial, we will leverage the `MobileNetV2` model which has been trained on large-scale image datasets such as ImageNet (which has 1,000 classes). We will use this model for classifying custom data into one of four classes. 
The initial layers of MobileNetV2 serve as a feature extractor, capturing generic visual features applicable to various tasks, and only the final classifier layer will be trained for the task at hand. @@ -24,26 +24,10 @@ In this tutorial, we will use data to learn to: - Classify animals into one of four categories using a pre-packed animals dataset. - Classify celebrities into one of four categories using a custom celebrities dataset. -## Contents - -- [Introduction](#introduction) -- [Prerequisites](#prerequisites) -- [Offline Phase - Building the training artifacts](#offline-phase---building-the-training-artifacts) - - [Export the model to ONNX](#op1) - - [Define the trainable and non trainable parameters](#op2) - - [Generate the training artifacts](#op3) -- [Training Phase - Android application development](#training-phase---android-application-development) - - [Setting up the project in Android Studio](#tp1) - - [Adding the ONNX Runtime dependency](#tp2) - - [Packaging the Prebuilt Training Artifacts and Dataset](#tp3) - - [Interfacing with ONNX Runtime - C++ Code](#tp4) - - [Image Preprocessing](#tp5) - - [Application Frontend](#tp6) -- [Training Phase - Running the application on a device](#training-phase---running-the-application-on-a-device) - - [Running the application on a device](#tp7) - - [Training with a pre-loaded dataset - Animals](#tp8) - - [Training with a custom dataset - Celebrities](#tp9) -- [Conclusion](#conclusion) + +## Table of Contents +* TOC placeholder +{:toc} ## Prerequisites @@ -791,7 +775,7 @@ To follow this tutorial, you should have a basic understanding of Android app de b. Launching the application on the device should look like this: - + Barebones ORT Personalize app 2. Training with a pre-loaded dataset - Animals @@ -805,7 +789,7 @@ To follow this tutorial, you should have a basic understanding of Android app de e. Use any animal image from your library for inferencing now. - + ORT Personalize app with an image of a cow As can be seen from the image above, the model correctly predicted `Cow`. @@ -825,7 +809,7 @@ To follow this tutorial, you should have a basic understanding of Android app de g. That's it!. Hopefully the application classified the image correctly. - + an image classification app with Tom Cruise in the middle. ## Conclusion diff --git a/docs/tutorials/on-device-training/ios-app.md b/docs/tutorials/on-device-training/ios-app.md index fff1347923ef0..e61bab68596ff 100644 --- a/docs/tutorials/on-device-training/ios-app.md +++ b/docs/tutorials/on-device-training/ios-app.md @@ -7,7 +7,7 @@ nav_order: 2 --- # Building an iOS Application - +{: .no_toc } In this tutorial, we will explore how to build an iOS application that incorporates ONNX Runtime's On-Device Training solution. On-device training refers to the process of training a machine learning model directly on an edge device without relying on cloud services or external servers. In this tutorial, we will build a simple speaker identification app that learns to identify a speaker's voice. We will take a look at how to train a model on-device, export the trained model, and use the trained model to perform inference. @@ -18,6 +18,7 @@ Here is what the application will look like: application demo, with buttons for voice, train, and infer. ## Introduction +{: .no_toc } We will guide you through the process of building an iOS application that can train a simple audio classification model using on-device training techniques. 
The tutorial showcases the `transfer learning` technique where knowledge gained from training a model on one task is leveraged to improve the performance of a model on a different but related task. Instead of starting the learning process from scratch, transfer learning allows us to transfer the knowledge or features learned by a pre-trained model to a new task. In this tutorial, we will leverage the [`wav2vec`](https://huggingface.co/superb/wav2vec2-base-superb-sid) model which has been trained on large-scale celebrity speech data such as `VoxCeleb1`. We will use the pre-trained model to extract features from the audio data and train a binary classifier to identify the speaker. The initial layers of the model serve as a feature extractor, capturing the important features of the audio data. Only the last layer of the model is trained to perform the classification task. @@ -29,23 +30,9 @@ In the tutorial, we will: - Use the exported model to perform inference -## Contents -- [Building an iOS Application](#building-an-ios-application) - - [Introduction](#introduction) - - [Contents](#contents) - - [Prerequisites](#prerequisites) - - [Generating the training artifacts](#generating-the-training-artifacts) - - [Building the iOS application](#building-the-ios-application) - - [Xcode Setup](#xcode-setup) - - [Application Overview](#application-overview) - - [Training the model](#training-the-model) - - [Inference with the trained model](#inference-with-the-trained-model) - - [Recording Audio](#recording-audio) - - [Train View](#train-view) - - [Infer View](#infer-view) - - [ContentView](#contentview) - - [Running the iOS application](#running-the-ios-application) - - [Conclusion](#conclusion) +## Table of Contents +* TOC placeholder +{:toc} ## Prerequisites @@ -947,27 +934,27 @@ Now, we are ready to run the application. You can run the application on the sim a. Now, when you run the application, you should see the following screen: - +My Voice application with Train and Infer buttons b. Next, click on the `Train` button to navigate to the `TrainView`. The `TrainView` will prompt you to record your voice. You will need to record your voice `kNumRecordings` times. - +My Voice application with words to record c. Once all the recordings are complete, the application will train the model on the given data. You will see the progress bar indicating the progress of the training. - +Loading bar while the app is training d. Once the training is complete, you will see the following screen: - +The app informs you training finished successfully! e. Now, click on the `Infer` button to navigate to the `InferView`. The `InferView` will prompt you to record your voice. Once the recording is complete, it will perform inference with the trained model and display the result of the inference. - +My Voice application allows you to record and infer whether it's you or not. That's it! Hopefully, it identified your voice correctly. diff --git a/docs/tutorials/web/ep-webnn.md b/docs/tutorials/web/ep-webnn.md index fe1c1d729daf0..f04dd7870d7cb 100644 --- a/docs/tutorials/web/ep-webnn.md +++ b/docs/tutorials/web/ep-webnn.md @@ -74,59 +74,59 @@ To use WebNN EP, you just need to make 3 small changes: WebNN API and WebNN EP are in actively development, you might consider installing the latest nightly build version of ONNX Runtime Web (onnxruntime-web@dev) to benefit from the latest features and improvements. 
-## Keep tensor data on WebNN MLBuffer (IO binding) +## Keep tensor data on WebNN MLTensor (IO binding) -By default, a model's inputs and outputs are tensors that hold data in CPU memory. When you run a session with WebNN EP with 'gpu' or 'npu' device type, the data is copied to GPU or NPU memory, and the results are copied back to CPU memory. Memory copy between different devices as well as different sessions will bring much overhead to the inference time, WebNN provides a new opaque device-specific storage type MLBuffer to address this issue. -If you get your input data from a MLBuffer, or you want to keep the output data on MLBuffer for further processing, you can use IO binding to keep the data on MLBuffer. This will be especially helpful when running transformer based models, which usually runs a single model multiple times with previous output as the next input. +By default, a model's inputs and outputs are tensors that hold data in CPU memory. When you run a session with WebNN EP with 'gpu' or 'npu' device type, the data is copied to GPU or NPU memory, and the results are copied back to CPU memory. Memory copy between different devices as well as different sessions will bring much overhead to the inference time, WebNN provides a new opaque device-specific storage type MLTensor to address this issue. +If you get your input data from a MLTensor, or you want to keep the output data on MLTensor for further processing, you can use IO binding to keep the data on MLTensor. This will be especially helpful when running transformer based models, which usually runs a single model multiple times with previous output as the next input. -For model input, if your input data is a WebNN storage MLBuffer, you can [create a MLBuffer tensor and use it as input tensor](#create-input-tensor-from-a-mlbuffer). +For model input, if your input data is a WebNN storage MLTensor, you can [create a MLTensor tensor and use it as input tensor](#create-input-tensor-from-a-mltensor). For model output, there are 2 ways to use the IO binding feature: -- [Use pre-allocated MLBuffer tensors](#use-pre-allocated-mlbuffer-tensors) +- [Use pre-allocated MLTensor tensors](#use-pre-allocated-mltensor-tensors) - [Specify the output data location](#specify-the-output-data-location) Please also check the following topic: -- [MLBuffer tensor life cycle management](#mlbuffer-tensor-life-cycle-management) +- [MLTensor tensor life cycle management](#mltensor-tensor-life-cycle-management) -**Note:** The MLBuffer necessitates a shared MLContext for IO binding. This implies that the MLContext should be pre-created as a WebNN EP option and utilized across all sessions. +**Note:** The MLTensor necessitates a shared MLContext for IO binding. This implies that the MLContext should be pre-created as a WebNN EP option and utilized across all sessions. 
-### Create input tensor from a MLBuffer +### Create input tensor from a MLTensor -If your input data is a WebNN storage MLBuffer, you can create a MLBuffer tensor and use it as input tensor: +If your input data is a WebNN storage MLTensor, you can create a MLTensor tensor and use it as input tensor: ```js const mlContext = await navigator.ml.createContext({deviceType, ...}); -const inputMLBuffer = await mlContext.createBuffer({ +const inputMLTensor = await mlContext.createTensor({ dataType: 'float32', dimensions: [1, 3, 224, 224], - usage: MLBufferUsage.WRITE_TO, + usage: MLTensorUsage.WRITE, }); -mlContext.writeBuffer(mlBuffer, inputArrayBuffer); -const inputTensor = ort.Tensor.fromMLBuffer(mlBuffer, { +mlContext.writeTensor(inputMLTensor, inputArrayBuffer); +const inputTensor = ort.Tensor.fromMLTensor(inputMLTensor, { dataType: 'float32', dims: [1, 3, 224, 224] }); ``` -Use this tensor as model inputs(feeds) so that the input data will be kept on MLBuffer. +Use this tensor as model inputs(feeds) so that the input data will be kept on MLTensor. -### Use pre-allocated MLBuffer tensors +### Use pre-allocated MLTensor tensors -If you know the output shape in advance, you can create a MLBuffer tensor and use it as output tensor: +If you know the output shape in advance, you can create a MLTensor tensor and use it as output tensor: ```js -// Create a pre-allocated buffer and the corresponding tensor. Assuming that the output shape is [10, 1000]. +// Create a pre-allocated MLTensor and the corresponding ORT tensor. Assuming that the output shape is [10, 1000]. const mlContext = await navigator.ml.createContext({deviceType, ...}); -const myPreAllocatedBuffer = await mlContext.createBuffer({ +const myPreAllocatedMLTensor = await mlContext.createTensor({ dataType: 'float32', dimensions: [10, 1000], - usage: MLBufferUsage.READ_FROM, + usage: MLTensorUsage.READ, }); -const myPreAllocatedOutputTensor = ort.Tensor.fromMLBuffer(myPreAllocatedBuffer, { +const myPreAllocatedOutputTensor = ort.Tensor.fromMLTensor(myPreAllocatedMLTensor, { dataType: 'float32', dims: [10, 1000] }); @@ -140,17 +140,17 @@ const results = await mySession.run(feeds, fetches); ``` -By specifying the output tensor in the fetches, ONNX Runtime Web will use the pre-allocated buffer as the output buffer. If there is a shape mismatch, the `run()` call will fail. +By specifying the output tensor in the fetches, ONNX Runtime Web will use the pre-allocated MLTensor as the output tensor. If there is a shape mismatch, the `run()` call will fail. ### Specify the output data location -If you don't want to use pre-allocated MLBuffer tensors for outputs, you can also specify the output data location in the session options: +If you don't want to use pre-allocated MLTensor tensors for outputs, you can also specify the output data location in the session options: ```js const mySessionOptions1 = { ..., - // keep all output data on MLBuffer - preferredOutputLocation: 'ml-buffer' + // keep all output data on MLTensor + preferredOutputLocation: 'ml-tensor' }; const mySessionOptions2 = { @@ -158,7 +158,7 @@ const mySessionOptions2 = { // alternatively, you can specify the output location for each output tensor preferredOutputLocation: { 'output_0': 'cpu', // keep output_0 on CPU. This is the default behavior. 
- 'output_1': 'ml-buffer' // keep output_1 on MLBuffer buffer + 'output_1': 'ml-tensor' // keep output_1 on MLTensor tensor } }; ``` @@ -169,18 +169,18 @@ See [API reference: preferredOutputLocation](https://onnxruntime.ai/docs/api/js/ ## Notes -### MLBuffer tensor life cycle management +### MLTensor tensor life cycle management -It is important to understand how the underlying MLBuffer is managed so that you can avoid memory leaks and improve buffer usage efficiency. +It is important to understand how the underlying MLTensor is managed so that you can avoid memory leaks and improve tensor usage efficiency. -A MLBuffer tensor is created either by user code or by ONNX Runtime Web as model's output. -- When it is created by user code, it is always created with an existing MLBuffer using `Tensor.fromMLBuffer()`. In this case, the tensor does not "own" the MLBuffer. +A MLTensor tensor is created either by user code or by ONNX Runtime Web as model's output. +- When it is created by user code, it is always created with an existing MLTensor using `Tensor.fromMLTensor()`. In this case, the tensor does not "own" the MLTensor. - - It is user's responsibility to make sure the underlying buffer is valid during the inference, and call `mlBuffer.destroy()` to dispose the buffer when it is no longer needed. - - Avoid calling `tensor.getData()` and `tensor.dispose()`. Use the MLBuffer directly. - - Using a MLBuffer tensor with a destroyed MLBuffer will cause the session run to fail. -- When it is created by ONNX Runtime Web as model's output (not a pre-allocated MLBuffer tensor), the tensor "owns" the buffer. + - It is user's responsibility to make sure the underlying MLTensor is valid during the inference, and call `mlTensor.destroy()` to dispose the MLTensor when it is no longer needed. + - Avoid calling `tensor.getData()` and `tensor.dispose()`. Use the MLTensor tensor directly. + - Using a MLTensor tensor with a destroyed MLTensor will cause the session run to fail. +- When it is created by ONNX Runtime Web as model's output (not a pre-allocated MLTensor tensor), the tensor "owns" the MLTensor. - - You don't need to worry about the case that the buffer is destroyed before the tensor is used. - - Call `tensor.getData()` to download the data from the MLBuffer to CPU and get the data as a typed array. - - Call `tensor.dispose()` explicitly to destroy the underlying MLBuffer when it is no longer needed. + - You don't need to worry about the case that the MLTensor is destroyed before the tensor is used. + - Call `tensor.getData()` to download the data from the MLTensor to CPU and get the data as a typed array. + - Call `tensor.dispose()` explicitly to destroy the underlying MLTensor when it is no longer needed. 
diff --git a/images/EP_context_node.png b/images/EP_context_node.png new file mode 100644 index 0000000000000..953bcf353558a Binary files /dev/null and b/images/EP_context_node.png differ diff --git a/images/EP_context_nodes_with_different_eps.png b/images/EP_context_nodes_with_different_eps.png new file mode 100644 index 0000000000000..c7b986d0f9c89 Binary files /dev/null and b/images/EP_context_nodes_with_different_eps.png differ diff --git a/images/Onnx_weight_sharing.png b/images/Onnx_weight_sharing.png new file mode 100644 index 0000000000000..b3c277903ddfb Binary files /dev/null and b/images/Onnx_weight_sharing.png differ diff --git a/images/Ort_Qnn_Ep_weight_sharing.png b/images/Ort_Qnn_Ep_weight_sharing.png new file mode 100644 index 0000000000000..e8fa37d1bb2a4 Binary files /dev/null and b/images/Ort_Qnn_Ep_weight_sharing.png differ diff --git a/images/Qnn_weight_sharing.png b/images/Qnn_weight_sharing.png new file mode 100644 index 0000000000000..d415c3bfc57ca Binary files /dev/null and b/images/Qnn_weight_sharing.png differ diff --git a/images/nashville.jpg b/images/nashville.jpg new file mode 100644 index 0000000000000..da40173230e0c Binary files /dev/null and b/images/nashville.jpg differ diff --git a/src/app.html b/src/app.html index 5f79324942486..cdfdad8b3f2dc 100644 --- a/src/app.html +++ b/src/app.html @@ -36,6 +36,11 @@ }, propertyConfiguration: { // Properties Plugin configuration + gpcDataSharingOptIn: false, + callback: { + userConsentDetails: _getWcpUserConsentDetails + }, + env: 'PROD' // Environment can be set to PPE or PROD as needed. }, webAnalyticsConfiguration: { @@ -77,6 +82,7 @@ } }; + var siteConsent = null; WcpConsent.init( 'en-US', 'cookie-banner', @@ -91,6 +97,24 @@ WcpConsent.themes.light ); + function _getWcpUserConsentDetails() { + if (siteConsent) { + return siteConsent.getConsent(); + } + + // The exact value that you return here is dependent on your site, team and how + // use any data that is stored (work with you privacy team to determine what the + // correct "defaults" (true or false) should be for each item when the code is + // unable to determine (via WCP) if or what the user has (or has not) consented + // to. + return { + Required: [true], // Most likely `true` + Analytics: [true], + SocialMedia: [true], + Advertising: [false] + }; + } + function onConsentChanged(categoryPreferences) { if (categoryPreferences.Analytics) { // Google Analytics diff --git a/src/images/logos/autodesk-logo.png b/src/images/logos/autodesk-logo.png new file mode 100644 index 0000000000000..cb7d223734dbb Binary files /dev/null and b/src/images/logos/autodesk-logo.png differ diff --git a/src/images/logos/goodnotes-logo.png b/src/images/logos/goodnotes-logo.png new file mode 100644 index 0000000000000..86ee9ccee519a Binary files /dev/null and b/src/images/logos/goodnotes-logo.png differ diff --git a/src/routes/blogs/+page.svelte b/src/routes/blogs/+page.svelte index 8dda46876bcdc..bbc36db183912 100644 --- a/src/routes/blogs/+page.svelte +++ b/src/routes/blogs/+page.svelte @@ -366,6 +366,12 @@ } ]; let blogsCommunity = [ + { + title:'Running Phi-3 Mistral 7B LLMs on Raspberry Pi 5: A Step-by-Step Guide', + date: 'September 5, 2024', + link: 'https://medium.com/@vadikus/running-phi-3-mistral-7b-llms-on-raspberry-pi-5-a-step-by-step-guide-185e8102e35b', + blurb: 'Learn how to run Phi-3 Mistral 7B on Raspberry Pi 5 using the ONNX Runtime Gen AI library.' 
+ }, { title: 'Deploying a Production-Ready RAG Server: A Comprehensive Guide with LlamaIndex', diff --git a/src/routes/blogs/nimbleedge-x-onnxruntime/+page.svx b/src/routes/blogs/nimbleedge-x-onnxruntime/+page.svx index 7dc2f326cb3f5..48efc63953143 100644 --- a/src/routes/blogs/nimbleedge-x-onnxruntime/+page.svx +++ b/src/routes/blogs/nimbleedge-x-onnxruntime/+page.svx @@ -32,7 +32,7 @@ url: 'https://onnxruntime.ai/blogs/nimbleedge-x-onnxruntime' [NimbleEdge](https://www.nimbleedge.com/) is an on-device Machine Learning (ML) platform that enables real-time personalization in mobile apps, executing data capture, processing and ML inference on end users' mobile devices vs. on cloud. Using mobile compute efficiently to deliver optimal performance with minimal device resource usage is a key priority for NimbleEdge. For this, NimbleEdge leverages various ML inference runtimes, including, prominently, **ONNX Runtime**. -In this blog post, we'll explore how on-device compute can be leveraged for cost-efficient, privacy-preserving real-time ML in mobile apps, and how NimbleEdge leverages ONNX Runtime to enable this. We also share results from NimbleEdge's on-device deployment with Dream11, India's largest fantasy gaming platform with 200Mn+ users. +In this blog post, we'll explore how on-device compute can be leveraged for cost-efficient, privacy-preserving real-time ML in mobile apps, and how NimbleEdge leverages ONNX Runtime to enable this. We also share results from NimbleEdge’s on-device deployment with one of India’s largest fantasy gaming platforms with hundreds of millions of users. ### **Introduction** @@ -102,17 +102,17 @@ For inference execution, NimbleEdge utilizes a number of runtimes, prominently i Through the capabilities listed here, NimbleEdge's comprehensive on-device ML platform enables high performance real-time ML deployments in days vs. months. -### **Case Study: Real time ranking of fantasy sports contests for Dream11** +### **Case Study: Real time ranking of fantasy sports contests for leading Indian fantasy gaming co** -Dream11 is an Indian fantasy sports platform (like Fanduel/ Draftkings in USA) with 200M+ users, and a peak concurrency of ~15 million users. Dream11 offers thousands of fantasy contests across dozens of matches from 10+ sports, with each contest varying in contest entry amount, win %, and participant count. +Fantasy Gaming co (name obscured for confidentiality) is an Indian fantasy sports platform (like Fanduel/ Draftkings in USA) with hundreds of millions of users, and a peak concurrency of several million users. Fantasy Gaming co offers thousands of fantasy contests across dozens of matches from 10+ sports, with each contest varying in contest entry amount, win %, and no. of participants. -To streamline the user journey, Dream11 was running a recommendation system that delivered personalized contest recommendations to users, based on historical interactions. Dream11 analyzed customer clickstream data, and identified that incorporating in-session user interactions in the recommender systems would significantly improve quality of recommendations vs. leveraging batch predictions generated hourly. +To streamline the user journey, Fantasy Gaming co was running a recommendation system that delivered personalized contest recommendations to users, based on historical interactions. 
They analyzed customer clickstream data, and identified that incorporating in-session user interactions in the recommender systems would significantly improve quality of recommendations vs. leveraging batch predictions generated hourly. -Due to this, Dream11 was keen to deploy real-time, session-aware recommendations, but implementation was challenging due to the aforementioned challenges in real-time ML on cloud. Hence, Dream11 turned to on-device ML with NimbleEdge for implementing real-time personalized contest recommendations. +Due to this, Fantasy Gaming co was keen to deploy real-time, session-aware recommendations, but implementation was challenging due to the aforementioned challenges in real-time ML on cloud. Hence, Fantasy Gaming co turned to on-device ML with NimbleEdge for implementing real-time personalized contest recommendations. **Results** -With NimbleEdge, Dream11 is now able to generate features and predictions based on real-time user interactions, resulting in improved relevance of recommendations for millions of users. Additionally, inference was delivered at millisecond latency, with minimal battery and CPU usage impact! +With NimbleEdge, Fantasy Gaming co is now able to generate features and predictions based on real-time user interactions, resulting in improved relevance of recommendations for millions of users. Additionally, inference was delivered at millisecond latency, with minimal battery and CPU usage impact! **No. of inferences:** `7B+` diff --git a/src/routes/blogs/pytorch-on-the-edge/+page.svelte b/src/routes/blogs/pytorch-on-the-edge/+page.svelte index 83ab6d2d49db6..d0a9d765cd5f1 100644 --- a/src/routes/blogs/pytorch-on-the-edge/+page.svelte +++ b/src/routes/blogs/pytorch-on-the-edge/+page.svelte @@ -179,9 +179,9 @@ fun run(audioTensor: OnnxTensor): Result {

Run PyTorch models on the edge

-By: Natalie Kershaw
+By: Natalie Kershaw and
 Prasanth Pulavarthi

@@ -217,12 +217,12 @@ fun run(audioTensor: OnnxTensor): Result {
anywhere that is outside of the cloud, ranging from large, well-resourced personal computers to small footprint devices such as mobile phones. This has been a challenging task to accomplish in the past, but new advances in model optimization and software like ONNX Runtime make it more feasible - even for new generative AI and large language models like Stable Diffusion, Whisper, and Llama2.

Considerations for PyTorch models on the edge

There are several factors to keep in mind when thinking about running a PyTorch model on the
@@ -292,7 +292,7 @@ fun run(audioTensor: OnnxTensor): Result {

Tools for PyTorch models on the edge

We mentioned ONNX Runtime several times above. ONNX Runtime is a compact, standards-based
@@ -305,7 +305,7 @@ fun run(audioTensor: OnnxTensor): Result {
format that doesn't require the PyTorch framework and its gigabytes of dependencies. PyTorch has thought about this and includes an API that enables exactly this - torch.onnx. ONNX is an open standard that defines the operators that make up models. The PyTorch ONNX APIs take the Pythonic PyTorch code and turn it into a functional graph that captures the operators that are needed to run the model without Python. As with everything
@@ -318,7 +318,7 @@ fun run(audioTensor: OnnxTensor): Result {
The popular Hugging Face library also has APIs that build on top of this torch.onnx functionality to export models to the ONNX format. Over 130,000 models are supported making it very likely that the model you care about is one of them.
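
To make the torch.onnx step concrete, here is a minimal sketch of exporting a toy PyTorch module to ONNX. The module, output file name, and input/output names below are invented for illustration and are not taken from the blog post.

```python
import torch

# A stand-in model purely for illustration; any torch.nn.Module exports the same way.
class TinyClassifier(torch.nn.Module):
    def __init__(self):
        super().__init__()
        self.net = torch.nn.Sequential(
            torch.nn.Linear(16, 32),
            torch.nn.ReLU(),
            torch.nn.Linear(32, 4),
        )

    def forward(self, x):
        return self.net(x)

model = TinyClassifier().eval()
dummy_input = torch.randn(1, 16)

# Trace the module into an ONNX graph that no longer needs Python or PyTorch to run.
torch.onnx.export(
    model,
    dummy_input,
    "tiny_classifier.onnx",                    # hypothetical output path
    input_names=["features"],
    output_names=["logits"],
    dynamic_axes={"features": {0: "batch"}},   # allow a variable batch size
)
```

Hugging Face's export tooling wraps this same flow for hub models, so in many cases the conversion can be done without writing the export call by hand.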

@@ -328,7 +328,7 @@ fun run(audioTensor: OnnxTensor): Result {
and web browsers) via various languages (from C# to JavaScript to Swift).
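
The Python binding shows the shape of the API that the C#, JavaScript, and Swift bindings mirror. This is a hedged sketch that reuses the hypothetical `tiny_classifier.onnx` exported above, not code from the post:

```python
import numpy as np
import onnxruntime as ort

# Load the exported graph; the CPU execution provider is always available.
session = ort.InferenceSession("tiny_classifier.onnx", providers=["CPUExecutionProvider"])

# Feed a batch of two feature vectors and read back the logits.
batch = np.random.rand(2, 16).astype(np.float32)
(logits,) = session.run(None, {"features": batch})
print(logits.shape)  # (2, 4) for the toy model above
```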

Examples of PyTorch models on the edge

Stable Diffusion on Windows

@@ -345,7 +345,7 @@ fun run(audioTensor: OnnxTensor): Result {

You don't have to export the fifth model, ClipTokenizer, as it is available in ONNX Runtime extensions, a library for pre and post processing PyTorch models.
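
As a rough illustration of how a model that embeds an extensions operator (such as a tokenizer) is loaded, the sketch below registers the onnxruntime-extensions custom-op library before creating the session. The model file name is a placeholder, not the tutorial's actual artifact.

```python
import onnxruntime as ort
from onnxruntime_extensions import get_library_path

session_options = ort.SessionOptions()
# Make the extensions custom ops (tokenizers, image/audio helpers) visible to the runtime.
session_options.register_custom_ops_library(get_library_path())

# "clip_tokenizer.onnx" is a hypothetical model that uses an extensions tokenizer op.
session = ort.InferenceSession(
    "clip_tokenizer.onnx", session_options, providers=["CPUExecutionProvider"]
)
```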

@@ -353,7 +353,7 @@ fun run(audioTensor: OnnxTensor): Result {
To run this pipeline of models as a .NET application, we build the pipeline code in C#. This code can be run on CPU, GPU, or NPU, if they are available on your machine, using ONNX Runtime's device-specific hardware accelerators. This is configured with the ExecutionProviderTarget below.
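
The tutorial's C# pipeline selects the accelerator through its ExecutionProviderTarget setting; the equivalent idea in the Python API is the providers list passed when the session is created. A minimal sketch, with "unet.onnx" standing in as a placeholder model path rather than the tutorial's actual file:

```python
import onnxruntime as ort

available = ort.get_available_providers()

# Prefer DirectML or CUDA when the installed package and hardware expose them,
# and always keep CPU as the fallback.
preferred = [p for p in ("DmlExecutionProvider", "CUDAExecutionProvider") if p in available]
session = ort.InferenceSession("unet.onnx", providers=preferred + ["CPUExecutionProvider"])

print("Running with:", session.get_providers())
```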

@@ -366,7 +366,7 @@ fun run(audioTensor: OnnxTensor): Result {

You can build the application and run it on Windows with the detailed steps shown in this tutorial.

@@ -374,7 +374,7 @@ fun run(audioTensor: OnnxTensor): Result {

Running a PyTorch model locally in the browser is not only possible but super simple with the transformers.js library. Transformers.js uses ONNX Runtime Web as its backend. Many models are already converted to ONNX and served by the transformers.js CDN, making inference in the browser a matter of writing
@@ -407,7 +407,7 @@ fun run(audioTensor: OnnxTensor): Result {
All components of the Whisper Tiny model (audio decoder, encoder, decoder, and text sequence generation) can be composed and exported to a single ONNX model using the Olive framework. To run this model as part of a mobile application, you can use ONNX Runtime Mobile, which supports Android, iOS, react-native, and MAUI/Xamarin.
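
Before wiring the composed Whisper model into a mobile app, it can help to inspect what the export actually produced, since input and output names depend on the export configuration. A small sketch under assumptions: the model file name is hypothetical, and the extensions library is registered in case the exported graph uses its audio or text operators.

```python
import onnxruntime as ort
from onnxruntime_extensions import get_library_path

session_options = ort.SessionOptions()
session_options.register_custom_ops_library(get_library_path())

# Placeholder name for the single ONNX model produced by the export step.
session = ort.InferenceSession(
    "whisper_tiny_all.onnx", session_options, providers=["CPUExecutionProvider"]
)

# Print the graph's actual inputs and outputs instead of assuming their names.
for tensor in session.get_inputs():
    print("input :", tensor.name, tensor.shape, tensor.type)
for tensor in session.get_outputs():
    print("output:", tensor.name, tensor.shape, tensor.type)
```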

@@ -420,7 +420,7 @@ fun run(audioTensor: OnnxTensor): Result {

The relevant snippet of an example Android mobile app that performs speech transcription on short samples of audio is shown below:

@@ -476,11 +476,11 @@ fun run(audioTensor: OnnxTensor): Result {

You can read the full Speaker Verification tutorial, and build and run the application from source.

diff --git a/src/routes/components/customers.svelte b/src/routes/components/customers.svelte
index a5da8146bea27..6c6c7dce06171 100644
--- a/src/routes/components/customers.svelte
+++ b/src/routes/components/customers.svelte
@@ -8,32 +8,36 @@ import antgroupLogo from '../../images/logos/antgroup-logo.png';
 import algoriddimLogo from '../../images/logos/algoriddim-logo.png';
 import ATLASLogo from '../../images/logos/ATLAS-logo.png';
+ import autodeskLogo from '../../images/logos/autodesk-logo.png';
 import bazaarvoiceLogo from '../../images/logos/bazaarvoice-logo.png';
 import camoLogo from '../../images/logos/camo-logo.png';
 import cephableLogo from '../../images/logos/cephable-logo.png';
 import clearbladeLogo from '../../images/logos/clearblade-logo.png';
 import deezerLogo from '../../images/logos/deezer-logo.png';
+ import goodnotesLogo from '../../images/logos/goodnotes-logo.png';
+ import huggingfaceLogo from '../../images/logos/huggingface-logo.png';
 import hypefactorsLogo from '../../images/logos/hypefactors-logo.png';
 import infarmLogo from '../../images/logos/infarm-logo.png';
 import intelLogo from '../../images/logos/intel-logo.png';
 import intelligenzaEticaLogo from '../../images/logos/intelligenza-etica-logo.png';
- import navitaireAmadeusLogo from '../../images/logos/navitaire-amadeus-logo.png';
- import PeakSpeedLogo from '../../images/logos/PeakSpeed_logo.png';
+ import navitaireLogo from '../../images/logos/navitaire-amadeus-logo.png';
+ import nvidiaLogo from '../../images/logos/nvidia.png';
+ import opennlpLogo from '../../images/logos/opennlp-logo.png';
+ import oracleLogo from '../../images/logos/oracle-logo.png';
+ import peakspeedLogo from '../../images/logos/PeakSpeed_logo.png';
 import piecesLogo from '../../images/logos/pieces-logo.png';
+ import ptwLogo from '../../images/logos/ptw-logo.png';
 import redisLogo from '../../images/logos/redis-logo.png';
- import RockchipLogo from '../../images/logos/Rockchip-logo.png';
+ import rockchipLogo from '../../images/logos/Rockchip-logo.png';
 import samtecLogo from '../../images/logos/samtec-logo.png';
 import sasLogo from '../../images/logos/sas-logo.png';
 import teradataLogo from '../../images/logos/teradata-logo.png';
 import topazlabsLogo from '../../images/logos/topazlabs-logo.png';
- import ueLogo from '../../images/logos/ue-logo.png';
+ import unrealengineLogo from '../../images/logos/ue-logo.png';
 import usdaLogo from '../../images/logos/usda-logo.png';
 import vespaLogo from '../../images/logos/vespa-logo.png';
 import writerLogo from '../../images/logos/writer-logo.png';
 import xilinxLogo from '../../images/logos/xilinx-logo.png';
- import huggingfaceLogo from '../../images/logos/huggingface-logo.png';
- import nvidiaLogo from '../../images/logos/nvidia.png';
- import oracleLogo from '../../images/logos/oracle-logo.png';
 const testimonials = [
 {
@@ -61,6 +65,11 @@ src: ATLASLogo, alt: 'ATLAS' }, + { + href: './testimonials#Autodesk', + src: autodeskLogo, + alt: 'Autodesk' + }, { href: './testimonials#Bazaarvoice', src: bazaarvoiceLogo,
@@ -86,6 +95,11 @@ src: deezerLogo, alt: 'Deezer' }, + { + href: './testimonials#Goodnotes', + src: goodnotesLogo, + alt: 'GoodNotes' + }, { href: './testimonials#Hugging%20Face', src: huggingfaceLogo,
@@ -113,7 +127,7 @@ }, { href: './testimonials#Navitaire', - src: navitaireAmadeusLogo, + src: navitaireLogo, alt: 'Navitaire' }, {
@@ -121,6 +135,11 @@ src: nvidiaLogo, alt: 'NVIDIA' }, + { + href: './testimonials#Apache%20OpenNLP', + src: opennlpLogo, + alt: 'Apache OpenNLP' + }, { href: './testimonials#Oracle', src: oracleLogo,
@@ -128,7 +147,7 @@ }, { href: './testimonials#Peakspeed', - src: PeakSpeedLogo, + src: peakspeedLogo, alt: 'Peakspeed' }, {
@@ -136,6 +155,11 @@ src: piecesLogo, alt: 'Pieces' }, + { + href: './testimonials#PTW%20Dosimetry', + src: ptwLogo, + alt: 'PTW Dosimetry' + }, { href: './testimonials#Redis', src: redisLogo,
@@ -143,7 +167,7 @@ }, { href: './testimonials#Rockchip', - src: RockchipLogo, + src: rockchipLogo, alt: 'Rockchip' }, {
@@ -168,7 +192,7 @@ }, { href: './testimonials#Unreal%20Engine', - src: ueLogo, + src: unrealengineLogo, alt: 'Unreal Engine' }, {
diff --git a/src/routes/components/footer.svelte b/src/routes/components/footer.svelte
index b030524976742..e6b855d0ca129 100644
--- a/src/routes/components/footer.svelte
+++ b/src/routes/components/footer.svelte
@@ -9,7 +9,7 @@