
Key technologies mentioned include:

  1. DirectML - a low-level API for hardware-accelerated machine learning built on top of DirectX 12.
  2. CUDA - a parallel computing platform and application programming interface (API) model developed by Nvidia, enabling general-purpose processing on graphics processing units (GPUs).
  3. ONNX (Open Neural Network Exchange) - an open format designed to represent machine learning models that provides interoperability between different ML frameworks.
  4. GGUF - a binary file format used by llama.cpp and the GGML library to store models for inference, particularly useful for smaller language models that can run effectively on CPUs with 4- to 8-bit quantization.

DirectML

DirectML is a low-level API that enables hardware-accelerated machine learning. It’s built on top of DirectX 12 to utilize GPU acceleration and is vendor-agnostic, meaning it doesn’t require code changes to work across different GPU vendors. It’s primarily used for model training and inferencing workloads on GPUs.

As for the hardware support, DirectML is designed to work with a wide range of GPUs, including AMD integrated and discrete GPUs, Intel integrated GPUs, and NVIDIA discrete GPUs. It’s part of the Windows AI Platform and is supported on Windows 10 & 11, allowing for model training and inferencing on any Windows device.

Recent updates to DirectML include support for up to 150 ONNX operators, and it is used by both ONNX Runtime and WinML. It is backed by major Independent Hardware Vendors (IHVs), each of which implements various metacommands.
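
As a minimal illustration of how DirectML is typically reached from application code, the sketch below uses ONNX Runtime's DirectML execution provider. It assumes the onnxruntime-directml package is installed; the model path and input shape are placeholders.

```python
# A minimal sketch: running an ONNX model on DirectML via ONNX Runtime.
# Assumes the onnxruntime-directml package is installed; "model.onnx" and the
# input shape below are placeholders for a real model.
import numpy as np
import onnxruntime as ort

session = ort.InferenceSession(
    "model.onnx",
    providers=["DmlExecutionProvider", "CPUExecutionProvider"],  # fall back to CPU if DirectML is unavailable
)

input_name = session.get_inputs()[0].name
dummy_input = np.random.rand(1, 3, 224, 224).astype(np.float32)  # shape depends on the model
outputs = session.run(None, {input_name: dummy_input})
print(outputs[0].shape)
```

Because DirectML is vendor-agnostic, the same script can run on AMD, Intel, or NVIDIA GPUs without code changes.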

CUDA

CUDA, which stands for Compute Unified Device Architecture, is a parallel computing platform and application programming interface (API) model created by Nvidia. It allows software developers to use a CUDA-enabled graphics processing unit (GPU) for general purpose processing – an approach termed GPGPU (General-Purpose computing on Graphics Processing Units). CUDA is a key enabler of Nvidia’s GPU acceleration and is widely used in various fields, including machine learning, scientific computing, and video processing.

Hardware support for CUDA is specific to Nvidia's GPUs, as it is a proprietary technology developed by Nvidia. Each GPU architecture supports specific versions of the CUDA Toolkit, which provides the libraries and tools developers need to build and run CUDA applications.
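
In practice, many developers use CUDA indirectly through frameworks. The sketch below uses PyTorch, one common way CUDA acceleration is exposed in machine learning code; it assumes PyTorch was installed with CUDA support and an Nvidia GPU is present.

```python
# A minimal sketch: using CUDA through PyTorch for a simple GPU computation.
# Assumes a CUDA-enabled build of PyTorch and an Nvidia GPU; otherwise the
# code falls back to the CPU.
import torch

device = torch.device("cuda" if torch.cuda.is_available() else "cpu")
print(f"Running on: {device}")

# Allocate two matrices on the selected device and multiply them there.
a = torch.randn(1024, 1024, device=device)
b = torch.randn(1024, 1024, device=device)
c = a @ b  # executed by CUDA kernels when the device is "cuda"
print(c.shape)
```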

ONNX

ONNX (Open Neural Network Exchange) is an open format designed to represent machine learning models. It provides a definition of an extensible computation graph model, as well as definitions of built-in operators and standard data types. ONNX allows developers to move models between different ML frameworks, enabling interoperability and making it easier to create and deploy AI applications.
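
The sketch below illustrates that interoperability under simple assumptions: a small PyTorch model is exported to the ONNX format and then executed with ONNX Runtime. The file name and model are placeholders.

```python
# A minimal sketch of ONNX interoperability: export a PyTorch model to ONNX,
# then run it with a different runtime (ONNX Runtime). Names are illustrative.
import torch
import onnxruntime as ort

model = torch.nn.Linear(4, 2)
model.eval()
dummy = torch.randn(1, 4)

# Export to the ONNX format so any ONNX-compatible runtime can consume it.
torch.onnx.export(model, dummy, "linear.onnx", input_names=["x"], output_names=["y"])

# Load the exported graph with ONNX Runtime and run inference on the CPU.
session = ort.InferenceSession("linear.onnx", providers=["CPUExecutionProvider"])
result = session.run(None, {"x": dummy.numpy()})
print(result[0])
```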

Phi-3 mini can run with ONNX Runtime on CPU and GPU across devices, including server platforms, Windows, Linux, and Mac desktops, and mobile CPUs. The following optimized configurations have been added; a short usage sketch follows the list:

  • ONNX models for int4 DML: Quantized to int4 via AWQ
  • ONNX model for fp16 CUDA
  • ONNX model for int4 CUDA: Quantized to int4 via RTN
  • ONNX model for int4 CPU and Mobile: Quantized to int4 via RTN
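
As a rough illustration, the sketch below uses the onnxruntime-genai package to generate text from one of these ONNX builds of Phi-3 mini. The model directory is a placeholder, and the exact API calls shown match early onnxruntime-genai releases; they may differ in newer versions.

```python
# A hedged sketch: generating text from an ONNX build of Phi-3 mini with
# onnxruntime-genai. The model path is a placeholder and the API may vary
# between package versions.
import onnxruntime_genai as og

model = og.Model("Phi-3-mini-4k-instruct-onnx/cpu-int4-rtn-block-32")  # placeholder path
tokenizer = og.Tokenizer(model)
stream = tokenizer.create_stream()

params = og.GeneratorParams(model)
params.set_search_options(max_length=256)
params.input_ids = tokenizer.encode("<|user|>What is ONNX?<|end|><|assistant|>")

generator = og.Generator(model, params)
while not generator.is_done():
    generator.compute_logits()
    generator.generate_next_token()
    print(stream.decode(generator.get_next_tokens()[0]), end="", flush=True)
```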

Llama.cpp

Llama.cpp is an open-source software library written in C++. It performs inference on various Large Language Models (LLMs), including Llama. Developed alongside the ggml library (a general-purpose tensor library), llama.cpp aims to provide faster inference and lower memory usage than the original Python implementation. It supports hardware optimization and quantization, and offers a simple API and examples. If you’re interested in efficient LLM inference, llama.cpp is worth exploring, as Phi-3 can run on llama.cpp.
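
One common way to call llama.cpp from Python is through the llama-cpp-python bindings. The sketch below assumes that package is installed and that a quantized GGUF file is available locally; the file name and generation settings are placeholders.

```python
# A minimal sketch: running a quantized GGUF model with llama.cpp via the
# llama-cpp-python bindings. The model file name is a placeholder.
from llama_cpp import Llama

llm = Llama(
    model_path="Phi-3-mini-4k-instruct-q4.gguf",  # placeholder GGUF file
    n_ctx=4096,   # context window size
    n_threads=8,  # CPU threads used for inference
)

output = llm(
    "Explain what quantization means in one sentence.",
    max_tokens=64,
    temperature=0.7,
)
print(output["choices"][0]["text"])
```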

GGUF

GGUF is a binary file format used by llama.cpp and the GGML library to store models for inference. It is particularly useful for smaller language models (SLMs) that can run effectively on CPUs with 4- to 8-bit quantization. GGUF is beneficial for rapid prototyping and for running models on edge devices or in batch jobs such as CI/CD pipelines.
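
A GGUF file carries both the model tensors and key/value metadata in a single file. The sketch below assumes the `gguf` Python package (published from the llama.cpp repository) is installed and inspects a file's metadata and tensors; the path is a placeholder.

```python
# A hedged sketch: inspecting the metadata and tensors stored in a GGUF file
# with the `gguf` package. "model.gguf" is a placeholder path.
from gguf import GGUFReader

reader = GGUFReader("model.gguf")

# GGUF stores model metadata as key/value fields alongside the tensor data.
for key in reader.fields:
    print("metadata key:", key)

# Each tensor entry records its name, shape, and quantization type.
for tensor in reader.tensors[:5]:
    print(tensor.name, list(tensor.shape), tensor.tensor_type.name)
```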