Intel® Extension for Transformers v1.1 Release
- Highlights
- Features
- Productivity
- Examples
- Bug Fixing
- Documentation
Highlights
- Created NeuralChat, the first 7B commercially friendly chat model ranked at the top of the LLM leaderboard
- Supported efficient fine-tuning and inference on Xeon SPR and Habana Gaudi
- Enabled 4-bit LLM inference in a plain C++ implementation, outperforming llama.cpp
- Supported quantization for a broad range of LLMs with the improved lm-evaluation-harness for multiple frameworks and data types (see the evaluation sketch below)
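For context on the lm-evaluation-harness item above, a minimal sketch using the upstream EleutherAI harness's Python API is shown below. The Intel-extended fork referenced in this release may expose different entry points; the model name, task, and argument names are placeholders and vary across harness versions.

```python
# Minimal sketch with the upstream EleutherAI lm-evaluation-harness Python API.
# The model name and task are placeholders; swap in the quantized checkpoint to evaluate.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",                          # Hugging Face causal-LM backend
    model_args="pretrained=facebook/opt-2.7b",  # placeholder model
    tasks=["lambada_openai"],
    batch_size=8,
)
print(results["results"])                       # per-task accuracy/perplexity metrics
```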
Features
- Model Optimization
- Language modeling quantization for OPT-2.7B, OPT-6.7B, LLAMA-7B (commit 6a9608), MPT-7B and Falcon-7B (commit f6ca74)
- Text2text-generation quantization for T5, Flan-T5 (commit a9b69b)
- Text-generation quantization for Bloom (commit e44270), MPT (commit 469ac6)
- Enable QAT for Stable Diffusion (commit 2e2efd)
- Replace PyTorch Pruner with INC Pruner (commit 9ea1e3)
- Transformers-accelerated Neural Engine
- Support PyTorch model as input of Neural Engine (commit e83a51, 3625db)
- Inference with cpp graph: MPT-7B, LLAMA-7B, GPT-NeoX-20B (commit 970bfa), Falcon-7B (commit 762723)
- Inference with weight-only compression (commit d87132, 0065db, d30eff)
- Reduce memory usage of inference (commit 36f3e9, 2dc594, 3f6b47, 5f75df, 7860f9)
- Stable Diffusion on Windows (commit 52d5e6)
- MHA for BERT (commit 59af3af)
- Transformers-accelerated Libraries
- MHA kernels for static quantization, dynamic quantization, and BF16 (commit 0d0932, e61e4b)
- Support dynamic quantization matmul and post-op (commit 4cb9e4, cf0400, 9acfe1)
- Int4 weight-only kernels (commit 3b7665) and fusion (commit f00d87); see the weight-only quantization sketch after this list
- Support dynamic quantization op (commit 6fcc15)
- Add AVX2 kernels for Windows (commit bc313c)
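The weight-only compression and Int4 weight-only kernel items above share the same underlying idea: quantize only the weights, typically per group of consecutive elements, and keep activations in floating point. The pure-PyTorch sketch below illustrates symmetric group-wise INT4 numerics only; it is not the engine's packed kernel code, and the group size of 32 is an illustrative assumption.

```python
# Minimal sketch of group-wise 4-bit weight-only quantization (numerics only).
import torch

def quantize_int4_groupwise(w: torch.Tensor, group_size: int = 32):
    """Symmetric per-group INT4 quantization of a 2-D weight matrix."""
    out_features, in_features = w.shape
    w_groups = w.reshape(out_features, in_features // group_size, group_size)
    scales = w_groups.abs().amax(dim=-1, keepdim=True) / 7.0   # INT4 range: [-8, 7]
    q = torch.clamp(torch.round(w_groups / scales), -8, 7).to(torch.int8)
    return q, scales

def dequantize_int4_groupwise(q: torch.Tensor, scales: torch.Tensor, shape):
    return (q.float() * scales).reshape(shape)

w = torch.randn(256, 1024)
q, s = quantize_int4_groupwise(w)
w_hat = dequantize_int4_groupwise(q, s, w.shape)
print((w - w_hat).abs().mean())   # quantization error stays small
```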
Productivity
- Enable LoRA fine-tuning (commit 664f4b), multi-node fine-tuning (commit 6288fd), and Xeon and Habana inference (commit 8ea55b) for Chatbot; see the LoRA sketch after this list
- Enable Docker for Chatbot (commit 6b9522, 37b455)
- Support Parameter-Efficient Fine-Tuning (PEFT) (commit 27bd7f)
- Update Torch and TensorFlow (commit f54817)
- Add Harness evaluation for PyTorch text-generation/language modeling (commit 736921, c7c557, b492f5) and ONNX (commit a944fa)
- Add summarization evaluation for PyTorch (commit 062e62)
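For the LoRA and PEFT items above, a minimal fine-tuning sketch with Hugging Face transformers and peft is shown below; the base model name and target modules are placeholders, and the release's chatbot scripts wrap a similar flow with their own arguments and data pipeline.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

model_name = "facebook/opt-2.7b"                        # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                                # adapter rank
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],                # depends on the base architecture
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()                      # only the adapters are trainable
# ...then fine-tune with transformers.Trainer or a custom training loop as usual.
```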
Examples
- Early Exit: TangoBERT, Separating Weights for Early-Exit Transformers (SWEET) (commit dfbdc5, c0eaa5)
- Electra fp32 & bf16 inference (commit e09c96)
- GPT-NeoX and Dolly-v2-7B text-generation inference (commit 402bb9)
- Stable Diffusion v2.1 inference (commit 5affab), image-to-image (commit a13e11), and inference with dynamic quantization (commit bfcb2e)
- ONNX Whisper-large quantization (commit 038be0); see the ONNX quantization sketch after this list
- 8-layer MiniLM inference (commit 0dd104)
- Add compression-aware training (commit dfb53f), sparse-aware training (commit 7b28ef), and fine-tuning and inference workflows (commit bf666c)
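For the ONNX Whisper quantization example above, a generic dynamic-quantization sketch with ONNX Runtime is shown below; the release's example may use different tooling (e.g. Intel Neural Compressor), and the file paths are placeholders.

```python
# Generic ONNX Runtime dynamic quantization: weights are quantized to INT8 offline,
# while activations are quantized on the fly at inference time.
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="whisper-large-encoder.onnx",        # placeholder: exported FP32 model
    model_output="whisper-large-encoder-int8.onnx",
    weight_type=QuantType.QInt8,
)
```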
Bug Fixing
Intel® Extension for Transformers v1.0.1 Release
- Bug Fixing
- Improvement
Bug Fixing
Improvement
- Enable new fusion patterns for GPT-J (commit c73605)
- Refine Chatbot data loading and data cleaning (commit f70205, 0997ac)
Validated Configurations
- CentOS 8.4 & Ubuntu 22.04 & Windows 10
- Python 3.8, 3.9
- TensorFlow 2.10.1
- PyTorch 1.13.1+cpu
- Intel® Extension for PyTorch 1.13.1+cpu
Intel® Extension for Transformers v1.0.0 Release
- Highlights
- Features
- Productivity
- Examples
- Bug Fixing
- Documentation
Highlights
- Provide optimal model packages for large language models (LLMs) such as GPT-J, GPT-NeoX, T5-large/base, Flan-T5, and Stable Diffusion
- Provide the end-to-end optimized workflows such as SetFit-based sentiment analysis, Document Level Sentiment Analysis (DLSA), and Length Adaptive Transformer for inference
- Support NeuralChat, a customized chatbot based on domain-knowledge fine-tuning, and demonstrate fine-tuning in less than one hour with PEFT on 4 SPR nodes
- Demonstrate the industry-leading sparse model inference solution in the MLPerf v3.0 open submission, with up to 1.6x better performance than other submissions
Features
- Model Optimization
- LLM quantization including GPT-J (6B), GPT-NEOX (2.7B), T5-large, T5-base, Flan-T5, BLOOM-176B
- Enable basic Neural Architecture Search (commit 6cae)
- Transformers-accelerated Neural Engine
- Transformers-accelerated Libraries
Productivity
- Support native PyTorch model as input of Neural Engine (commit bc38)
- Refine the Benchmark API to provide apples-to-apples benchmarking capability (commit e135)
- Simplify end-to-end example usage (commit 6b9c)
- N in M / N x M PyTorch pruning API enhancement (commit da4d); see the structured-sparsity sketch after this list
- Deliver engine-only wheel with 60% size reduction (commit 02ac)
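The N in M / N x M pruning item above refers to structured sparsity where, within every group of M consecutive weights, only N may remain non-zero. The pure-PyTorch sketch below builds such a mask for the common 2:4 pattern; it illustrates the constraint only, not the repository's pruning API.

```python
import torch

def n_in_m_mask(w: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Keep the n largest-magnitude weights in every group of m consecutive weights."""
    groups = w.reshape(-1, m)
    # indices of the (m - n) smallest-magnitude weights in each group
    drop = groups.abs().topk(m - n, dim=1, largest=False).indices
    mask = torch.ones_like(groups)
    mask.scatter_(1, drop, 0.0)
    return mask.reshape(w.shape)

w = torch.randn(128, 256)
mask = n_in_m_mask(w)
print((mask.reshape(-1, 4).sum(dim=1) == 2).all())   # every group keeps exactly 2 weights
```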
Examples
- End-to-end solution for Length Adaptive with Neural Engine, achieving over 11x speedup compared with BERT-Base on SPR (commit 95c6)
- End-to-end Document Level Sentiment Analysis (DLSA) workflow (commit 154a)
- N in M / N x M BERT-Large and BERT-Base pruning in PyTorch (commit da4d)
- Sparse pruning example for Longformer with 80% sparsity (commit 5c5a)
- Distillation for quantization for BERT and Stable Diffusion (commit 8856, 4457)
- Smooth quantization with BLOOM (commit edc9); see the smoothing sketch after this list
- Longformer quantization with question-answering task (commit 8805)
- Provide SetFit workflow notebook (commit 6b9c, 2851)
- Support Text Generation task (commit c593)
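For the smooth-quantization example above, the core idea (as in the SmoothQuant technique) is to migrate activation outliers into the weights with a per-channel scale before quantization. The pure-PyTorch sketch below only computes and applies the smoothing scales; the alpha value and tensor shapes are illustrative.

```python
# Per-channel smoothing scale: s_j = max|X_j|^alpha / max|W_j|^(1 - alpha),
# then quantize X / s and W * s while keeping X @ W.T mathematically unchanged.
import torch

def smooth_scales(act: torch.Tensor, weight: torch.Tensor, alpha: float = 0.5):
    # act: (tokens, in_features) calibration activations; weight: (out, in_features)
    act_max = act.abs().amax(dim=0)          # per input channel
    w_max = weight.abs().amax(dim=0)
    return (act_max.pow(alpha) / w_max.pow(1 - alpha)).clamp(min=1e-5)

act = torch.randn(512, 768) * torch.rand(768) * 10   # simulate outlier channels
weight = torch.randn(3072, 768)
s = smooth_scales(act, weight)
act_smoothed, weight_smoothed = act / s, weight * s
# (act / s) @ (weight * s).T equals act @ weight.T, but act / s is easier to quantize.
print((act @ weight.T - act_smoothed @ weight_smoothed.T).abs().max())  # ~ float noise
```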
Bug Fixing
- Improve BERT QAT tuning duration (commit 6b9c)
- Fix Length Adaptive Transformer regression (commit 5473)
- Fix accelerated lib compile error when enabling VTune (commit b5cd)
Documentation
- Refine contents of all readme files
- API helper based on the GitHub.io page (commit e107)
- DevCatalog for Mt. Whitney (commit acb6)
Validated Configurations
- CentOS 8.4 & Ubuntu 20.04 & Windows 10
- Python 3.7, 3.8, 3.9, 3.10
- Intel® Extension for TensorFlow 2.10.1, 2.11.0
- PyTorch 1.12.0+cpu, 1.13.0+cpu
- Intel® Extension for PyTorch 1.12.0+cpu, 1.13.0+cpu
Intel® Extension for Transformers v1.0b Release
- Highlights
- Features
- Productivity
- Examples
- Bug Fixing
- Documentation
Highlights
- Intel® Extension for Transformers provides more compression examples for popular applications such as Stable Diffusion. For Stable Diffusion, we support INT8 quantization with PyTorch and BF16 fine-tuning with Intel® Extension for PyTorch (a minimal BF16 sketch follows).
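A minimal sketch of the BF16 training path with Intel® Extension for PyTorch is shown below, using a toy module in place of the diffusion model; the release's Stable Diffusion recipe wraps a similar ipex.optimize + autocast flow inside a full training loop with diffusers.

```python
import torch
import intel_extension_for_pytorch as ipex

# Toy module standing in for the diffusion model; the real recipe uses diffusers.
model = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.GELU())
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# For training, ipex.optimize prepares both the model and the optimizer for BF16.
model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.bfloat16)

x, target = torch.randn(8, 512), torch.randn(8, 512)
with torch.cpu.amp.autocast(dtype=torch.bfloat16):
    out = model(x)
loss = torch.nn.functional.mse_loss(out.float(), target)
loss.backward()
optimizer.step()
```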
Features
- Pruning/Sparsity
- Transformers-accelerated Neural Engine
- Support inference on Windows (fc580d5)
- Transformers-accelerated Libraries
- Support INT8 Softmax operator (fece837)
Productivity
- Simplify the integration with Alibaba BladeDISC
Examples
- Support INT8 quantization for large language model (T5-base example) with PyTorch
- Support INT8 Vision Transformer examples (ViT-base and ViT-large) in Neural Engine
- Support FP32 LAT example in Neural Engine
- Support INT8 quantization of the top 5 Hugging Face TensorFlow models
Bug Fixing
- Fix Protobuf and ONNX version dependency issue
- Fix memory leak in Neural Engine
Documentation
- Create Notebook for Pruning/Compression Orchestration/IPEX Quantization
- Refine the user guide and compression example
Validated Configurations
- CentOS 8.4 & Ubuntu 20.04 & Windows 10
- Python 3.7, 3.8, 3.9
- Intel® Extension for TensorFlow 2.9.1, 2.10.0
- PyTorch 1.11.0+cpu, 1.12.0+cpu, 1.13.0+cpu, Intel® Extension for PyTorch 1.12.0+cpu, 1.13.0+cpu
Intel® Extension for Transformers v1.0a Release
- Highlights
- Features
- Productivity
- Examples
Highlights
- Intel® Extension for Transformers provides a rich set of model compression techniques, leading sparsity-aware libraries, and a neural engine to accelerate the inference of Transformer-based models on Intel platforms. We published 2 papers at NeurIPS 2022 with the source code released:
- Fast DistilBERT on CPUs: outperforms the state-of-the-art Neural Magic DeepSparse runtime by up to 50%, and delivers 7x better performance on c6i.12xlarge (Ice Lake) than c6a.12xlarge (AMD Milan)
- QuaLA-MiniLM: outperforms BERT-base with ~3x smaller size and demonstrates up to 8.8x speedup with <1% accuracy loss on the SQuAD1.1 task
Features
- Pruning/Sparsity
- Support Distributed Pruning on PyTorch
- Support Distributed Pruning on TensorFlow
- Quantization
- Support Distributed Quantization on PyTorch
- Support Distributed Quantization on TensorFlow
- Distillation
- Support Distributed Distillation on PyTorch
- Support Distributed Distillation on TensorFlow
- Compression Orchestration
- Support Distributed Orchestration on PyTorch
- Neural Architecture Search (NAS)
- Support auto distillation with NAS and flash distillation on PyTorch
- Length Adaptive Transformer (LAT)
- Support Dynamic Transformer on SQuAD1.1 on PyTorch
- Transformers-accelerated Neural Engine
- Support inference with sparse GEMM fusion patterns
- Support automatic benchmarking of sparse and dense mixed model
- Transformers-accelerated Libraries
- Support 1x4 block-wise sparse VNNI-INT8 GEMM kernels with post-ops
- Support 1x16 block-wise sparse AMX-BF16 GEMM kernels with post-ops
Productivity
- Support seamless Transformers-extended APIs
- Support experimental model conversion from PyTorch INT8 model to ONNX INT8
- Support VTune performance tracing for sparse GEMM kernels
Examples
- LAT examples for MiniLM (NeurIPS’2022)
- Fast DistilBERT on CPUs (NeurIPS’2022)
- PyTorch distributed compression orchestration examples
- Post-training quantization with the Transformers non-trainer API
- PyTorch auto distillation (NAS based) examples
- Multiple examples of Quantization/Pruning/Distillation on PyTorch and TensorFlow
- Post-training static quantization examples via Intel® Extension for PyTorch, plus deployment examples; see the sketch after this list
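A hedged sketch of post-training static quantization with Intel® Extension for PyTorch is shown below, following the IPEX 1.13-era prepare/convert API; the toy model and random calibration data are placeholders, and the repository's examples add dataset-specific calibration and accuracy checks.

```python
import torch
import intel_extension_for_pytorch as ipex
from intel_extension_for_pytorch.quantization import convert, prepare

model = torch.nn.Sequential(
    torch.nn.Linear(128, 256), torch.nn.ReLU(), torch.nn.Linear(256, 10)
).eval()
example_input = torch.randn(4, 128)

qconfig = ipex.quantization.default_static_qconfig           # IPEX 1.13-era name
prepared = prepare(model, qconfig, example_inputs=example_input, inplace=False)

for _ in range(10):                                          # calibration with representative data
    prepared(torch.randn(4, 128))

quantized = convert(prepared)
with torch.no_grad():
    traced = torch.jit.trace(quantized, example_input)       # freeze for deployment
    traced = torch.jit.freeze(traced)
print(traced(example_input).shape)
```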
Validated Configurations
- CentOS 8.4 & Ubuntu 20.04
- Python 3.7, 3.8, 3.9, 3.10
- TensorFlow 2.9.1, 2.10.0, Intel® Extension for TensorFlow 2.9.1, 2.10.0
- PyTorch 1.12.0+cpu, 1.13.0+cpu, Intel® Extension for PyTorch 1.12.0