Intel® Extension for Transformers v1.1 Release
- Highlights
- Features
- Productivity
- Examples
- Bug Fixing
- Documentation
Highlights
- Created NeuralChat, the first 7B commercially friendly chat model ranked at the top of the LLM leaderboard
- Supported efficient fine-tuning and inference on Xeon SPR and Habana Gaudi
- Enabled 4-bit LLM inference in a plain C++ implementation, outperforming llama.cpp
- Supported quantization for a broad range of LLMs with the improved lm-evaluation-harness for multiple frameworks and data types (see the evaluation sketch below)
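For context on the lm-evaluation-harness item above, a minimal sketch using the upstream EleutherAI harness's Python API is shown below. The Intel-extended fork referenced in this release may expose different entry points; the model name, task, and argument names are placeholders and vary across harness versions.

```python
# Minimal sketch with the upstream EleutherAI lm-evaluation-harness Python API.
# The model name and task are placeholders; swap in the quantized checkpoint to evaluate.
from lm_eval import evaluator

results = evaluator.simple_evaluate(
    model="hf-causal",                          # Hugging Face causal-LM backend
    model_args="pretrained=facebook/opt-2.7b",  # placeholder model
    tasks=["lambada_openai"],
    batch_size=8,
)
print(results["results"])                       # per-task accuracy/perplexity metrics
```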
Features
- Model Optimization
- Language modeling quantization for OPT-2.7B, OPT-6.7B, LLAMA-7B (commit 6a9608), MPT-7B and Falcon-7B (commit f6ca74)
- Text2text-generation quantization for T5, Flan-T5 (commit a9b69b)
- Text-generation quantization for Bloom (commit e44270), MPT (commit 469ac6)
- Enable QAT for Stable Diffusion (commit 2e2efd)
- Replace PyTorch Pruner with INC Pruner (commit 9ea1e3)
- Transformers-accelerated Neural Engine
- Support PyTorch model as input of Neural Engine (commit e83a51, 3625db)
- Inference with cpp graph: MPT-7B, LLAMA-7B, GPT-NeoX-20B (commit 970bfa), Falcon-7B (commit 762723)
- Inference with weight-only compression (commit d87132, 0065db, d30eff)
- Reduce memory usage of inference (commit 36f3e9, 2dc594, 3f6b47, 5f75df, 7860f9)
- Stable Diffusion on Windows (commit 52d5e6)
- MHA for BERT (commit 59af3af)
- Transformers-accelerated Libraries
- MHA kernels for static quantization, dynamic quantization, and BF16 (commit 0d0932, e61e4b)
- Support dynamic quantization matmul and post-op (commit 4cb9e4, cf0400, 9acfe1)
- Int4 weight-only kernels (commit 3b7665) and fusion (commit f00d87); see the weight-only quantization sketch after this list
- Support dynamic quantization op (commit 6fcc15)
- Add AVX2 kernels for Windows (commit bc313c)
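The weight-only compression and Int4 weight-only kernel items above share the same underlying idea: quantize only the weights, typically per group of consecutive elements, and keep activations in floating point. The pure-PyTorch sketch below illustrates symmetric group-wise INT4 numerics only; it is not the engine's packed kernel code, and the group size of 32 is an illustrative assumption.

```python
# Minimal sketch of group-wise 4-bit weight-only quantization (numerics only).
import torch

def quantize_int4_groupwise(w: torch.Tensor, group_size: int = 32):
    """Symmetric per-group INT4 quantization of a 2-D weight matrix."""
    out_features, in_features = w.shape
    w_groups = w.reshape(out_features, in_features // group_size, group_size)
    scales = w_groups.abs().amax(dim=-1, keepdim=True) / 7.0   # INT4 range: [-8, 7]
    q = torch.clamp(torch.round(w_groups / scales), -8, 7).to(torch.int8)
    return q, scales

def dequantize_int4_groupwise(q: torch.Tensor, scales: torch.Tensor, shape):
    return (q.float() * scales).reshape(shape)

w = torch.randn(256, 1024)
q, s = quantize_int4_groupwise(w)
w_hat = dequantize_int4_groupwise(q, s, w.shape)
print((w - w_hat).abs().mean())   # quantization error stays small
```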
Productivity
- Enable LoRA fine-tuning (commit 664f4b), multi-node fine-tuning (commit 6288fd), and Xeon and Habana inference (commit 8ea55b) for Chatbot; see the LoRA sketch after this list
- Enable Docker for Chatbot (commit 6b9522, 37b455)
- Support Parameter-Efficient Fine-Tuning (PEFT) (commit 27bd7f)
- Update Torch and TensorFlow (commit f54817)
- Add Harness evaluation for PyTorch text-generation/language modeling (commit 736921, c7c557, b492f5) and ONNX (commit a944fa)
- Add summarization evaluation for PyTorch (commit 062e62)
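For the LoRA and PEFT items above, a minimal fine-tuning sketch with Hugging Face transformers and peft is shown below; the base model name and target modules are placeholders, and the release's chatbot scripts wrap a similar flow with their own arguments and data pipeline.

```python
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer
from peft import LoraConfig, TaskType, get_peft_model

model_name = "facebook/opt-2.7b"                        # placeholder base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name, torch_dtype=torch.float32)

lora_config = LoraConfig(
    task_type=TaskType.CAUSAL_LM,
    r=8,                                                # adapter rank
    lora_alpha=16,
    lora_dropout=0.05,
    target_modules=["q_proj", "v_proj"],                # depends on the base architecture
)
model = get_peft_model(model, lora_config)
model.print_trainable_parameters()                      # only the adapters are trainable
# ...then fine-tune with transformers.Trainer or a custom training loop as usual.
```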
Examples
- Early Exit: TangoBERT, Separating Weights for Early-Exit Transformers (SWEET) (commit dfbdc5, c0eaa5)
- Electra fp32 & bf16 inference (commit e09c96)
- GPT-NeoX and Dolly-v2-7B text-generation inference (commit 402bb9)
- Stable Diffusion v2.1 inference (commit 5affab), image-to-image (commit a13e11), and inference with dynamic quantization (commit bfcb2e)
- ONNX Whisper-large quantization (commit 038be0); see the ONNX quantization sketch after this list
- 8-layer MiniLM inference (commit 0dd104)
- Add compression-aware training (commit dfb53f), sparse-aware training (commit 7b28ef), and fine-tuning and inference workflows (commit bf666c)
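For the ONNX Whisper quantization example above, a generic dynamic-quantization sketch with ONNX Runtime is shown below; the release's example may use different tooling (e.g. Intel Neural Compressor), and the file paths are placeholders.

```python
# Generic ONNX Runtime dynamic quantization: weights are quantized to INT8 offline,
# while activations are quantized on the fly at inference time.
from onnxruntime.quantization import QuantType, quantize_dynamic

quantize_dynamic(
    model_input="whisper-large-encoder.onnx",        # placeholder: exported FP32 model
    model_output="whisper-large-encoder-int8.onnx",
    weight_type=QuantType.QInt8,
)
```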
Bug Fixing
Intel® Extension for Transformers v1.0.1 Release
- Bug Fixing
- Improvement
Bug Fixing
Improvement
- Enable new fusion patterns for GPT-J (commit c73605)
- Refine Chatbot data loading and data cleaning (commit f70205, 0997ac)
Validated Configurations
- CentOS 8.4 & Ubuntu 22.04 & Windows 10
- Python 3.8, 3.9
- TensorFlow 2.10.1
- PyTorch 1.13.1+cpu
- Intel® Extension for PyTorch 1.13.1+cpu
Intel® Extension for Transformers v1.0.0 Release
- Highlights
- Features
- Productivity
- Examples
- Bug Fixing
- Documentation
Highlights
- Provide optimal model packages for large language models (LLMs) such as GPT-J, GPT-NeoX, T5-large/base, Flan-T5, and Stable Diffusion
- Provide the end-to-end optimized workflows such as SetFit-based sentiment analysis, Document Level Sentiment Analysis (DLSA), and Length Adaptive Transformer for inference
- Support NeuralChat, a customized chatbot based on domain-knowledge fine-tuning, and demonstrate fine-tuning in less than one hour with PEFT on 4 SPR nodes
- Demonstrate the industry-leading sparse model inference solution in the MLPerf v3.0 open submission, with up to 1.6x better performance than other submissions
Features
- Model Optimization
- LLM quantization including GPT-J (6B), GPT-NEOX (2.7B), T5-large, T5-base, Flan-T5, BLOOM-176B
- Enable basic Neural Architecture Search (commit 6cae)
- Transformers-accelerated Neural Engine
- Transformers-accelerated Libraries
Productivity
- Support native PyTorch model as input of Neural Engine (commit bc38)
- Refine the Benchmark API to provide apples-to-apples benchmarking capability (commit e135)
- Simplify end-to-end example usage (commit 6b9c)
- N in M / N x M PyTorch pruning API enhancement (commit da4d); see the structured-sparsity sketch after this list
- Deliver engine-only wheel with 60% size reduction (commit 02ac)
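The N in M / N x M pruning item above refers to structured sparsity where, within every group of M consecutive weights, only N may remain non-zero. The pure-PyTorch sketch below builds such a mask for the common 2:4 pattern; it illustrates the constraint only, not the repository's pruning API.

```python
import torch

def n_in_m_mask(w: torch.Tensor, n: int = 2, m: int = 4) -> torch.Tensor:
    """Keep the n largest-magnitude weights in every group of m consecutive weights."""
    groups = w.reshape(-1, m)
    # indices of the (m - n) smallest-magnitude weights in each group
    drop = groups.abs().topk(m - n, dim=1, largest=False).indices
    mask = torch.ones_like(groups)
    mask.scatter_(1, drop, 0.0)
    return mask.reshape(w.shape)

w = torch.randn(128, 256)
mask = n_in_m_mask(w)
print((mask.reshape(-1, 4).sum(dim=1) == 2).all())   # every group keeps exactly 2 weights
```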
Examples
- End-to-end solution for Length Adaptive with Neural Engine, achieving over 11x speedup compared with BERT-Base on SPR (commit 95c6)
- End-to-end Document Level Sentiment Analysis (DLSA) workflow (commit 154a)
- N in M / N x M BERT-Large and BERT-Base pruning in PyTorch (commit da4d)
- Sparse pruning example for Longformer with 80% sparsity (commit 5c5a)
- Distillation for quantization for BERT and Stable Diffusion (commit 8856, 4457)
- Smooth quantization with BLOOM (commit edc9); see the smoothing sketch after this list
- Longformer quantization with question-answering task (commit 8805)
- Provide SetFit workflow notebook (commit 6b9c, 2851)
- Support Text Generation task (commit c593)
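For the smooth-quantization example above, the core idea (as in the SmoothQuant technique) is to migrate activation outliers into the weights with a per-channel scale before quantization. The pure-PyTorch sketch below only computes and applies the smoothing scales; the alpha value and tensor shapes are illustrative.

```python
# Per-channel smoothing scale: s_j = max|X_j|^alpha / max|W_j|^(1 - alpha),
# then quantize X / s and W * s while keeping X @ W.T mathematically unchanged.
import torch

def smooth_scales(act: torch.Tensor, weight: torch.Tensor, alpha: float = 0.5):
    # act: (tokens, in_features) calibration activations; weight: (out, in_features)
    act_max = act.abs().amax(dim=0)          # per input channel
    w_max = weight.abs().amax(dim=0)
    return (act_max.pow(alpha) / w_max.pow(1 - alpha)).clamp(min=1e-5)

act = torch.randn(512, 768) * torch.rand(768) * 10   # simulate outlier channels
weight = torch.randn(3072, 768)
s = smooth_scales(act, weight)
act_smoothed, weight_smoothed = act / s, weight * s
# (act / s) @ (weight * s).T equals act @ weight.T, but act / s is easier to quantize.
print((act @ weight.T - act_smoothed @ weight_smoothed.T).abs().max())  # ~ float noise
```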
Bug Fixing
- Improve BERT QAT tuning duration (commit 6b9c)
- Fix Length Adaptive Transformer regression (commit 5473)
- Fix accelerated lib compile error when enabling VTune (commit b5cd)
Documentation
- Refine contents of all readme files
- API helper based on the GitHub.io page (commit e107)
- DevCatalog for Mt. Whitney (commit acb6)
Validated Configurations
- CentOS 8.4 & Ubuntu 20.04 & Windows 10
- Python 3.7, 3.8, 3.9, 3.10
- Intel® Extension for TensorFlow 2.10.1, 2.11.0
- PyTorch 1.12.0+cpu, 1.13.0+cpu
- Intel® Extension for PyTorch 1.12.0+cpu, 1.13.0+cpu
Intel® Extension for Transformers v1.0b Release
- Highlights
- Features
- Productivity
- Examples
- Bug Fixing
- Documentation
Highlights
- Intel® Extension for Transformers provides more compression examples for popular applications such as Stable Diffusion. For Stable Diffusion, we support INT8 quantization with PyTorch and BF16 fine-tuning with Intel® Extension for PyTorch (a minimal BF16 sketch follows).
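A minimal sketch of the BF16 training path with Intel® Extension for PyTorch is shown below, using a toy module in place of the diffusion model; the release's Stable Diffusion recipe wraps a similar ipex.optimize + autocast flow inside a full training loop with diffusers.

```python
import torch
import intel_extension_for_pytorch as ipex

# Toy module standing in for the diffusion model; the real recipe uses diffusers.
model = torch.nn.Sequential(torch.nn.Linear(512, 512), torch.nn.GELU())
optimizer = torch.optim.AdamW(model.parameters(), lr=1e-4)

# For training, ipex.optimize prepares both the model and the optimizer for BF16.
model, optimizer = ipex.optimize(model, optimizer=optimizer, dtype=torch.bfloat16)

x, target = torch.randn(8, 512), torch.randn(8, 512)
with torch.cpu.amp.autocast(dtype=torch.bfloat16):
    out = model(x)
loss = torch.nn.functional.mse_loss(out.float(), target)
loss.backward()
optimizer.step()
```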
Features
- Pruning/Sparsity
- Transformers-accelerated Neural Engine
- Support inference on Windows (fc580d5)
- Transformers-accelerated Libraries
- Support INT8 Softmax operator (fece837)
Productivity
- Simplify the integration with Alibaba BladeDISC
Examples
- Support INT8 quantization for large language model (T5-base example) with PyTorch
- Support INT8 Vision Transformer examples (ViT-base and ViT-large) in Neural Engine
- Support FP32 LAT example in Neural Engine
- Support INT8 quantization of the top 5 Hugging Face TensorFlow models
Bug Fixing
- Fix Protobuf and ONNX version dependency issue
- Fix memory leak in Neural Engine
Documentation
- Create Notebook for Pruning/Compression Orchestration/IPEX Quantization
- Refine the user guide and compression example
Validated Configurations
- CentOS 8.4 & Ubuntu 20.04 & Windows 10
- Python 3.7, 3.8, 3.9
- Intel® Extension for TensorFlow 2.9.1, 2.10.0
- PyTorch 1.11.0+cpu, 1.12.0+cpu, 1.13.0+cpu, Intel® Extension for PyTorch 1.12.0+cpu, 1.13.0+cpu
Intel® Extension for Transformers v1.0a Release
- Highlights
- Features
- Productivity
- Examples
Highlights
- Intel® Extension for Transformers provides a rich set of model compression techniques, leading sparsity-aware libraries, and a neural engine to accelerate the inference of Transformer-based models on Intel platforms. We published 2 papers at NeurIPS 2022 with the source code released:
- Fast DistilBERT on CPUs: outperforms the state-of-the-art Neural Magic DeepSparse runtime by up to 50%, and delivers 7x better performance on c6i.12xlarge (Ice Lake) than c6a.12xlarge (AMD Milan)
- QuaLA-MiniLM: outperforms BERT-base with ~3x smaller size and demonstrates up to 8.8x speedup with <1% accuracy loss on the SQuAD1.1 task
Features
- Pruning/Sparsity
- Support Distributed Pruning on PyTorch
- Support Distributed Pruning on TensorFlow
- Quantization
- Support Distributed Quantization on PyTorch
- Support Distributed Quantization on TensorFlow
- Distillation
- Support Distributed Distillation on PyTorch
- Support Distributed Distillation on TensorFlow
- Compression Orchestration
- Support Distributed Orchestration on PyTorch
- Neural Architecture Search (NAS)
- Support auto distillation with NAS and flash distillation on PyTorch
- Length Adaptive Transformer (LAT)
- Support Dynamic Transformer on SQuAD1.1 on PyTorch
- Transformers-accelerated Neural Engine
- Support inference with sparse GEMM fusion patterns
- Support automatic benchmarking of sparse and dense mixed model
- Transformers-accelerated Libraries
- Support 1x4 block-wise sparse VNNI-INT8 GEMM kernels with post-ops
- Support 1x16 block-wise sparse AMX-BF16 GEMM kernels with post-ops
Productivity
- Support seamless Transformers-extended APIs
- Support experimental model conversion from PyTorch INT8 model to ONNX INT8
- Support VTune performance tracing for sparse GEMM kernels
Examples
- LAT examples for MiniLM (NeurIPS’2022)
- Fast DistilBERT on CPUs (NeurIPS’2022)
- PyTorch distributed compression orchestration examples
- Post-training quantization with the Transformers non-trainer API
- PyTorch auto distillation (NAS based) examples
- Multiple examples of Quantization/Pruning/Distillation on PyTorch and TensorFlow
- Post-training static quantization examples via Intel® Extension for PyTorch, plus deployment examples; see the sketch after this list
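A hedged sketch of post-training static quantization with Intel® Extension for PyTorch is shown below, following the IPEX 1.13-era prepare/convert API; the toy model and random calibration data are placeholders, and the repository's examples add dataset-specific calibration and accuracy checks.

```python
import torch
import intel_extension_for_pytorch as ipex
from intel_extension_for_pytorch.quantization import convert, prepare

model = torch.nn.Sequential(
    torch.nn.Linear(128, 256), torch.nn.ReLU(), torch.nn.Linear(256, 10)
).eval()
example_input = torch.randn(4, 128)

qconfig = ipex.quantization.default_static_qconfig           # IPEX 1.13-era name
prepared = prepare(model, qconfig, example_inputs=example_input, inplace=False)

for _ in range(10):                                          # calibration with representative data
    prepared(torch.randn(4, 128))

quantized = convert(prepared)
with torch.no_grad():
    traced = torch.jit.trace(quantized, example_input)       # freeze for deployment
    traced = torch.jit.freeze(traced)
print(traced(example_input).shape)
```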
Validated Configurations
- CentOS 8.4 & Ubuntu 20.04
- Python 3.7, 3.8, 3.9, 3.10
- TensorFlow 2.9.1, 2.10.0, Intel® Extension for TensorFlow 2.9.1, 2.10.0
- PyTorch 1.12.0+cpu, 1.13.0+cpu, Intel® Extension for PyTorch 1.12.0