Intel® Extension for Transformers v1.0a Release
Pre-release
- Highlights
- Features
- Productivity
- Examples
- Validated Configurations
Highlights
- Intel® Extension for Transformers provides a rich set of model compression techniques, leading sparsity-aware libraries, and a neural engine to accelerate the inference of Transformer-based models on Intel platforms; a minimal quantization quick-start sketch follows these highlights. We published 2 papers at NeurIPS’2022 with the source code released:
- Fast DistilBERT on CPUs: outperforms the state-of-the-art Neural Magic DeepSparse runtime by up to 50%, and delivers 7x better performance on c6i.12xlarge (Ice Lake) than on c6a.12xlarge (AMD Milan)
- QuaLA-MiniLM: outperforms BERT-base at ~3x smaller size and demonstrates up to 8.8x speedup with <1% accuracy loss on the SQuAD1.1 task
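As a quick orientation for the Transformers-extended API referenced throughout these notes, the sketch below applies post-training static INT8 quantization to a fine-tuned classifier. It is a minimal sketch, not a verified recipe: the module paths, class names, and the `quantize` entry point are assumptions drawn from the project documentation of this period and may differ in your installed version.

```python
# Hedged sketch: intel_extension_for_transformers module paths, class names,
# and the quantize() entry point are assumptions based on the project docs of
# this release and may differ in your installed version.
import numpy as np
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

from intel_extension_for_transformers.optimization import (
    QuantizationConfig, metrics, objectives)
from intel_extension_for_transformers.optimization.trainer import NLPTrainer

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tokenize a small GLUE task for calibration and accuracy tracking.
raw = load_dataset("glue", "sst2")
encoded = raw.map(
    lambda e: tokenizer(e["sentence"], truncation=True, padding="max_length",
                        max_length=128),
    batched=True,
)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    return {"accuracy": float((np.argmax(logits, axis=-1) == labels).mean())}

# NLPTrainer is intended as a drop-in replacement for transformers.Trainer.
trainer = NLPTrainer(
    model=model,
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    compute_metrics=compute_metrics,
)

# Post-training static INT8 quantization, tuned to keep eval accuracy within
# a 1% relative drop while optimizing for performance.
q_config = QuantizationConfig(
    approach="PostTrainingStatic",
    metrics=[metrics.Metric(name="eval_accuracy", is_relative=True, criterion=0.01)],
    objectives=[objectives.performance],
)
quantized_model = trainer.quantize(quant_config=q_config)
```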
Features
- Pruning/Sparsity
- Support Distributed Pruning on PyTorch (a minimal distributed launch sketch follows this feature list)
- Support Distributed Pruning on TensorFlow
- Quantization
- Support Distributed Quantization on PyTorch
- Support Distributed Quantization on TensorFlow
- Distillation
- Support Distributed Distillation on PyTorch
- Support Distributed Distillation on TensorFlow
- Compression Orchestration
- Support Distributed Orchestration on PyTorch
- Neural Architecture Search (NAS)
- Support auto distillation with NAS and flash distillation on PyTorch
- Length Adaptive Transformer (LAT)
- Support Dynamic Transformer on SQuAD1.1 on PyTorch
- Transformers-accelerated Neural Engine
- Support inference with sparse GEMM fusion patterns
- Support automatic benchmarking of mixed sparse and dense models
- Transformers-accelerated Libraries
- Support 1x4 block-wise sparse VNNI-INT8 GEMM kernels with post-ops
- Support 1x16 block-wise sparse AMX-BF16 GEMM kernels with post-ops
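The distributed pruning, quantization, distillation, and orchestration features above run under PyTorch's standard distributed launcher. The sketch below is a minimal, hypothetical illustration of that launch pattern (the script name and the compression step are placeholders, not the repository's example scripts): each process initializes the default process group from the environment variables set by `torchrun` and then runs its share of the compression workload.

```python
# Minimal sketch of a distributed compression entry point (hypothetical script
# name run_compression.py); launch it with PyTorch's launcher, for example:
#   torchrun --nnodes=2 --nproc_per_node=1 run_compression.py
import torch.distributed as dist


def main():
    # torchrun sets RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT, so the
    # default env:// initialization is enough; gloo is the CPU-friendly backend.
    dist.init_process_group(backend="gloo")
    rank, world_size = dist.get_rank(), dist.get_world_size()
    print(f"Starting distributed compression on rank {rank} of {world_size}")

    # ... build the model and trainer here, then run pruning, quantization,
    # distillation, or an orchestrated combination across the ranks ...

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```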
Productivity
- Support seamless Transformers-extended APIs
- Support experimental model conversion from PyTorch INT8 to ONNX INT8
- Support VTune performance tracing for sparse GEMM kernels
Examples
- LAT examples for MiniLM (NeurIPS’2022)
- Fast DistilBERT on CPUs (NeurIPS’2022)
- PyTorch distributed compression orchestration examples
- Post-training quantization example using the Transformers non-trainer API
- PyTorch auto distillation (NAS based) examples
- Multiple examples of Quantization/Pruning/Distillation on PyTorch and TensorFlow
- Post-training static quantization examples via Intel® Extension for PyTorch, plus deployment examples (a minimal prepare/calibrate/convert sketch follows this list)
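For the Intel® Extension for PyTorch post-training static quantization examples above, the sketch below shows the basic prepare/calibrate/convert flow on a toy FP32 model standing in for a fine-tuned Transformer. The API names follow the Intel® Extension for PyTorch 1.12 quantization documentation as best understood here; treat them as assumptions and consult the deployment examples for the authoritative recipe.

```python
import torch
import torch.nn as nn
import intel_extension_for_pytorch as ipex
from intel_extension_for_pytorch.quantization import prepare, convert

# A toy FP32 model standing in for a fine-tuned Transformer classifier head.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 2)).eval()
example_input = torch.randn(8, 128)

# 1. Attach a static quantization configuration and insert observers.
qconfig = ipex.quantization.default_static_qconfig
prepared = prepare(model, qconfig, example_inputs=example_input, inplace=False)

# 2. Calibrate with representative data to collect activation ranges.
with torch.no_grad():
    for _ in range(10):
        prepared(torch.randn(8, 128))

# 3. Convert to an INT8 model and freeze a TorchScript graph for deployment.
converted = convert(prepared)
with torch.no_grad():
    traced = torch.jit.trace(converted, example_input)
    traced = torch.jit.freeze(traced)

print(traced(example_input).shape)  # torch.Size([8, 2])
```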
Validated Configurations
- CentOS 8.4 & Ubuntu 20.04
- Python 3.7, 3.8, 3.9, 3.10
- TensorFlow 2.9.1, 2.10.0, Intel® Extension for TensorFlow 2.9.1, 2.10.0
- PyTorch 1.12.0+cpu, 1.13.0+cpu, Intel® Extension for PyTorch 1.12.0