Intel® Extension for Transformers v1.0a Release
Pre-release
- Highlights
- Features
- Productivity
- Examples
- Validated Configurations
Highlights
- Intel® Extension for Transformers provides a rich set of model compression techniques, leading sparsity-aware libraries, and a neural engine to accelerate the inference of Transformer-based models on Intel platforms; a minimal quantization quick-start sketch follows these highlights. We published 2 papers at NeurIPS’2022 with the source code released:
- Fast DistilBERT on CPUs: outperforms the state-of-the-art Neural Magic DeepSparse runtime by up to 50%, and delivers 7x better performance on c6i.12xlarge (Ice Lake) than on c6a.12xlarge (AMD Milan)
- QuaLA-MiniLM: outperforms BERT-base at ~3x smaller size and demonstrates up to 8.8x speedup with <1% accuracy loss on the SQuAD1.1 task
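As a quick orientation for the Transformers-extended API referenced throughout these notes, the sketch below applies post-training static INT8 quantization to a fine-tuned classifier. It is a minimal sketch, not a verified recipe: the module paths, class names, and the `quantize` entry point are assumptions drawn from the project documentation of this period and may differ in your installed version.

```python
# Hedged sketch: intel_extension_for_transformers module paths, class names,
# and the quantize() entry point are assumptions based on the project docs of
# this release and may differ in your installed version.
import numpy as np
from datasets import load_dataset
from transformers import AutoModelForSequenceClassification, AutoTokenizer

from intel_extension_for_transformers.optimization import (
    QuantizationConfig, metrics, objectives)
from intel_extension_for_transformers.optimization.trainer import NLPTrainer

model_name = "distilbert-base-uncased-finetuned-sst-2-english"
model = AutoModelForSequenceClassification.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)

# Tokenize a small GLUE task for calibration and accuracy tracking.
raw = load_dataset("glue", "sst2")
encoded = raw.map(
    lambda e: tokenizer(e["sentence"], truncation=True, padding="max_length",
                        max_length=128),
    batched=True,
)

def compute_metrics(eval_pred):
    logits, labels = eval_pred
    return {"accuracy": float((np.argmax(logits, axis=-1) == labels).mean())}

# NLPTrainer is intended as a drop-in replacement for transformers.Trainer.
trainer = NLPTrainer(
    model=model,
    train_dataset=encoded["train"],
    eval_dataset=encoded["validation"],
    compute_metrics=compute_metrics,
)

# Post-training static INT8 quantization, tuned to keep eval accuracy within
# a 1% relative drop while optimizing for performance.
q_config = QuantizationConfig(
    approach="PostTrainingStatic",
    metrics=[metrics.Metric(name="eval_accuracy", is_relative=True, criterion=0.01)],
    objectives=[objectives.performance],
)
quantized_model = trainer.quantize(quant_config=q_config)
```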
Features
- Pruning/Sparsity
- Support Distributed Pruning on PyTorch (a minimal distributed launch sketch follows this feature list)
- Support Distributed Pruning on TensorFlow
- Quantization
- Support Distributed Quantization on PyTorch
- Support Distributed Quantization on TensorFlow
- Distillation
- Support Distributed Distillation on PyTorch
- Support Distributed Distillation on TensorFlow
- Compression Orchestration
- Support Distributed Orchestration on PyTorch
- Neural Architecture Search (NAS)
- Support auto distillation with NAS and flash distillation on PyTorch
- Length Adaptive Transformer (LAT)
- Support Dynamic Transformer on SQuAD1.1 on PyTorch
- Transformers-accelerated Neural Engine
- Support inference with sparse GEMM fusion patterns
- Support automatic benchmarking of mixed sparse and dense models
- Transformers-accelerated Libraries
- Support 1x4 block-wise sparse VNNI-INT8 GEMM kernels with post-ops
- Support 1x16 block-wise sparse AMX-BF16 GEMM kernels with post-ops
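The distributed pruning, quantization, distillation, and orchestration features above run under PyTorch's standard distributed launcher. The sketch below is a minimal, hypothetical illustration of that launch pattern (the script name and the compression step are placeholders, not the repository's example scripts): each process initializes the default process group from the environment variables set by `torchrun` and then runs its share of the compression workload.

```python
# Minimal sketch of a distributed compression entry point (hypothetical script
# name run_compression.py); launch it with PyTorch's launcher, for example:
#   torchrun --nnodes=2 --nproc_per_node=1 run_compression.py
import torch.distributed as dist


def main():
    # torchrun sets RANK, WORLD_SIZE, MASTER_ADDR, and MASTER_PORT, so the
    # default env:// initialization is enough; gloo is the CPU-friendly backend.
    dist.init_process_group(backend="gloo")
    rank, world_size = dist.get_rank(), dist.get_world_size()
    print(f"Starting distributed compression on rank {rank} of {world_size}")

    # ... build the model and trainer here, then run pruning, quantization,
    # distillation, or an orchestrated combination across the ranks ...

    dist.destroy_process_group()


if __name__ == "__main__":
    main()
```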
Productivity
- Support seamless Transformers-extended APIs
- Support experimental model conversion from PyTorch INT8 to ONNX INT8
- Support VTune performance tracing for sparse GEMM kernels
Examples
- LAT examples for MiniLM (NeurIPS’2022)
- Fast DistilBERT on CPUs (NeurIPS’2022)
- PyTorch distributed compression orchestration examples
- Post-training quantization example using the Transformers non-trainer API
- PyTorch auto distillation (NAS based) examples
- Multiple examples of Quantization/Pruning/Distillation on PyTorch and TensorFlow
- Post-training static quantization examples via Intel® Extension for PyTorch, plus deployment examples (a minimal prepare/calibrate/convert sketch follows this list)
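For the Intel® Extension for PyTorch post-training static quantization examples above, the sketch below shows the basic prepare/calibrate/convert flow on a toy FP32 model standing in for a fine-tuned Transformer. The API names follow the Intel® Extension for PyTorch 1.12 quantization documentation as best understood here; treat them as assumptions and consult the deployment examples for the authoritative recipe.

```python
import torch
import torch.nn as nn
import intel_extension_for_pytorch as ipex
from intel_extension_for_pytorch.quantization import prepare, convert

# A toy FP32 model standing in for a fine-tuned Transformer classifier head.
model = nn.Sequential(nn.Linear(128, 256), nn.ReLU(), nn.Linear(256, 2)).eval()
example_input = torch.randn(8, 128)

# 1. Attach a static quantization configuration and insert observers.
qconfig = ipex.quantization.default_static_qconfig
prepared = prepare(model, qconfig, example_inputs=example_input, inplace=False)

# 2. Calibrate with representative data to collect activation ranges.
with torch.no_grad():
    for _ in range(10):
        prepared(torch.randn(8, 128))

# 3. Convert to an INT8 model and freeze a TorchScript graph for deployment.
converted = convert(prepared)
with torch.no_grad():
    traced = torch.jit.trace(converted, example_input)
    traced = torch.jit.freeze(traced)

print(traced(example_input).shape)  # torch.Size([8, 2])
```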
Validated Configurations
- CentOS 8.4 & Ubuntu 20.04
- Python 3.7, 3.8, 3.9, 3.10
- TensorFlow 2.9.1, 2.10.0, Intel® Extension for TensorFlow 2.9.1, 2.10.0
- PyTorch 1.12.0+cpu, 1.13.0+cpu, Intel® Extension for PyTorch 1.12.0