This repository has been archived by the owner on Oct 25, 2024. It is now read-only.

Intel® Extension for Transformers v1.0a Release

Pre-release
@kevinintel released this 23 Nov 16:23
· 4 commits to v0.4rc2 since this release
59544b0
  • Highlights
  • Features
  • Productivity
  • Examples

Highlights

  • Intel® Extension for Transformers provides a rich set of model compression techniques, leading sparsity-aware libraries, and a neural engine to accelerate the inference of Transformer-based models on Intel platforms. We published two papers at NeurIPS 2022, with the source code released:
    • Fast DistilBERT on CPUs: outperforms Neural Magic's state-of-the-art DeepSparse runtime by up to 50%, and delivers 7x better performance on c6i.12xlarge (Ice Lake) than on c6a.12xlarge (AMD Milan)
    • QuaLA-MiniLM: outperforms BERT-base at ~3x smaller size and demonstrates up to 8.8x speedup with <1% accuracy loss on the SQuAD1.1 task

Features

  • Pruning/Sparsity
    • Support Distributed Pruning on PyTorch
    • Support Distributed Pruning on TensorFlow
  • Quantization
    • Support Distributed Quantization on PyTorch
    • Support Distributed Quantization on TensorFlow
  • Distillation
    • Support Distributed Distillation on PyTorch
    • Support Distributed Distillation on TensorFlow
  • Compression Orchestration
    • Support Distributed Orchestration on PyTorch
  • Neural Architecture Search (NAS)
    • Support auto distillation with NAS and flash distillation on PyTorch
  • Length Adaptive Transformer (LAT)
    • Support Dynamic Transformer on SQuAD1.1 on PyTorch
  • Transformers-accelerated Neural Engine
    • Support inference with sparse GEMM fusion patterns
    • Support automatic benchmarking of mixed sparse and dense models
  • Transformers-accelerated Libraries
    • Support 1x4 block-wise sparse VNNI-INT8 GEMM kernels with post-ops
    • Support 1x16 block-wise sparse AMX-BF16 GEMM kernels with post-ops
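
To illustrate what "1x4 block-wise sparse" means for a weight matrix, here is a minimal sketch in plain Python. It is not the library's pruning code. The function name `prune_1x4_blocks`, the L1-norm scoring rule, and the choice to run each block along a row are all assumptions made for illustration; the actual kernels may lay out blocks differently.

```python
def prune_1x4_blocks(weight, sparsity):
    """Return a copy of `weight` with the lowest-magnitude 1x4 blocks zeroed.

    `weight` is a list of equal-length rows; `sparsity` is the fraction of
    blocks to zero. Hypothetical helper, for illustration only.
    """
    cols = len(weight[0])
    if cols % 4 != 0:
        raise ValueError("column count must be a multiple of 4")

    # Score each block (4 consecutive elements of one row) by its L1 norm.
    scored = []
    for r, row in enumerate(weight):
        for c in range(0, cols, 4):
            scored.append((sum(abs(v) for v in row[c:c + 4]), r, c))
    scored.sort()  # ascending: weakest blocks first

    # Zero whole blocks, never individual elements.
    n_prune = int(round(sparsity * len(scored)))
    pruned = [list(row) for row in weight]
    for _, r, c in scored[:n_prune]:
        pruned[r][c:c + 4] = [0.0] * 4
    return pruned


weight = [
    [0.1, -0.2, 0.1, 0.0, 2.0, -1.0, 0.5, 3.0],
    [1.0, 1.0, -1.0, 1.0, 0.0, 0.1, 0.0, -0.1],
]
pruned = prune_1x4_blocks(weight, sparsity=0.5)
```

Because zeros always come in aligned groups of four, a sparse kernel can skip an entire block at once instead of testing individual elements, which is the general motivation for block-wise patterns.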

Productivity

  • Support seamless Transformers-extended APIs
  • Support experimental model conversion from PyTorch INT8 model to ONNX INT8
  • Support VTune performance tracing for sparse GEMM kernels

Examples

Validated Configurations

  • CentOS 8.4 & Ubuntu 20.04
  • Python 3.7, 3.8, 3.9, 3.10
  • TensorFlow 2.9.1, 2.10.0, Intel® Extension for TensorFlow 2.9.1, 2.10.0
  • PyTorch 1.12.0+cpu, 1.13.0+cpu, Intel® Extension for PyTorch 1.12.0
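
As a small convenience, the sketch below checks whether the running interpreter is one of the Python versions listed above. The names `VALIDATED_PYTHON` and `is_validated_python` are hypothetical and not part of the package; a version outside the list may still work, it simply was not validated for this release.

```python
import sys

# Python versions listed under Validated Configurations for this release.
VALIDATED_PYTHON = {(3, 7), (3, 8), (3, 9), (3, 10)}


def is_validated_python(version_info=None):
    """Return True if (major, minor) is in the validated set.

    Defaults to the running interpreter when no version is given.
    """
    major, minor = (version_info or sys.version_info)[:2]
    return (major, minor) in VALIDATED_PYTHON


if not is_validated_python():
    print("Warning: this Python version is not in the validated list.")
```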