This repository contains benchmark code of int8 inference speed of TVM for the blog post Automating Optimization of Quantized Deep Learning Models on CUDA. The benchmark of MXNet and TensorRT is provided as baseline.
The benchmark is conducted using tvm@e22b58. (This is an outdated version. Please checkout this branch to run with recent tvm version.)
LLVM and CUDA need to be enabled.
Compute Capability 6.1 CUDA device is required to support the dp4a
instruction.
We only provide auto-tuning logs on NVIDIA GTX 1080. To run on other devices, you can follow the AutoTVM tutorial to run auto-tuning.
python3 run_tvm.py --log_file logs/history_best_1080.log
MXNet 1.4, cuDNN 7.3+ are required.
python3 run_mxnet.py
TensorRT 5 is required. We use onnx models as input. The onnx models will be generated from MXNet when running the benchmark script.
cd TensorRT; make; cd -;
python3 run_tensorrt.py