This repository provides scripts to train the YOLOX model on the Intel® Gaudi® AI accelerator to achieve state-of-the-art accuracy. To obtain model performance data, refer to the Intel Gaudi Model Performance Data page. For more information about training deep learning models using Gaudi, visit developer.habana.ai. Before you get started, make sure to review the Supported Configurations.
The YOLOX demo included in this release is YOLOX-S in Lazy mode training for different batch sizes with FP32 and BF16 mixed precision.
- Model-References
- Model Overview
- Setup
- Training Examples
- Supported Configurations
- Changelog
- Known Issues
YOLOX is an anchor-free object detector that adopts the architecture of YOLO with a DarkNet53 backbone. The anchor-free mechanism greatly reduces the number of model parameters and therefore simplifies the detector. YOLOX also improves on the previous YOLO series with a decoupled head, an advanced label assignment strategy, and strong data augmentation. The decoupled head contains a 1x1 conv layer, followed by two parallel branches with two 3x3 conv layers each for the classification and regression tasks respectively, which helps the model converge faster with better accuracy. The advanced label assignment, SimOTA, selects the top k predictions with the lowest cost as the positive samples for a ground-truth object. SimOTA not only reduces training time by approximating the assignment instead of solving a full optimization problem, but also improves the model's AP. Additionally, Mosaic and MixUp image augmentations are applied during training to further improve accuracy. Equipped with these advanced techniques, YOLOX achieves a remarkably better trade-off between training speed and accuracy than its counterparts across all model sizes.
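As a concrete illustration, below is a minimal PyTorch sketch of such a decoupled head. The channel count, activation, and output layout are illustrative assumptions, not the exact values used by this repository:

```python
import torch
import torch.nn as nn

class DecoupledHead(nn.Module):
    """Illustrative decoupled head: a 1x1 conv stem followed by two parallel
    branches (two 3x3 convs each) for classification and regression."""

    def __init__(self, in_channels=256, num_classes=80):
        super().__init__()
        self.stem = nn.Conv2d(in_channels, in_channels, kernel_size=1)
        # Classification branch: two 3x3 convs, then per-location class scores.
        self.cls_branch = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.SiLU(),
            nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.SiLU(),
        )
        self.cls_pred = nn.Conv2d(in_channels, num_classes, 1)
        # Regression branch: two 3x3 convs, then box (4) and objectness (1) outputs.
        self.reg_branch = nn.Sequential(
            nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.SiLU(),
            nn.Conv2d(in_channels, in_channels, 3, padding=1), nn.SiLU(),
        )
        self.reg_pred = nn.Conv2d(in_channels, 4, 1)
        self.obj_pred = nn.Conv2d(in_channels, 1, 1)

    def forward(self, x):
        x = self.stem(x)
        cls_feat = self.cls_branch(x)
        reg_feat = self.reg_branch(x)
        return self.cls_pred(cls_feat), self.reg_pred(reg_feat), self.obj_pred(reg_feat)

head = DecoupledHead()
cls_out, box_out, obj_out = head(torch.randn(1, 256, 20, 20))
print(cls_out.shape, box_out.shape, obj_out.shape)  # (1,80,20,20) (1,4,20,20) (1,1,20,20)
```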
This repository is a PyTorch implementation of YOLOX, based on the source code from https://github.com/Megvii-BaseDetection/YOLOX. More details can be found in the paper YOLOX: Exceeding YOLO Series in 2021 by Zheng Ge, Songtao Liu, Feng Wang, Zeming Li, and Jian Sun.
Please follow the instructions provided in the Gaudi Installation Guide to set up the environment, including the $PYTHON environment variable. To achieve the best performance, please follow the methods outlined in the Optimizing Training Platform Guide. These guides will walk you through the process of setting up your system to run the model on Gaudi.
In the Docker container, clone this repository and switch to the branch that matches your Intel Gaudi software version. You can run the hl-smi utility to determine the Intel Gaudi software version:
git clone -b [Intel Gaudi software version] https://github.com/HabanaAI/Model-References
Go to the PyTorch YOLOX directory:
cd Model-References/PyTorch/computer_vision/detection/yolox
Install the required packages and add the current directory to PYTHONPATH:
pip install -r requirements.txt
pip install -v -e .
export PYTHONPATH=$PWD:$PYTHONPATH
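As a quick sanity check of the environment, you can run a trivial op on the device. This is a minimal sketch; it assumes the Intel Gaudi PyTorch bridge set up by the installation guide is present:

```python
import torch
import habana_frameworks.torch.core as htcore  # Gaudi PyTorch bridge; registers the "hpu" device

# Move a tensor to the Gaudi device and run a trivial op.
x = torch.ones(2, 2).to("hpu")
y = x + x
htcore.mark_step()   # lazy mode: flush the accumulated graph
print(y.to("cpu"))   # expected: tensor([[2., 2.], [2., 2.]])
```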
Download the COCO 2017 dataset from http://cocodataset.org using the following commands:
cd Model-References/PyTorch/computer_vision/detection/yolox
source download_dataset.sh
You can either set the dataset location in the YOLOX_DATADIR environment variable:
export YOLOX_DATADIR=/data/COCO
Or create a datasets sub-directory and create a symbolic link from the COCO dataset path to it:
mkdir datasets
ln -s /data/COCO ./datasets/COCO
Alternatively, you can pass the COCO dataset location to the --data_dir argument of the training commands.
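Taken together, the dataset location is resolved roughly in the following order. Here is a sketch of that lookup logic; get_data_dir is a hypothetical helper for illustration, not this repository's exact source:

```python
import os

def get_data_dir(data_dir_arg=None):
    """Hypothetical helper illustrating the dataset lookup order described above."""
    # 1. An explicit --data_dir argument takes precedence.
    if data_dir_arg is not None:
        return data_dir_arg
    # 2. Otherwise, use the YOLOX_DATADIR environment variable if set.
    if "YOLOX_DATADIR" in os.environ:
        return os.environ["YOLOX_DATADIR"]
    # 3. Finally, fall back to the local datasets sub-directory (e.g. datasets/COCO).
    return os.path.join(os.getcwd(), "datasets")
```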
NOTE: YOLOX only supports Lazy mode.
Run training on 1 HPU:
- FP32 data type, train for 500 steps:

$PYTHON tools/train.py \
  --name yolox-s --devices 1 --batch-size 16 --data_dir /data/COCO --hpu steps 500 output_dir ./yolox_output
- BF16 data type, train for 500 steps:

PT_HPU_AUTOCAST_LOWER_PRECISION_OPS_LIST=ops_bf16_yolox.txt PT_HPU_AUTOCAST_FP32_OPS_LIST=ops_fp32_yolox.txt $PYTHON tools/train.py \
  --name yolox-s --devices 1 --batch-size 16 --data_dir /data/COCO --hpu --autocast \
  steps 500 output_dir ./yolox_output
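For reference, the --autocast flag enables native PyTorch autocast on the HPU, and the two environment variables above override the default lists of ops run in lower precision and in FP32. A minimal sketch of the equivalent autocast context (illustrative, not the repository's exact training code):

```python
import torch

def forward_with_autocast(model, images):
    # Ops on the PT_HPU_AUTOCAST_LOWER_PRECISION_OPS_LIST run in BF16, ops on
    # the PT_HPU_AUTOCAST_FP32_OPS_LIST stay in FP32; everything else follows
    # autocast's standard promotion rules.
    with torch.autocast(device_type="hpu", dtype=torch.bfloat16):
        return model(images)
```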
Run training on 8 HPUs:
NOTE: The mpirun --map-by PE attribute value may vary on your setup. For the recommended calculation, refer to the instructions detailed in mpirun Configuration.
- FP32 data type, train for 2 epochs:

export MASTER_ADDR=localhost
export MASTER_PORT=12355
mpirun -n 8 --bind-to core --map-by socket:PE=6 --rank-by core --report-bindings --allow-run-as-root \
  $PYTHON tools/train.py \
  --name yolox-s --devices 8 --batch-size 128 --data_dir /data/COCO --hpu max_epoch 2 output_dir ./yolox_output
- BF16 data type, train for 2 epochs:

export MASTER_ADDR=localhost
export MASTER_PORT=12355
PT_HPU_AUTOCAST_LOWER_PRECISION_OPS_LIST=ops_bf16_yolox.txt PT_HPU_AUTOCAST_FP32_OPS_LIST=ops_fp32_yolox.txt mpirun -n 8 --bind-to core --map-by socket:PE=6 --rank-by core --report-bindings --allow-run-as-root \
  $PYTHON tools/train.py \
  --name yolox-s --devices 8 --batch-size 128 --data_dir /data/COCO --hpu --autocast \
  max_epoch 2 output_dir ./yolox_output
- BF16 data type, train for 300 epochs:

export MASTER_ADDR=localhost
export MASTER_PORT=12355
PT_HPU_AUTOCAST_LOWER_PRECISION_OPS_LIST=ops_bf16_yolox.txt PT_HPU_AUTOCAST_FP32_OPS_LIST=ops_fp32_yolox.txt mpirun -n 8 --bind-to core --map-by socket:PE=6 --rank-by core --report-bindings --allow-run-as-root \
  $PYTHON tools/train.py \
  --name yolox-s --devices 8 --batch-size 128 --data_dir /data/COCO --hpu --autocast \
  print_interval 100 max_epoch 300 save_history_ckpt False eval_interval 300 output_dir ./yolox_output
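For reference, under mpirun each of the 8 ranks initializes PyTorch distributed with the HCCL backend, using the MASTER_ADDR and MASTER_PORT values exported above. A minimal sketch of that initialization, based on the documented Intel Gaudi distributed API:

```python
import torch.distributed as dist

# Importing from this module registers the HCCL process-group backend with PyTorch
# and exposes a helper that reads the rank/world-size set up by the launcher.
from habana_frameworks.torch.distributed.hccl import initialize_distributed_hpu

world_size, rank, local_rank = initialize_distributed_hpu()
dist.init_process_group(backend="hccl", world_size=world_size, rank=rank)
```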
| Device | Intel Gaudi Software Version | PyTorch Version |
|--------|------------------------------|-----------------|
| Gaudi  | 1.18.0                       | 2.4.0           |
- Removed PT_HPU_LAZY_MODE environment variable.
- Removed flag use_lazy_mode.
- Removed HMP data type.
- Updated run commands to allow overriding the default lower precision and FP32 ops lists.
- Enabled mixed precision training using PyTorch autocast on Gaudi.
The following are the changes made to the training scripts:
- Added source code to enable training on CPU.
- Added source code to support Gaudi devices.
- Enabled HMP data type.
- Added support to run training in Lazy mode.
- Re-implemented the loss function with TorchScript and deployed it to the CPU.
- Enabled distributed training with the HCCL backend on 8 HPUs.
- mark_step() is called to trigger execution (see the sketch below).
- Eager mode is not supported.
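For reference, a minimal sketch of where mark_step() sits in a lazy-mode training iteration; model, loader, and optimizer are placeholders, not this repository's classes:

```python
import habana_frameworks.torch.core as htcore

def train_one_epoch(model, loader, optimizer):
    model.train()
    for images, targets in loader:
        images = images.to("hpu")
        loss = model(images, targets)  # placeholder: model returns its training loss
        optimizer.zero_grad()
        loss.backward()
        htcore.mark_step()             # trigger execution of the backward graph
        optimizer.step()
        htcore.mark_step()             # trigger execution of the optimizer step
```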