This folder contains scripts to fine-tune the BART model on the Intel® Gaudi® AI accelerator. To obtain model performance data, refer to the Intel Gaudi Model Performance Data page. Before you get started, make sure to review the Supported Configurations.
For more information about training deep learning models using Gaudi, visit developer.habana.ai.
- Model-References
- Model Overview
- Setup
- Training Examples
- Supported Configurations
- Changelog
- Known Issues
BART (Bidirectional and Auto-Regressive Transformers) was proposed in the paper Denoising Sequence-to-Sequence Pre-training for Natural Language Generation, Translation, and Comprehension (ACL 2020). It is a denoising autoencoder that maps a corrupted document to the original document it was derived from. BART is implemented as a sequence-to-sequence model with a bidirectional encoder over corrupted text and a left-to-right autoregressive decoder. According to the paper, BART's architecture is related to that used in BERT, with these differences: (1) each layer of the decoder additionally performs cross-attention over the final hidden layer of the encoder; and (2) BERT uses an additional feed-forward network before word prediction, which BART does not. BART contains roughly 10% more parameters than the equivalently sized BERT model.
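For illustration only, the sketch below exercises the encoder-decoder layout described above through the Hugging Face transformers package. Neither the package nor the facebook/bart-base checkpoint is part of this demo's workflow; they are just a convenient stand-in.

```python
# Illustration only: exercises BART's bidirectional encoder and left-to-right
# autoregressive decoder through Hugging Face transformers. This package and
# checkpoint are not part of this demo's workflow.
from transformers import BartForConditionalGeneration, BartTokenizer

tokenizer = BartTokenizer.from_pretrained("facebook/bart-base")
model = BartForConditionalGeneration.from_pretrained("facebook/bart-base")

# The encoder reads the input bidirectionally; every decoder layer attends to
# the encoder's final hidden states via cross-attention while generating
# tokens left to right.
inputs = tokenizer(
    "BART maps a corrupted document back to the original document.",
    return_tensors="pt",
)
output_ids = model.generate(**inputs, num_beams=4, max_length=32)
print(tokenizer.decode(output_ids[0], skip_special_tokens=True))
```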
- Suited for tasks:
- Text paraphrasing: The model aims to generate paraphrases of the given input sentence.
- Text summarization: The model aims to generate a summary of the given input sentence.
- Uses the FusedAdamW optimizer (AdamW: Adam with decoupled weight decay regularization); a usage sketch follows this list.
- Starts from pre-trained model weights.
- Light-weight: The training takes a few minutes.
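A minimal sketch of the fused optimizer and fused gradient clipping on Gaudi. The habana_frameworks.torch.hpex module paths follow recent Intel Gaudi software releases and should be treated as an assumption; the learning rate, weight decay, and model are placeholders, not the demo's defaults.

```python
import torch
import habana_frameworks.torch.core as htcore  # registers the "hpu" device
# Module paths follow recent Intel Gaudi PyTorch releases; treat as an assumption.
from habana_frameworks.torch.hpex.optimizers import FusedAdamW
from habana_frameworks.torch.hpex.normalization import FusedClipNorm

model = torch.nn.Linear(16, 16).to("hpu")  # placeholder model, not BART
optimizer = FusedAdamW(model.parameters(), lr=4e-5, weight_decay=0.01)  # placeholder values
fused_clip = FusedClipNorm(model.parameters(), 1.0)  # max grad norm of 1.0

loss = model(torch.randn(8, 16, device="hpu")).pow(2).mean()
loss.backward()
fused_clip.clip_norm(model.parameters())  # fused gradient clipping
optimizer.step()
htcore.mark_step()  # lazy mode: execute the accumulated graph
```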
The BART demo uses training scripts from the simpletransformers library: https://github.com/ThilinaRajapakse/simpletransformers.
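For orientation, the upstream simpletransformers seq2seq API looks roughly like the sketch below (CPU/GPU style). The Gaudi-specific options such as --use_habana are wired in by the patched scripts in this folder, and the checkpoint name and hyperparameter values here are placeholders mirroring the command-line examples further down.

```python
# Sketch of the upstream simpletransformers seq2seq API (CPU/GPU style);
# the Gaudi-specific options are added by the patched scripts in this folder.
import pandas as pd
from simpletransformers.seq2seq import Seq2SeqArgs, Seq2SeqModel

# The seq2seq API expects "input_text"/"target_text" columns.
train_df = pd.DataFrame(
    [["How do I learn Python?", "What is the best way to learn Python?"]],
    columns=["input_text", "target_text"],
)

model_args = Seq2SeqArgs()
model_args.max_seq_length = 128      # placeholder values mirroring the
model_args.train_batch_size = 32     # command-line examples below
model_args.num_train_epochs = 5

model = Seq2SeqModel(
    encoder_decoder_type="bart",
    encoder_decoder_name="facebook/bart-large",  # example checkpoint
    args=model_args,
    use_cuda=False,
)
model.train_model(train_df)
```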
Please follow the instructions provided in the Gaudi Installation Guide
to set up the environment including the $PYTHON
environment variable. To achieve the best performance, please follow the methods outlined in the Optimizing Training Platform Guide.
The guides will walk you through the process of setting up your system to run the model on Gaudi.
In the docker container, clone this repository and switch to the branch that matches your Intel Gaudi software version. You can run the hl-smi utility to determine the Intel Gaudi software version.
git clone -b [Intel Gaudi software] https://github.com/HabanaAI/Model-References
Then, navigate to the BART model directory:
cd Model-References/PyTorch/nlp/BART/simpletransformers
Install the Python packages required for fine-tuning:
cd Model-References/PyTorch/nlp/BART/simpletransformers
pip install -e .
pip install bert_score
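A quick, optional sanity check that the editable install and the scoring dependency are importable:

```python
# Both imports should succeed after "pip install -e ." and "pip install bert_score".
import simpletransformers
import bert_score

print("simpletransformers and bert_score are importable")
```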
Public datasets can be downloaded with this script:
bash ./examples/seq2seq/paraphrasing/data_download.sh
Note: Going forward, it is assumed that the dataset is located in the ./data directory.
Run training on 1 HPU - Lazy mode:
- 1 HPU, BART fine-tuning on the dataset using BF16 mixed precision (an autocast sketch follows these examples):
PT_HPU_AUTOCAST_LOWER_PRECISION_OPS_LIST=ops_bf16_bart.txt PT_HPU_AUTOCAST_FP32_OPS_LIST=ops_fp32_bart.txt $PYTHON examples/seq2seq/paraphrasing/train.py --use_habana --no_cuda --use_fused_adam --use_fused_clip_norm --max_seq_length 128 --train_batch_size 32 --num_train_epochs 5 --logging_steps 50 --save_best_model --output_dir output --bf16 autocast
- 1 HPU, BART fine-tuning on the dataset using FP32 data type:
$PYTHON examples/seq2seq/paraphrasing/train.py --use_habana --no_cuda --use_fused_adam --use_fused_clip_norm --max_seq_length 128 --train_batch_size 32 --num_train_epochs 5 --logging_steps 50 --save_best_model --output_dir output
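Conceptually, the --bf16 autocast flag together with the two ops-list environment variables corresponds to running the model under native PyTorch autocast on the HPU device. A standalone sketch, assuming the habana_frameworks package is installed; the model here is a placeholder:

```python
import torch
import habana_frameworks.torch.core as htcore  # registers the "hpu" device

model = torch.nn.Linear(128, 128).to("hpu")
inputs = torch.randn(32, 128, device="hpu")

# Ops named in PT_HPU_AUTOCAST_LOWER_PRECISION_OPS_LIST run in BF16, ops named
# in PT_HPU_AUTOCAST_FP32_OPS_LIST stay in FP32; the rest follow the default
# autocast policy.
with torch.autocast(device_type="hpu", dtype=torch.bfloat16):
    out = model(inputs)

htcore.mark_step()  # lazy mode: execute the accumulated graph
print(out.dtype)    # expected: torch.bfloat16
```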
Run training on 8 HPUs:
To run the multi-card demo, make sure the host machine has 512 GB of RAM installed. Modify the docker run command to pass 8 Gaudi cards to the docker container so that it has access to all 8 cards required for multi-card training.
NOTE: The mpirun --map-by PE attribute value may vary with your setup. For the recommended calculation, refer to the instructions detailed in mpirun Configuration; a rough illustration follows.
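The sketch below is a hypothetical example of that calculation, treating PE as the number of physical CPU cores assigned to each rank; the authoritative formula is in the mpirun Configuration documentation.

```python
# Hypothetical host: 2 sockets x 24 physical cores, 8 ranks (one per card).
physical_cores_per_socket = 24
sockets = 2
ranks = 8

ranks_per_socket = ranks // sockets                 # 4
pe = physical_cores_per_socket // ranks_per_socket  # 6
print(f"--map-by socket:PE={pe}")
```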
- 8 HPUs on a single server, BF16, batch size 32, Lazy mode:
PT_HPU_AUTOCAST_LOWER_PRECISION_OPS_LIST=ops_bf16_bart.txt PT_HPU_AUTOCAST_FP32_OPS_LIST=ops_fp32_bart.txt mpirun -n 8 --bind-to core --map-by socket:PE=6 --rank-by core --report-bindings --allow-run-as-root $PYTHON examples/seq2seq/paraphrasing/train.py --use_habana --no_cuda --use_fused_adam --use_fused_clip_norm --max_seq_length 128 --train_batch_size 32 --num_train_epochs 5 --logging_steps 50 --save_best_model --output_dir /tmp/multicards --bf16 autocast --distributed
- 8 HPUs on a single server, FP32, batch size 32, Lazy mode:
mpirun -n 8 --bind-to core --map-by socket:PE=6 --rank-by core --report-bindings --allow-run-as-root $PYTHON examples/seq2seq/paraphrasing/train.py --use_habana --no_cuda --use_fused_adam --use_fused_clip_norm --max_seq_length 128 --train_batch_size 32 --num_train_epochs 5 --logging_steps 50 --save_best_model --output_dir /tmp/multicards --distributed
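Under mpirun, each rank initializes the HCCL process group roughly as sketched below; the exact wiring lives in the patched seq2seq_model.py and train.py, and the module path is an assumption that may differ across Intel Gaudi software releases.

```python
import torch.distributed as dist
# Importing this module also registers the "hccl" backend; the module path is
# an assumption and may differ across Intel Gaudi software releases.
from habana_frameworks.torch.distributed.hccl import initialize_distributed_hpu

# mpirun exports the rank/size information; MASTER_ADDR and MASTER_PORT are
# assumed to be set in the environment.
world_size, rank, local_rank = initialize_distributed_hpu()
dist.init_process_group(backend="hccl", rank=rank, world_size=world_size)

print(f"rank {rank}/{world_size} (local rank {local_rank}) initialized")
```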
| Device | Intel Gaudi Software Version | PyTorch Version |
|--------|------------------------------|-----------------|
| Gaudi  | 1.16.2                       | 2.2.2           |
- Eager mode support is deprecated.
- Removed PT_HPU_LAZY_MODE environment variable.
- Removed flag lazy_mode.
- Removed HMP; switched to autocast.
- Updated run commands.
- Enabled PyTorch autocast on Gaudi.
- Changed BART distributed API to initialize_distributed_hpu.
- Removed unnecessary mark_step.
- Removed wrapper script run_bart.py.
- Added support for reducing the print frequency of Running Loss to the frequency of logging_steps.
The following modifications were made to the scripts and the simpletransformers source:
- Added Gaudi support (seq2seq_model.py).
- Modifications for saving checkpoint: Bring tensors to CPU and save (seq2seq_model.py).
- Introduced BF16 mixed precision, adding ops lists for BF16 and FP32 (seq2seq_model.py, ops_bf16_bart.txt, ops_fp32_bart.txt).
- Added support for disabling HMP around optimizer.step (seq2seq_model.py).
- Use fused AdamW optimizer on Gaudi device (seq2seq_model.py, train.py).
- Use fused clip norm for grad clipping on Gaudi device (seq2seq_model.py, train.py).
- Modified training script to use mpirun for distributed training (train.py).
- Gradients are used as views via gradient_as_bucket_view (seq2seq_model.py); see the sketch after this list.
- Default allreduce bucket size set to 200 MB for better performance in distributed training (seq2seq_model.py).
- Added changes to support Lazy mode with required mark_step (seq2seq_model.py).
- Only print and save in the master process (seq2seq_model.py).
- Added prediction (sentence generation) metrics (seq2seq_model.py).
- Modified training script to use habana_dataloader (seq2seq_model.py).
- Added data_dir as an input argument for the data directory.
- Added this README.
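A hedged sketch of the DistributedDataParallel settings and lazy-mode step boundaries referred to in the list above, using standard PyTorch and habana_frameworks APIs; the model and batch are placeholders, not the actual BART training loop.

```python
import torch
import habana_frameworks.torch.core as htcore

def wrap_for_distributed(model: torch.nn.Module) -> torch.nn.Module:
    # Assumes the HCCL process group has already been initialized.
    return torch.nn.parallel.DistributedDataParallel(
        model.to("hpu"),
        gradient_as_bucket_view=True,  # gradients exposed as bucket views
        bucket_cap_mb=200,             # 200 MB allreduce buckets
    )

def training_step(model, optimizer, batch):
    loss = model(**batch).loss         # placeholder: Hugging Face-style output
    loss.backward()
    htcore.mark_step()                 # lazy mode: execute the backward graph
    optimizer.step()
    htcore.mark_step()                 # lazy mode: execute the optimizer update
    optimizer.zero_grad()
```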
- Placing mark_step() arbitrarily may lead to undefined behavior. It is recommended to keep mark_step() as shown in the provided scripts.
- Sentence generation (prediction) is not enabled in this release. We plan to enable it in the next release.