This folder contains scripts to pre-train BERT-1.5B and BERT-5B models using DeepSpeed on the Intel® Gaudi® AI accelerator, with the goal of accurate training of large-scale models. To obtain model performance data, refer to the Intel Gaudi Model Performance Data page. Before you get started, make sure to review the Supported Configurations.
For more information about training deep learning models using Gaudi, visit developer.habana.ai.
- Model-References
- Model Overview
- Setup
- Pre-Training the Model
- Advanced
- Supported Configurations
- Changelog
- Known Issues
Bidirectional Encoder Representations from Transformers (BERT) is a technique for natural language processing (NLP) pre-training developed by Google. The original English-language BERT model comes with two pre-trained general types: (1) the BERT BASE model, a 12-layer, 768-hidden, 12-heads, 110M parameter neural network architecture, and (2) the BERT LARGE model, a 24-layer, 1024-hidden, 16-heads, 340M parameter neural network architecture; both of which were trained on the BooksCorpus with 800M words, and a version of the English Wikipedia with 2,500M words. The pre-training modeling scripts are derived from a clone of https://github.com/NVIDIA/DeepLearningExamples.git.
The pre-training scripts in this folder cover the following models:
- BERT-1.5B: a 48-layer, 1600-hidden, 25-heads, 1.5B parameter neural network architecture.
- BERT-5B: a 63-layer, 2560-hidden, 40-heads, 5B parameter neural network architecture.
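As a rough sanity check on the model names, the encoder parameter count can be approximated as 12 x layers x hidden^2. This is only a back-of-the-envelope sketch: it assumes the usual 4x feed-forward intermediate size (about 4h^2 for attention plus 8h^2 for the feed-forward block per layer) and ignores embeddings, biases and layer norms. The function name is hypothetical and not part of the repository.

```shell
# Approximate transformer encoder parameters: 12 * layers * hidden^2.
# Assumption: intermediate size is 4 * hidden; embeddings are not counted.
approx_encoder_params() {
    echo $(( 12 * $1 * $2 * $2 ))
}

approx_encoder_params 48 1600   # BERT-1.5B: prints 1474560000
approx_encoder_params 63 2560   # BERT-5B:   prints 4954521600
```

Both estimates land close to the nominal 1.5B and 5B sizes; the embedding tables account for part of the remainder.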
BERT-1.5B and BERT-5B pre-training with DeepSpeed library includes the following configurations:
- Multi-card data parallel with Zero1 and BF16 (BERT-1.5B).
- Multi-server data parallel with Zero1 and BF16 (BERT-1.5B and BERT-5B).
- Suited for datasets: wiki, bookswiki (combination of BooksCorpus and Wiki datasets).
- BERT-1.5B uses optimizer: LANS
- BERT-5B uses optimizer: LANS
- Consists of two tasks:
- Task 1 - Masked Language Model - given a sentence, the model predicts randomly masked words.
- Task 2 - Next Sentence Prediction - the model predicts whether sentence B comes after sentence A.
- The resulting (trained) model weights are language-specific (here: English) and have to be further "fitted" to a specific task (with fine-tuning).
- Heavy-weight: The training takes several hours or days.
Please follow the instructions provided in the Gaudi Installation Guide to set up the environment, including the $PYTHON environment variable. To achieve the best performance, please follow the methods outlined in the Optimizing Training Platform Guide. The guides will walk you through the process of setting up your system to run the model on Gaudi.
In the docker container, clone this repository and switch to the branch that matches your Intel Gaudi software version. You can run the hl-smi utility to determine the Intel Gaudi software version.
git clone -b [Intel Gaudi software version] https://github.com/HabanaAI/Model-References /root/Model-References
Go to the deepspeed-bert directory:
cd /root/Model-References/PyTorch/nlp/pretraining/deepspeed-bert/
Note: If the repository is not in the PYTHONPATH, add it by running the following command:
export PYTHONPATH=/path/to/Model-References:$PYTHONPATH
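If the export above is run repeatedly (for example from a shell profile), the same path ends up in PYTHONPATH several times. The following sketch prepends a directory only when it is not already listed. The helper name is hypothetical, and /root/Model-References is the clone location used earlier in this guide.

```shell
# Prepend a directory to PYTHONPATH only if it is not already listed.
add_to_pythonpath() {
    case ":${PYTHONPATH:-}:" in
        *":$1:"*) ;;                                       # already present
        *) PYTHONPATH="$1${PYTHONPATH:+:$PYTHONPATH}" ;;   # prepend
    esac
    export PYTHONPATH
}

add_to_pythonpath /root/Model-References
```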
Running this model requires the DeepSpeed library. Please follow the instructions provided in the DeepSpeed repository, on the release branch matching the Intel Gaudi software version being used.
Install the required Python packages in the docker container:
pip install -r ./requirements.txt
In the deepspeed-bert directory, the data directory provides scripts to download, extract and pre-process the Wikipedia and BookCorpus datasets.
Go to the data directory and run the data preparation script:
cd ./data
It is recommended to download wiki dataset alone using the following command:
bash create_datasets_from_start.sh
Wiki and BookCorpus datasets can be downloaded by running the script as follows:
bash create_datasets_from_start.sh wiki_books
Note: The pre-training dataset is huge and takes several hours to download. BookCorpus may have access and download constraints. The final accuracy may vary depending on the dataset and its size. The script creates a formatted dataset for Phase 1 of pre-training.
The base training and modeling scripts for pre-training are based on a clone of BERT, which is itself based on a clone of https://github.com/NVIDIA/DeepLearningExamples.
Make sure the host machine has 512 GB of RAM installed. Modify the docker run command to pass 8 Gaudi cards to the docker container. This ensures the docker has access to all the 8 cards required for multi-card pre-training.
Note: Ensure the DATA_DIR variable in the run_bert_1.5b_8x.sh script contains the correct path to the input dataset.
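Before launching a long run, it can help to verify the two prerequisites mentioned above: host memory and a populated dataset directory. The helpers below are a minimal sketch with hypothetical names and are not part of the repository scripts. Note that MemTotal in /proc/meminfo reports slightly less than the installed RAM because the kernel reserves some memory, so treat the number as indicative only.

```shell
# Hypothetical pre-flight helpers; not part of the repository scripts.

# Print total memory in GiB, read from a meminfo-style file (default: /proc/meminfo).
mem_total_gib() {
    awk '/^MemTotal:/ { printf "%d\n", $2 / 1024 / 1024 }' "${1:-/proc/meminfo}"
}

# Succeed only if the given directory exists and contains at least one entry.
check_data_dir() {
    if [ ! -d "$1" ]; then
        echo "missing: $1" >&2
        return 1
    fi
    if [ -z "$(ls -A "$1")" ]; then
        echo "empty: $1" >&2
        return 1
    fi
    echo "ok: $1"
}

# Example usage, with DATA_DIR set to the same path as in the run script:
# check_data_dir "$DATA_DIR" && mem_total_gib
```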
Run pre-training for Phase 1 on 8 HPUs:
bash ./scripts/run_bert_1.5b_8x.sh
To run the multi-server demo, make sure the host machine has 512 GB of RAM installed. Also ensure you followed the Gaudi Installation Guide to install and set up docker, so that the docker container has access to all the 8 cards required for the multi-server demo. Multi-server configuration for BERT PT training has been verified with up to 4 servers, each with 8 Gaudi cards.
Before execution of the multi-server demo scripts, make sure all network interfaces are up. You can change the state of each network interface managed by the habanalabs driver using the following command:
sudo ip link set <interface_name> up
To check whether a specific network interface is managed by the habanalabs driver, run:
sudo ethtool -i <interface_name>
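The two commands above can be combined into a loop over all interfaces. This is a sketch under one stated assumption: that the `driver:` field printed by `ethtool -i` starts with `habanalabs` for Gaudi-managed interfaces. Both function names are hypothetical, and the loop needs sudo and ethtool on the host.

```shell
# Reads `ethtool -i` output on stdin; succeeds if the driver field
# starts with "habanalabs" (assumed naming, see the note above).
is_habanalabs_driver() {
    grep -q '^driver:[[:space:]]*habanalabs'
}

# Bring up every interface whose driver matches (requires sudo and ethtool).
bring_up_gaudi_interfaces() {
    for path in /sys/class/net/*; do
        iface=$(basename "$path")
        if sudo ethtool -i "$iface" 2>/dev/null | is_habanalabs_driver; then
            sudo ip link set "$iface" up
        fi
    done
}

# Run on each server before starting the multi-server demo:
# bring_up_gaudi_interfaces
```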
By default, the Intel Gaudi docker uses port 22 for ssh. The default port configured in the demo script is port 3022. Run the following commands to configure the selected port number (port 3022 in the example below).
sed -i 's/#Port 22/Port 3022/g' /etc/ssh/sshd_config
sed -i 's/# Port 22/ Port 3022/g' /etc/ssh/ssh_config
sed -i 's/#PermitRootLogin prohibit-password/PermitRootLogin yes/' /etc/ssh/sshd_config
service ssh restart
To set up password-less ssh between all connected servers used in scale-out training, follow the steps below:
- Do the following in all the nodes' docker sessions:
mkdir -p ~/.ssh
cd ~/.ssh
ssh-keygen -t rsa -b 4096
a. Copy id_rsa.pub contents from every node's docker to every other node's docker's ~/.ssh/authorized_keys (all public keys need to be in all hosts' authorized_keys):
cat id_rsa.pub > authorized_keys
vi authorized_keys
b. Copy the contents of authorized_keys to the other systems.
c. Paste all hosts' public keys in all hosts' "authorized_keys" file.
- On each system, add all hosts (including itself) to known_hosts. The IP addresses used below are just for illustration:
ssh-keyscan -p 3022 -H 10.10.100.101 >> ~/.ssh/known_hosts
ssh-keyscan -p 3022 -H 10.10.100.102 >> ~/.ssh/known_hosts
ssh-keyscan -p 3022 -H 10.10.100.103 >> ~/.ssh/known_hosts
ssh-keyscan -p 3022 -H 10.10.100.104 >> ~/.ssh/known_hosts
- Install the DeepSpeed library on all systems as mentioned in Install DeepSpeed Library.
- Install the Python packages required for BERT-1.5B pre-training on all systems:
pip install -r /root/Model-References/PyTorch/nlp/pretraining/deepspeed-bert/requirements.txt
- Please review DeepSpeed documentation regarding multi-node training.
- A hostfile should be created according to DeepSpeed requirements. The hostfile in the scripts directory should be edited with the correct IP addresses of all hosts and their respective cards.
- DeepSpeed lets you create a ~/.deepspeed_env file to set environment variables across the hosts. Please refer to the multi-node-environment-variables section.
- It is recommended to review HCCL documentation.
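Since the same list of servers drives both the known_hosts step and the hostfile, the two can be generated from one host list. The sketch below uses hypothetical helper names and reuses the illustrative IP addresses from the steps above. The hostfile lines follow the DeepSpeed format of one `<host> slots=<count>` entry per server, with 8 slots assumed to match 8 Gaudi cards per server. KEYSCAN is an override hook added here only so the helper can be exercised without network access.

```shell
# Hypothetical helpers; the addresses are the illustrative IPs used above.

# Append hashed host keys for every node to ~/.ssh/known_hosts.
# KEYSCAN can be overridden for testing; it defaults to ssh-keyscan.
add_known_hosts() {
    port=$1; shift
    mkdir -p "$HOME/.ssh"
    for host in "$@"; do
        "${KEYSCAN:-ssh-keyscan}" -p "$port" -H "$host" >> "$HOME/.ssh/known_hosts"
    done
}

# Print a DeepSpeed-style hostfile: one "<host> slots=<cards>" line per server.
make_hostfile() {
    slots=$1; shift
    for host in "$@"; do
        echo "$host slots=$slots"
    done
}

# Print the hostfile for four servers with 8 cards each.
make_hostfile 8 10.10.100.101 10.10.100.102 10.10.100.103 10.10.100.104

# On each server (needs network access to the other nodes):
# add_known_hosts 3022 10.10.100.101 10.10.100.102 10.10.100.103 10.10.100.104
```

Redirect the output of make_hostfile into the hostfile in the scripts directory once the addresses match your servers.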
Note: Ensure the DATA_DIR variable in the run_bert_5b_32x_lans.sh script contains the correct path to the input dataset.
Run pre-training for Phase 1 on 32 HPUs:
bash ./scripts/run_bert_5b_32x_lans.sh
Below are the helper scripts for BERT-1.5B configuration and training:
- Model_Config: bert_1.5b_config.json
- DeepSpeed Config: deepspeed_config_bert_1.5b.json
- 8 card training helper script: run_bert_1.5b_8x.sh
- Hostfile: hostfile
Below are the helper scripts for BERT-5B configuration and training:
- Model_Config: bert_5b_config.json
- DeepSpeed Config: deepspeed_config_bert_5b_lans.json
- 32 card training helper script: run_bert_5b_32x_lans.sh
- Hostfile: hostfile
Validated on | Intel Gaudi Software Version | PyTorch Version | Mode |
---|---|---|---|
Gaudi 2 | 1.18.0 | 2.4.0 | Training |
- Forced static compilation for BERT Pretraining in torch.compile mode.
- Moved bash examples to use torch.compile.
- Changed Bert-5B to use x32 cards, disabled activation checkpointing, and increased mbs to 32.
- Added support for profiling via --profile and --profile_steps script flag.
- Removed non-required WAs.
- Disabled Accumulation thread as a WA.
- Reduced Max HCCL comms to 1 for memory purposes.
- Re-used DeepSpeed engine data_parallel_group in the script.
- Added support for BS=8 in Bert-5B over x128 cards.
- Added support for BS=24 in Bert-5B with checkpoint-activation over x128 cards.
- Removed weight sharing WA for decoder and embeddings.
- Added mark_step before and after the loss for memory reasons.
- Added BWD hook that will mark_step for the BertLMPredictionHead layer for memory reasons.
- Modified Bert-5B example to run with checkpoint-activations and BS=24.
- Added support for distributed emulation mode.
- Improved memory consumption for Bert-5B model via environment flags.
- Fixed time_per_step metric to report time per model step instead of acc-step.
- Moved Bert-5B model specific workarounds to run_pretraining.py.
- Removed lazy_mode argument.
- Changed default optimizer of 1.5b model to LANS.
- Added script support for using DeepSpeed activation checkpointing.
- Added option for LANS optimizer.
- Improved checkpoint mechanism for better reproducibility.
- Added new model and DeepSpeed configuration files and script example for Bert-5B.
- Created this training script based on a clone of BERT version 1.4.0.
- Made adjustment to DeepSpeed engine.
- Added recommended DeepSpeed configuration files.
- Added new model configuration file for Bert-1.5B.
This section lists the training script modifications for BERT-1.5B and BERT-5B.
- Wrapped model object with DeepSpeed wrapper.
- Call DeepSpeed's model.backward(loss) instead of loss.backward().
- Call DeepSpeed's model.step() instead of optimizer.step().
- Removed all gradient_accumulation handling from the training loop, as it is being handled by DeepSpeed engine.
- Added TensorLogger utility to enable accuracy debug.
- Removed user configuration from the user script; it is now configured through DeepSpeed.
- Enabled CUDA functionality in training script.
- Removed torch.distributed initialization from the script (also moved to DeepSpeed).
- Routed the checkpoint saving flow through DeepSpeed as well.
- Added new configuration flags to enable 1.5B training regime specification.
Placing mark_step() arbitrarily may lead to undefined behavior. We recommend keeping mark_step() as shown in the provided scripts.