The Hugging Face PyTorch Training Containers are Docker containers for training Hugging Face models on Google Cloud AI Platform. Two containers are currently available, one per supported accelerator: GPU and TPU. Both come with all the necessary dependencies to train Hugging Face models on Google Cloud AI Platform.
> [!NOTE]
> These containers are named PyTorch containers since PyTorch is the backend framework used for training the models; they also come with all the required Hugging Face libraries installed.
To check which Hugging Face DLCs are published, you can either browse the Google Cloud Artifact Registry or use the `gcloud` command to list the images tagged `huggingface-pytorch-training`, as follows:

```bash
gcloud container images list --repository="us-docker.pkg.dev/deeplearning-platform-release/gcr.io" | grep "huggingface-pytorch-training"
```
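If you need the same filtering inside a script rather than a shell pipeline, the `grep` above can be replicated in a few lines of Python. This is an illustrative sketch only: it shells out to `gcloud` (or accepts already-captured listing text) and keeps the lines that mention the training tag.

```python
import subprocess


def list_training_images(repository, output=None):
    """Filter an image listing down to the PyTorch Training containers.

    If `output` is None, the listing is fetched by shelling out to `gcloud`;
    otherwise `output` is treated as already-captured listing text.
    """
    if output is None:
        output = subprocess.run(
            ["gcloud", "container", "images", "list", f"--repository={repository}"],
            capture_output=True, text=True, check=True,
        ).stdout
    return [line.strip() for line in output.splitlines()
            if "huggingface-pytorch-training" in line]
```

Passing pre-captured text keeps the helper testable without `gcloud` installed.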
Below you will find instructions on how to run the PyTorch Training containers available within this repository. Before proceeding, make sure you have Docker installed on your local or remote instance; if not, please follow the instructions on how to install Docker here. Additionally, if you want to run the Docker container on GPUs, you will need to install the NVIDIA Container Toolkit.
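Before launching a job it can save time to confirm the GPU toolchain is on the `PATH`. The helper below is purely illustrative: `docker`, `nvidia-smi`, and `nvidia-ctk` are the binaries typically installed by Docker, the NVIDIA driver, and the NVIDIA Container Toolkit, respectively.

```python
import shutil


def missing_gpu_prereqs(which=shutil.which):
    """Return the names of GPU prerequisites not found on the PATH."""
    required = ("docker", "nvidia-smi", "nvidia-ctk")
    return [tool for tool in required if which(tool) is None]


if __name__ == "__main__":
    missing = missing_gpu_prereqs()
    if missing:
        print("Missing prerequisites:", ", ".join(missing))
    else:
        print("GPU prerequisites look good.")
```

The `which` parameter is injectable so the check can be exercised without the real tools installed.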
The PyTorch Training containers start the training job on `docker run` and exit once the training job finishes. Since the container is offered for both GPU and TPU accelerators, an example for each is provided below.
- GPU: This example showcases how to fine-tune an LLM via `trl` on a GPU instance using the PyTorch Training container, as it comes with `trl` installed.

  ```bash
  docker run --gpus all -ti \
      -v $(pwd)/artifact:/artifact \
      -e HF_TOKEN=$(cat ~/.cache/huggingface/token) \
      us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-pytorch-training-cu121.2-3.transformers.4-42.ubuntu2204.py310 \
      trl sft \
      --model_name_or_path google/gemma-2b \
      --attn_implementation "flash_attention_2" \
      --torch_dtype "bfloat16" \
      --dataset_name OpenAssistant/oasst_top1_2023-08-25 \
      --dataset_text_field "text" \
      --max_steps 100 \
      --logging_steps 10 \
      --bf16 True \
      --per_device_train_batch_size 4 \
      --use_peft True \
      --load_in_4bit True \
      --output_dir /artifact
  ```

  Note that `--output_dir` points at `/artifact`, the path mounted from the host via `-v`, so the fine-tuned artifacts persist after the container exits.
> [!NOTE]
> For a more detailed explanation and a diverse set of examples, please check the ./examples directory that contains examples on both Google Kubernetes Engine (GKE) and Google Vertex AI.
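When launching many variations of a job like the one above, it can be easier to assemble the `docker run` invocation programmatically than to edit a long shell line. The sketch below is illustrative only; the image tag mirrors the GPU example, and the flag names are passed through verbatim to `trl sft`.

```python
import os
import shlex

# Image tag copied from the GPU example above.
IMAGE = ("us-docker.pkg.dev/deeplearning-platform-release/gcr.io/"
         "huggingface-pytorch-training-cu121.2-3.transformers.4-42.ubuntu2204.py310")


def build_sft_command(image, flags):
    """Assemble a `docker run ... trl sft` argv from a flag mapping."""
    cmd = [
        "docker", "run", "--gpus", "all", "-ti",
        "-v", os.path.join(os.getcwd(), "artifact") + ":/artifact",
        image, "trl", "sft",
    ]
    for name, value in flags.items():
        cmd += ["--" + name, str(value)]
    return cmd


cmd = build_sft_command(IMAGE, {"model_name_or_path": "google/gemma-2b",
                                "max_steps": 100})
print(shlex.join(cmd))
```

Returning an argv list (rather than one string) makes the command safe to hand to `subprocess.run` directly.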
- TPU: This example showcases how to deploy a Jupyter Notebook server on a TPU instance (such as `v5litepod-8`) using the PyTorch Training container, as it comes with `optimum-tpu` installed. You can then import a Jupyter Notebook from the ones defined within the `optimum-tpu` repository, or just reuse the Jupyter Notebook that comes within the PyTorch Training container, i.e. `gemma-tuning.ipynb`, and run it.

  ```bash
  docker run --rm --net host --privileged \
      -v $(pwd)/artifact:/notebooks/output \
      us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-pytorch-training-tpu.2.4.0.transformers.4.41.1.py310 \
      jupyter notebook \
      --port 8888 \
      --allow-root \
      --no-browser \
      notebooks
  ```
> [!NOTE]
> Find more detailed examples on TPU fine-tuning in the `optimum-tpu` repository.
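On startup, Jupyter prints an access URL containing a one-time token to its logs (retrievable via `docker logs <container>`). A small helper can pull that URL out of captured log text; this is an illustrative sketch that assumes the standard `?token=` URL format Jupyter prints on startup.

```python
import re


def find_jupyter_url(log_text):
    """Return the first http(s) URL with a ?token=... query found in the logs,
    or None if no such URL is present."""
    match = re.search(r"https?://\S+\?token=\w+", log_text)
    return match.group(0) if match else None
```

Pipe the output of `docker logs` into a file (or read it via `subprocess`) and pass the text to this function.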
> [!WARNING]
> Building the containers yourself is not recommended, since they are already built by the Hugging Face and Google Cloud teams and provided openly; the recommended approach is to use the pre-built containers available in Google Cloud's Artifact Registry instead.
The PyTorch Training containers come in two variants depending on the accelerator used for training, either GPU or TPU, and each has different constraints when building the Docker image, as described below:
- GPU: To build the PyTorch Training container for GPU, an instance with at least one NVIDIA GPU available is required to install `flash-attn` (used to speed up the attention layers during training and inference).

  ```bash
  docker build \
      -t us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-pytorch-training-cu121.2-3.transformers.4-42.ubuntu2204.py310 \
      -f containers/pytorch/training/gpu/2.3.0/transformers/4.42.3/py310/Dockerfile .
  ```
- TPU: To build the PyTorch Training container for Google Cloud TPUs, an instance with at least one TPU available is required to install `optimum-tpu`, a Python library with Google TPU optimizations for `transformers` models, making its integration seamless.

  ```bash
  docker build \
      -t us-docker.pkg.dev/deeplearning-platform-release/gcr.io/huggingface-pytorch-training-tpu.2.4.0.transformers.4.41.1.py310 \
      -f containers/pytorch/training/tpu/2.4.0/transformers/4.41.1/py310/Dockerfile .
  ```
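The Dockerfile paths in both build commands follow the same layout, `containers/pytorch/training/<accelerator>/<torch version>/transformers/<transformers version>/py310/Dockerfile`, so locating the right one can be scripted. The helper below is a sketch under that assumption about the repository layout.

```python
def dockerfile_path(accelerator, torch_version, transformers_version,
                    python_tag="py310"):
    """Build the repository-relative Dockerfile path for a training container,
    assuming the directory layout used by the build commands above."""
    if accelerator not in ("gpu", "tpu"):
        raise ValueError("accelerator must be 'gpu' or 'tpu'")
    return (f"containers/pytorch/training/{accelerator}/{torch_version}"
            f"/transformers/{transformers_version}/{python_tag}/Dockerfile")


print(dockerfile_path("gpu", "2.3.0", "4.42.3"))
# containers/pytorch/training/gpu/2.3.0/transformers/4.42.3/py310/Dockerfile
```

The rendered path can then be passed straight to `docker build -f`.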