forked from ROCm/ROCm
Commit
Merge pull request ROCm#3217 from peterjunpark/docs/6.1.1
docs/6.1.1: Add "Fine Tuning LLMs" how to guide (ROCm#3124)
Showing 32 changed files with 2,763 additions and 2 deletions.
20 changes: 20 additions & 0 deletions
docs/how-to/fine-tuning-llms/fine-tuning-and-inference.rst
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,20 @@ | ||
.. meta:: | ||
:description: How to fine-tune LLMs with ROCm | ||
:keywords: ROCm, LLM, fine-tuning, inference, usage, tutorial | ||
|
||
************************* | ||
Fine-tuning and inference | ||
************************* | ||
|
||
Fine-tuning using ROCm involves leveraging AMD's GPU-accelerated :doc:`libraries <rocm:reference/api-libraries>` and | ||
:doc:`tools <rocm:reference/rocm-tools>` to optimize and train deep learning models. ROCm provides a comprehensive | ||
ecosystem for deep learning development, including open-source libraries for optimized deep learning operations and | ||
ROCm-aware versions of :doc:`deep learning frameworks <../deep-learning-rocm>` such as PyTorch, TensorFlow, and JAX. | ||
|
||
Single-accelerator systems, such as a machine equipped with a single accelerator or GPU, are commonly used for | ||
smaller-scale deep learning tasks, including fine-tuning pre-trained models and running inference on moderately | ||
sized datasets. See :doc:`single-gpu-fine-tuning-and-inference`. | ||
|
||
Multi-accelerator systems, on the other hand, consist of multiple accelerators working in parallel. These systems are | ||
typically used for LLMs and other large-scale deep learning tasks where performance, scalability, and the handling of | ||
massive datasets are crucial. See :doc:`multi-gpu-fine-tuning-and-inference`. |
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,37 @@ | ||
.. meta:: | ||
:description: How to fine-tune LLMs with ROCm | ||
:keywords: ROCm, LLM, fine-tuning, usage, tutorial | ||
|
||
************************** | ||
Fine-tuning LLMs with ROCm | ||
************************** | ||
|
||
ROCm empowers the fine-tuning and optimization of large language models, making them accessible and efficient for | ||
specialized tasks. ROCm supports the broader AI ecosystem to ensure seamless integration with open frameworks, | ||
models, and tools. | ||
|
||
For more information, see `What is ROCm? <https://rocm.docs.amd.com/en/latest/what-is-rocm.html>`_. | ||
|
||
Throughout the following topics, this guide discusses the goals and :ref:`challenges of fine-tuning a large language | ||
model <fine-tuning-llms-concept-challenge>` like Llama 2. Then, it introduces :ref:`common methods of optimizing your | ||
fine-tuning <fine-tuning-llms-concept-optimizations>` using techniques like LoRA with libraries like PEFT. In the | ||
sections that follow, you'll find practical guides on libraries and tools to accelerate your fine-tuning. | ||
|
||
- :doc:`Conceptual overview of fine-tuning LLMs <overview>` | ||
|
||
- :doc:`Fine-tuning and inference <fine-tuning-and-inference>` using a | ||
:doc:`single-accelerator <single-gpu-fine-tuning-and-inference>` or | ||
:doc:`multi-accelerator <multi-gpu-fine-tuning-and-inference>` system. | ||
|
||
- :doc:`Model quantization <model-quantization>` | ||
|
||
- :doc:`Model acceleration libraries <model-acceleration-libraries>` | ||
|
||
- :doc:`LLM inference frameworks <llm-inference-frameworks>` | ||
|
||
- :doc:`Optimizing with Composable Kernel <optimizing-with-composable-kernel>` | ||
|
||
- :doc:`Optimizing Triton kernels <optimizing-triton-kernel>` | ||
|
||
- :doc:`Profiling and debugging <profiling-and-debugging>` | ||
|
218 changes: 218 additions & 0 deletions
docs/how-to/fine-tuning-llms/llm-inference-frameworks.rst
Original file line number | Diff line number | Diff line change |
---|---|---|
@@ -0,0 +1,218 @@ | ||
.. meta:: | ||
:description: How to fine-tune LLMs with ROCm | ||
:keywords: ROCm, LLM, fine-tuning, usage, tutorial, inference, vLLM, TGI, text generation inference | ||
|
||
************************ | ||
LLM inference frameworks | ||
************************ | ||
|
||
This section discusses how to set up and use `vLLM <https://docs.vllm.ai/en/latest>`_ and `Hugging Face TGI | ||
<https://huggingface.co/docs/text-generation-inference/en/index>`_ using | ||
:doc:`single-accelerator <single-gpu-fine-tuning-and-inference>` and | ||
:doc:`multi-accelerator <multi-gpu-fine-tuning-and-inference>` systems. | ||
|
||
.. _fine-tuning-llms-vllm: | ||
|
||
vLLM inference | ||
============== | ||
|
||
vLLM is known for its paged attention algorithm, whose paging scheme reduces memory consumption and increases | ||
throughput. Instead of allocating GPU high-bandwidth memory (HBM) for the maximum output token length of the model, | ||
paged attention allocates GPU HBM dynamically according to the actual decoding length. Paged attention is also | ||
effective when multiple requests share the same key and value contents, for example with a large beam search width or | ||
multiple parallel requests. | ||
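To make the memory argument concrete, the following sketch compares how many key-value cache token slots are reserved when memory is sized for the maximum output length with how many are reserved when memory is handed out in fixed-size blocks as decoding proceeds. The block size of 16 tokens, the maximum length of 2048, and the per-request lengths are illustrative assumptions, and both functions are hypothetical helpers, not vLLM APIs.

```python
import math
from typing import Sequence


def preallocated_slots(num_requests: int, max_len: int) -> int:
    """Token slots reserved when every request gets max_len slots up front."""
    return num_requests * max_len


def paged_slots(actual_lens: Sequence[int], block_size: int = 16) -> int:
    """Token slots reserved when each request holds only the blocks it has touched.

    block_size=16 is an assumed value chosen for illustration.
    """
    return sum(math.ceil(n / block_size) * block_size for n in actual_lens)


# Hypothetical decoding lengths for four concurrent requests.
actual = [37, 120, 512, 9]
static = preallocated_slots(len(actual), max_len=2048)
paged = paged_slots(actual)
print(static, paged)  # prints "8192 704"
```

In this toy case the paged scheme reserves less than a tenth of the slots, and the waste per request is bounded by one partially filled block.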
|
||
vLLM also incorporates many modern LLM acceleration and quantization algorithms, such as Flash Attention, HIP and CUDA | ||
graphs, tensor parallel multi-GPU, GPTQ, AWQ, and token speculation. | ||
|
||
Installing vLLM | ||
--------------- | ||
|
||
1. To install vLLM, run the following commands. | ||
|
||
.. code-block:: shell | ||
# Install from the source | ||
git clone https://github.com/ROCm/vllm.git | ||
cd vllm | ||
PYTORCH_ROCM_ARCH=gfx942 python setup.py install # MI300 series | ||
.. _fine-tuning-llms-vllm-rocm-docker-image: | ||
|
||
2. Run the following commands to build a Docker image ``vllm-rocm``. | ||
|
||
.. code-block:: shell | ||
git clone https://github.com/vllm-project/vllm.git | ||
cd vllm | ||
docker build -f Dockerfile.rocm -t vllm-rocm . | ||
.. tab-set:: | ||
|
||
.. tab-item:: vLLM on a single-accelerator system | ||
:sync: single | ||
|
||
3. To use vLLM as an API server to serve inference requests, first start a container using the :ref:`vllm-rocm | ||
Docker image <fine-tuning-llms-vllm-rocm-docker-image>`. | ||
|
||
.. code-block:: shell | ||
docker run -it \ | ||
--network=host \ | ||
--group-add=video \ | ||
--ipc=host \ | ||
--cap-add=SYS_PTRACE \ | ||
--security-opt seccomp=unconfined \ | ||
--device /dev/kfd \ | ||
--device /dev/dri \ | ||
-v <path/to/model>:/app/model \ | ||
vllm-rocm \ | ||
bash | ||
4. Inside the container, start the API server to run on a single accelerator on port 8000 using the following command. | ||
|
||
.. code-block:: shell | ||
python -m vllm.entrypoints.api_server --model /app/model --dtype float16 --port 8000 & | ||
The following log message displayed in your command line indicates that the server is listening for requests. | ||
|
||
.. image:: ../../data/how-to/fine-tuning-llms/vllm-single-gpu-log.png | ||
:alt: vLLM API server log message | ||
:align: center | ||
|
||
5. To test, send it a curl request containing a prompt. | ||
|
||
.. code-block:: shell | ||
curl http://localhost:8000/generate -H "Content-Type: application/json" -d '{"prompt": "What is AMD Instinct?", "max_tokens": 80, "temperature": 0.0 }' | ||
You should receive a response like the following. | ||
|
||
.. code-block:: text | ||
{"text":["What is AMD Instinct?\nAmd Instinct is a brand new line of high-performance computing (HPC) processors from Advanced Micro Devices (AMD). These processors are designed to deliver unparalleled performance for HPC workloads, including scientific simulations, data analytics, and machine learning.\nThe Instinct lineup includes a range of processors, from the entry-level Inst"]} | ||
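The same request can be issued from Python. The sketch below uses only the standard library and mirrors the curl command above: it posts a JSON body to ``/generate`` and reads the ``text`` list from the response. The function names are hypothetical helpers, and the base URL assumes the server started in the previous step is listening on ``localhost:8000``.

```python
import json
import urllib.request


def build_generate_request(prompt, max_tokens=80, temperature=0.0,
                           base_url="http://localhost:8000"):
    """Build a POST request equivalent to the curl example."""
    payload = {"prompt": prompt, "max_tokens": max_tokens, "temperature": temperature}
    return urllib.request.Request(
        f"{base_url}/generate",
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def extract_texts(response_body):
    """Return the list of generated strings from a {"text": [...]} response."""
    return json.loads(response_body)["text"]


def generate(prompt, **kwargs):
    """Send the request to a running server and return the generated strings."""
    with urllib.request.urlopen(build_generate_request(prompt, **kwargs)) as resp:
        return extract_texts(resp.read())
```

With the API server running, ``generate("What is AMD Instinct?")[0]`` should return a string that begins with the prompt, as in the response shown above.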
.. tab-item:: vLLM on a multi-accelerator system | ||
:sync: multi | ||
|
||
3. To use vLLM as an API server to serve inference requests, first start a container using the :ref:`vllm-rocm | ||
Docker image <fine-tuning-llms-vllm-rocm-docker-image>`. | ||
|
||
.. code-block:: shell | ||
docker run -it \ | ||
--network=host \ | ||
--group-add=video \ | ||
--ipc=host \ | ||
--cap-add=SYS_PTRACE \ | ||
--security-opt seccomp=unconfined \ | ||
--device /dev/kfd \ | ||
--device /dev/dri \ | ||
-v <path/to/model>:/app/model \ | ||
vllm-rocm \ | ||
bash | ||
4. To run the API server on multiple GPUs, use the ``-tp`` or ``--tensor-parallel-size`` parameter. For example, to use two | ||
GPUs, start the API server using the following command. | ||
|
||
.. code-block:: shell | ||
python -m vllm.entrypoints.api_server --model /app/model --dtype float16 -tp 2 --port 8000 & | ||
5. To run multiple API server instances, specify a different port for each server, and use ``ROCR_VISIBLE_DEVICES`` to | ||
isolate each instance to a different accelerator. | ||
|
||
For example, to run two API servers, one on port 8000 using GPUs 0 and 1 and one on port 8001 using GPUs 2 and 3, use | ||
a command like the following. | ||
|
||
.. code-block:: shell | ||
ROCR_VISIBLE_DEVICES=0,1 python -m vllm.entrypoints.api_server --model /data/llama-2-7b-chat-hf --dtype float16 -tp 2 --port 8000 & | ||
ROCR_VISIBLE_DEVICES=2,3 python -m vllm.entrypoints.api_server --model /data/llama-2-7b-chat-hf --dtype float16 -tp 2 --port 8001 & | ||
6. To test, send it a curl request containing a prompt. | ||
|
||
.. code-block:: shell | ||
curl http://localhost:8000/generate -H "Content-Type: application/json" -d '{"prompt": "What is AMD Instinct?", "max_tokens": 80, "temperature": 0.0 }' | ||
You should receive a response like the following. | ||
|
||
.. code-block:: text | ||
{"text":["What is AMD Instinct?\nAmd Instinct is a brand new line of high-performance computing (HPC) processors from Advanced Micro Devices (AMD). These processors are designed to deliver unparalleled performance for HPC workloads, including scientific simulations, data analytics, and machine learning.\nThe Instinct lineup includes a range of processors, from the entry-level Inst"]} | ||
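The multi-instance setup in step 5 follows a simple pattern: each instance gets its own port, its own ``ROCR_VISIBLE_DEVICES`` list, and a tensor-parallel size equal to the number of GPUs in that list. The sketch below generates those command lines for an arbitrary grouping of GPUs. ``instance_command`` and ``plan_instances`` are hypothetical helpers written for this illustration; only the flags already shown in this section are used.

```python
import shlex


def instance_command(model_path, port, tensor_parallel_size):
    """Argument list for one API server instance, using the flags shown above."""
    return [
        "python", "-m", "vllm.entrypoints.api_server",
        "--model", model_path,
        "--dtype", "float16",
        "-tp", str(tensor_parallel_size),
        "--port", str(port),
    ]


def plan_instances(model_path, gpu_groups, first_port=8000):
    """Pair each GPU group with an environment and a command on consecutive ports."""
    plans = []
    for offset, gpus in enumerate(gpu_groups):
        env = {"ROCR_VISIBLE_DEVICES": ",".join(str(g) for g in gpus)}
        plans.append((env, instance_command(model_path, first_port + offset, len(gpus))))
    return plans


# Reproduce the two-instance example: GPUs 0,1 on port 8000 and GPUs 2,3 on port 8001.
for env, cmd in plan_instances("/data/llama-2-7b-chat-hf", [[0, 1], [2, 3]]):
    prefix = " ".join(f"{key}={value}" for key, value in env.items())
    print(prefix, shlex.join(cmd), "&")
```

To launch the instances from Python instead of printing them, each pair could be passed to ``subprocess.Popen(cmd, env={**os.environ, **env})``.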
.. _fine-tuning-llms-tgi: | ||
|
||
Hugging Face TGI | ||
================ | ||
|
||
Text Generation Inference (TGI) is an LLM serving framework from Hugging | ||
Face, and it also supports the majority of high-performance LLM | ||
acceleration algorithms such as Flash Attention, Paged Attention, | ||
CUDA/HIP graph, tensor parallel multi-GPU, GPTQ, AWQ, and token | ||
speculation. | ||
|
||
.. tip:: | ||
|
||
In addition to LLM serving capability, TGI also provides the `Text Generation Inference benchmarking tool | ||
<https://github.com/huggingface/text-generation-inference/blob/main/benchmark/README.md>`_. | ||
|
||
Installing TGI | ||
-------------- | ||
|
||
1. To install the TGI Docker image, run the following commands. | ||
|
||
.. code-block:: shell | ||
# Install from Dockerfile | ||
git clone https://github.com/huggingface/text-generation-inference.git -b mi300-compat | ||
cd text-generation-inference | ||
docker build . -f Dockerfile.rocm | ||
.. tab-set:: | ||
|
||
.. tab-item:: TGI on a single-accelerator system | ||
:sync: single | ||
|
||
2. Launch a model using TGI server on a single accelerator. | ||
|
||
.. code-block:: shell | ||
export ROCM_USE_FLASH_ATTN_V2_TRITON=True | ||
text-generation-launcher --model-id NousResearch/Meta-Llama-3-70B --dtype float16 --port 8000 & | ||
3. To test, send it a curl request containing a prompt. | ||
|
||
.. code-block:: shell | ||
curl http://localhost:8000/generate_stream -X POST -d '{"inputs":"What is AMD Instinct?","parameters":{"max_new_tokens":20}}' -H 'Content-Type: application/json' | ||
You should receive a response like the following. | ||
|
||
.. code-block:: text | ||
data:{"index":20,"token":{"id":304,"text":" in","logprob":-1.2822266,"special":false},"generated_text":" AMD Instinct is a new family of data center GPUs designed to accelerate the most demanding workloads in","details":null} | ||
.. tab-item:: TGI on a multi-accelerator system | ||
:sync: multi | ||
|
||
2. Launch a model using TGI server on multiple accelerators (4 in this case). | ||
|
||
.. code-block:: shell | ||
export ROCM_USE_FLASH_ATTN_V2_TRITON=True | ||
text-generation-launcher --model-id NousResearch/Meta-Llama-3-8B --dtype float16 --port 8000 --num-shard 4 & | ||
3. To test, send it a curl request containing a prompt. | ||
|
||
.. code-block:: shell | ||
curl http://localhost:8000/generate_stream -X POST -d '{"inputs":"What is AMD Instinct?","parameters":{"max_new_tokens":20}}' -H 'Content-Type: application/json' | ||
You should receive a response like the following. | ||
|
||
.. code-block:: text | ||
data:{"index":20,"token":{"id":304,"text":" in","logprob":-1.2773438,"special":false},"generated_text":" AMD Instinct is a new family of data center GPUs designed to accelerate the most demanding workloads in","details":null} |
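The ``/generate_stream`` endpoint returns one ``data:`` line per generated token, as in the responses above. The sketch below shows one way to decode such lines and pick out the completed text. It assumes that only the final event carries a non-null ``generated_text`` field, which matches the sample response but is not stated elsewhere in this guide; ``parse_sse_line`` and ``final_text`` are hypothetical helper names.

```python
import json


def parse_sse_line(line):
    """Return the decoded JSON event for a 'data:' line, or None for any other line."""
    line = line.strip()
    if not line.startswith("data:"):
        return None
    return json.loads(line[len("data:"):])


def final_text(lines):
    """Return generated_text from the last event that carries it, or None."""
    result = None
    for line in lines:
        event = parse_sse_line(line)
        if event and event.get("generated_text") is not None:
            result = event["generated_text"]
    return result


# The sample event from the single-accelerator response above.
sample = (
    'data:{"index":20,"token":{"id":304,"text":" in","logprob":-1.2822266,'
    '"special":false},"generated_text":" AMD Instinct is a new family of data '
    'center GPUs designed to accelerate the most demanding workloads in",'
    '"details":null}'
)
print(final_text([sample]))
```

Blank keep-alive lines and any line without the ``data:`` prefix are skipped, so the helper can be fed the raw lines of the HTTP response body.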