Commit
Merge branch 'main' into rel-eng/dial-in-accuracy-tests
Showing 41 changed files with 847 additions and 259 deletions.
@@ -0,0 +1,110 @@
.. _apc:

Introduction
============

What is Automatic Prefix Caching
--------------------------------

Automatic Prefix Caching (APC for short) caches the KV cache of existing queries, so that a new query can directly reuse the KV cache if it shares the same prefix with one of the existing queries, allowing it to skip the computation of the shared part.

.. note::

    Technical details on how vLLM implements APC are in the next page.

Enabling APC in vLLM
--------------------

Set ``enable_prefix_caching=True`` in the vLLM engine to enable APC. Here is an example:

.. code-block:: python

    import time
    from vllm import LLM, SamplingParams

    # A prompt containing a large markdown table. The table is randomly generated by GPT-4.
    LONG_PROMPT = "You are a helpful assistant that recognizes the content of tables in markdown format. Here is a table as follows.\n# Table\n" + """
    | ID | Name | Age | Occupation | Country | Email | Phone Number | Address |
    |-----|---------------|-----|---------------|---------------|------------------------|----------------|------------------------------|
    | 1 | John Doe | 29 | Engineer | USA | [email protected] | 555-1234 | 123 Elm St, Springfield, IL |
    | 2 | Jane Smith | 34 | Doctor | Canada | [email protected] | 555-5678 | 456 Oak St, Toronto, ON |
    | 3 | Alice Johnson | 27 | Teacher | UK | [email protected] | 555-8765 | 789 Pine St, London, UK |
    | 4 | Bob Brown | 45 | Artist | Australia | [email protected] | 555-4321 | 321 Maple St, Sydney, NSW |
    | 5 | Carol White | 31 | Scientist | New Zealand | [email protected] | 555-6789 | 654 Birch St, Wellington, NZ |
    | 6 | Dave Green | 28 | Lawyer | Ireland | [email protected] | 555-3456 | 987 Cedar St, Dublin, IE |
    | 7 | Emma Black | 40 | Musician | USA | [email protected] | 555-1111 | 246 Ash St, New York, NY |
    | 8 | Frank Blue | 37 | Chef | Canada | [email protected] | 555-2222 | 135 Spruce St, Vancouver, BC |
    | 9 | Grace Yellow | 50 | Engineer | UK | [email protected] | 555-3333 | 864 Fir St, Manchester, UK |
    | 10 | Henry Violet | 32 | Artist | Australia | [email protected] | 555-4444 | 753 Willow St, Melbourne, VIC|
    | 11 | Irene Orange | 26 | Scientist | New Zealand | [email protected] | 555-5555 | 912 Poplar St, Auckland, NZ |
    | 12 | Jack Indigo | 38 | Teacher | Ireland | [email protected] | 555-6666 | 159 Elm St, Cork, IE |
    | 13 | Karen Red | 41 | Lawyer | USA | [email protected] | 555-7777 | 357 Cedar St, Boston, MA |
    | 14 | Leo Brown | 30 | Chef | Canada | [email protected] | 555-8888 | 246 Oak St, Calgary, AB |
    | 15 | Mia Green | 33 | Musician | UK | [email protected] | 555-9999 | 975 Pine St, Edinburgh, UK |
    | 16 | Noah Yellow | 29 | Doctor | Australia | [email protected] | 555-0000 | 864 Birch St, Brisbane, QLD |
    | 17 | Olivia Blue | 35 | Engineer | New Zealand | [email protected] | 555-1212 | 753 Maple St, Hamilton, NZ |
    | 18 | Peter Black | 42 | Artist | Ireland | [email protected] | 555-3434 | 912 Fir St, Limerick, IE |
    | 19 | Quinn White | 28 | Scientist | USA | [email protected] | 555-5656 | 159 Willow St, Seattle, WA |
    | 20 | Rachel Red | 31 | Teacher | Canada | [email protected] | 555-7878 | 357 Poplar St, Ottawa, ON |
    | 21 | Steve Green | 44 | Lawyer | UK | [email protected] | 555-9090 | 753 Elm St, Birmingham, UK |
    | 22 | Tina Blue | 36 | Musician | Australia | [email protected] | 555-1213 | 864 Cedar St, Perth, WA |
    | 23 | Umar Black | 39 | Chef | New Zealand | [email protected] | 555-3435 | 975 Spruce St, Christchurch, NZ|
    | 24 | Victor Yellow | 43 | Engineer | Ireland | [email protected] | 555-5657 | 246 Willow St, Galway, IE |
    | 25 | Wendy Orange | 27 | Artist | USA | [email protected] | 555-7879 | 135 Elm St, Denver, CO |
    | 26 | Xavier Green | 34 | Scientist | Canada | [email protected] | 555-9091 | 357 Oak St, Montreal, QC |
    | 27 | Yara Red | 41 | Teacher | UK | [email protected] | 555-1214 | 975 Pine St, Leeds, UK |
    | 28 | Zack Blue | 30 | Lawyer | Australia | [email protected] | 555-3436 | 135 Birch St, Adelaide, SA |
    | 29 | Amy White | 33 | Musician | New Zealand | [email protected] | 555-5658 | 159 Maple St, Wellington, NZ |
    | 30 | Ben Black | 38 | Chef | Ireland | [email protected] | 555-7870 | 246 Fir St, Waterford, IE |
    """

    def get_generation_time(llm, sampling_params, prompts):
        # time the generation
        start_time = time.time()
        output = llm.generate(prompts, sampling_params=sampling_params)
        end_time = time.time()
        # print the output and generation time
        print(f"Output: {output[0].outputs[0].text}")
        print(f"Generation time: {end_time - start_time} seconds.")

    # set enable_prefix_caching=True to enable APC
    llm = LLM(
        model='lmsys/longchat-13b-16k',
        enable_prefix_caching=True
    )

    sampling_params = SamplingParams(temperature=0, max_tokens=100)

    # Querying the age of John Doe
    get_generation_time(
        llm,
        sampling_params,
        LONG_PROMPT + "Question: what is the age of John Doe? Your answer: The age of John Doe is ",
    )

    # Querying the age of Zack Blue
    # This query will be faster since vLLM avoids computing the KV cache of LONG_PROMPT again.
    get_generation_time(
        llm,
        sampling_params,
        LONG_PROMPT + "Question: what is the age of Zack Blue? Your answer: The age of Zack Blue is ",
    )

Example workloads
-----------------

We describe two example workloads where APC can provide a huge performance benefit:

- Long document query, where the user repeatedly queries the same long document (e.g. a software manual or an annual report) with different questions. In this case, instead of processing the long document again and again, APC allows vLLM to process the document *only once*; all future requests can avoid recomputing it by reusing its KV cache. This allows vLLM to serve future requests with much higher throughput and much lower latency.
- Multi-round conversation, where the user may chat with the application multiple times in the same chat session (see the sketch below). In this case, instead of processing the whole chat history again and again, APC allows vLLM to reuse the processing results of the history across all future rounds of conversation, allowing vLLM to serve future requests with much higher throughput and much lower latency.
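
For the multi-round case, here is a minimal sketch of the pattern (illustrative only; the model name and prompts are placeholders, and chat-template handling is omitted for brevity):

.. code-block:: python

    from vllm import LLM, SamplingParams

    llm = LLM(model='lmsys/longchat-13b-16k', enable_prefix_caching=True)
    sampling_params = SamplingParams(temperature=0, max_tokens=100)

    # Each round's prompt starts with the full history of the previous rounds,
    # so APC can reuse the KV cache of the history instead of recomputing it.
    history = "System: You are a helpful assistant.\n"
    for user_turn in ["User: Summarize PagedAttention for me.\n",
                      "User: Now compare it to a naive KV cache.\n"]:
        prompt = history + user_turn + "Assistant:"
        answer = llm.generate([prompt], sampling_params=sampling_params)[0].outputs[0].text
        history = prompt + answer + "\n"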


Limits
------

APC in general does not reduce the performance of vLLM. That said, APC only reduces the time spent processing the queries (the prefilling phase) and does not reduce the time spent generating new tokens (the decoding phase). So APC does not bring a performance gain when vLLM spends most of its time generating answers to the queries (e.g. when the answers are long), or when new queries do not share a prefix with any existing query (so the computation cannot be reused).
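
As a rough way to quantify this (a simplified back-of-the-envelope model, not a measured vLLM result): if a fraction :math:`p` of a request's end-to-end time is spent prefilling the shared prefix and the prefix-cache hit rate is :math:`h`, the expected speedup from APC is at most

.. math::

   \frac{1}{1 - p \cdot h}

so decode-dominated workloads (small :math:`p`) or workloads with few shared prefixes (small :math:`h`) see little benefit.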
@@ -0,0 +1,43 @@
# Implementation

The core idea of PagedAttention is to partition the KV cache of each request into KV blocks. Each block contains the attention keys and values for a fixed number of tokens. The PagedAttention algorithm allows these blocks to be stored in non-contiguous physical memory, so that we can eliminate memory fragmentation by allocating the memory on demand.

To automatically cache the KV cache, we utilize the following key observation: each KV block can be uniquely identified by the tokens within the block and the tokens in the prefix before the block.

```
         Block 1                   Block 2                  Block 3
         [A gentle breeze stirred] [the leaves as children] [laughed in the distance]
Block 1: |<--- block tokens ---->|
Block 2: |<------- prefix ------>| |<--- block tokens --->|
Block 3: |<------------------ prefix -------------------->| |<--- block tokens ---->|
```

In the example above, the KV cache in the first block can be uniquely identified with the tokens “A gentle breeze stirred”. The third block can be uniquely identified with the tokens in the block “laughed in the distance”, along with the prefix tokens “A gentle breeze stirred the leaves as children”. Therefore, we can build the following one-to-one mapping:

```
hash(prefix tokens + block tokens) <--> KV Block
```
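
To make the mapping concrete, here is a minimal sketch of how such block hashes could be computed (illustrative code, not vLLM's actual implementation; the block size below is just an example):

```python
BLOCK_SIZE = 16  # example: number of tokens per KV block

def block_hashes(token_ids: list[int], block_size: int = BLOCK_SIZE) -> list[int]:
    """Return one hash per full block; each hash covers the block plus its whole prefix."""
    hashes = []
    for i in range(len(token_ids) // block_size):
        # hash(prefix tokens + block tokens): identical prefixes yield identical hashes,
        # so two requests sharing a prefix map their shared blocks to the same entry.
        hashes.append(hash(tuple(token_ids[:(i + 1) * block_size])))
    return hashes
```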

With this mapping, we can add another level of indirection in vLLM’s KV cache management. Previously, each sequence in vLLM maintained a mapping from its logical KV blocks to physical blocks. To achieve automatic caching of KV blocks, we map the logical KV blocks to their hash values and maintain a global hash table of all the physical blocks. In this way, all the KV blocks sharing the same hash value (e.g., shared prefix blocks across two requests) can be mapped to the same physical block and share the memory space.

This design achieves automatic prefix caching without the need to maintain a tree structure among the KV blocks. More specifically, all of the blocks are independent of each other and can be allocated and freed on their own, which enables us to manage the KV cache in the same way as ordinary caches in operating systems.

# Generalized Caching Policy

Keeping all the KV blocks in a hash table enables vLLM to cache KV blocks from earlier requests to save memory and accelerate the computation of future requests. For example, if a new request shares the system prompt with the previous request, the KV cache of the shared prompt can directly be used for the new request without recomputation. However, the total KV cache space is limited, and we have to decide which KV blocks to keep or evict when the cache is full.

Managing the KV cache with a hash table allows us to implement flexible caching policies. As an example, the current vLLM implements the following eviction policy (a sketch follows the list):

* When there are no free blocks left, we evict a KV block whose reference count (i.e., the number of current requests using the block) equals 0.
* If there are multiple blocks with a reference count of 0, we evict the least recently used (LRU) block first.
* If multiple blocks have the same last access time, we prioritize evicting the block at the end of the longest prefix (i.e., the one with the maximum number of blocks before it).
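
For concreteness, a minimal sketch of this eviction rule (hypothetical field and function names, not vLLM's actual classes):

```python
from dataclasses import dataclass
from typing import Optional

@dataclass
class CachedBlock:
    block_hash: int
    ref_count: int           # number of running requests currently using the block
    last_access_time: float  # timestamp updated on every access
    num_blocks_before: int   # how many blocks precede it in its prefix

def select_block_to_evict(blocks: list[CachedBlock]) -> Optional[CachedBlock]:
    # Only blocks that no running request uses are evictable.
    candidates = [b for b in blocks if b.ref_count == 0]
    if not candidates:
        return None
    # Evict the least recently used block; among blocks with the same last access
    # time, evict the one at the end of the longest prefix (largest num_blocks_before).
    return min(candidates, key=lambda b: (b.last_access_time, -b.num_blocks_before))
```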

Note that this eviction policy effectively implements the same policy as [RadixAttention](https://lmsys.org/blog/2024-01-17-sglang/) when applied to models with full attention, which prioritizes evicting leaf nodes in the prefix tree that have a reference count of zero and are least recently used.

However, hash-based KV cache management gives us the flexibility to handle more complicated serving scenarios and to implement more sophisticated eviction policies beyond the one above:

- Multi-LoRA serving. When serving requests for multiple LoRA adapters, we can simply let the hash of each KV block also include the LoRA ID the request is querying, enabling caching across all adapters (see the sketch below). In this way, we can jointly manage the KV blocks for different adapters, which simplifies the system implementation and improves the global cache hit rate and efficiency.
- Multi-modal models. When the user input includes more than just discrete tokens, we can use different hashing methods to handle the caching of inputs of different modalities, e.g. perceptual hashing for images to cache similar input images.
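
As an illustration of the first point (a hypothetical sketch, not vLLM's API), the extra identifiers can simply be folded into the block hash:

```python
def block_hash(prefix_and_block_tokens: tuple[int, ...],
               lora_id: int | None = None,
               modality_hashes: tuple[int, ...] = ()) -> int:
    # Including the LoRA ID keeps blocks computed under different adapters distinct,
    # while all adapters still share one global hash table. modality_hashes could
    # carry, e.g., a perceptual hash of an input image for multi-modal caching.
    return hash((prefix_and_block_tokens, lora_id, modality_hashes))
```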
@@ -92,7 +92,7 @@ def setup(app):
     "vllm._C",
     "PIL",
     "numpy",
-    'triton'
+    'triton',
     "tqdm",
     "tensorizer",
 ]
@@ -0,0 +1,42 @@
.. _debugging:

Debugging Tips
==============

Debugging hang/crash issues
---------------------------

When a vLLM instance hangs or crashes, it is very difficult to debug the issue. But before assuming a bug, consider that vLLM may simply be doing something that genuinely takes a long time:

- Downloading a model: is the model already downloaded to your disk? If not, vLLM will download it from the internet, which can take a long time. Be sure to check the internet connection. It is better to download the model first using the `huggingface cli <https://huggingface.co/docs/huggingface_hub/en/guides/cli>`_ and then pass the local path to the model, so that you can isolate the issue (see the snippet after this list).
- Loading the model from disk: if the model is large, loading it from disk can take a long time. Pay attention to where you store the model. Some clusters have filesystems shared across nodes, e.g. a distributed or network filesystem, which can be slow. It is better to store the model on a local disk. Also watch the CPU memory usage: when the model is too large, it may take a lot of CPU memory, which can slow down the operating system because it needs to frequently swap memory between disk and RAM.
- Tensor parallel inference: if the model is too large to fit in a single GPU, you might want to use tensor parallelism to split the model across multiple GPUs. In that case, every process reads the whole model and splits it into chunks, which makes the disk reading time even longer (proportional to the tensor parallel size). You can convert the model checkpoint to a sharded checkpoint using `the provided script <https://docs.vllm.ai/en/latest/getting_started/examples/save_sharded_state.html>`_ . The conversion process might take some time, but afterwards you can load the sharded checkpoint much faster, and the model loading time stays constant regardless of the tensor parallel size.
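
For the first point, a minimal way to isolate the download step (a sketch; ``snapshot_download`` comes from ``huggingface_hub``, and the model name is just an example):

.. code-block:: python

    # Download the model first, then pass the local path to vLLM, so that a slow or
    # failing download is not mistaken for a vLLM hang.
    from huggingface_hub import snapshot_download
    from vllm import LLM

    local_path = snapshot_download(repo_id="lmsys/longchat-13b-16k")  # example model
    llm = LLM(model=local_path)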

If you have already taken care of the above issues, but the vLLM instance still hangs with CPU and GPU utilization near zero, it is likely that the vLLM instance is stuck somewhere. Here are some tips to help debug the issue:

- Set the environment variable ``export VLLM_LOGGING_LEVEL=DEBUG`` to turn on more logging.
- Set the environment variable ``export CUDA_LAUNCH_BLOCKING=1`` to know exactly which CUDA kernel is causing the trouble.
- Set the environment variable ``export NCCL_DEBUG=TRACE`` to turn on more logging for NCCL.
- Set the environment variable ``export VLLM_TRACE_FUNCTION=1`` so that all function calls in vLLM are recorded. Inspect these log files to tell which function crashes or hangs. **Note: this generates a lot of logs and slows down the system. Only use it for debugging purposes.**

With more logging, hopefully you can find the root cause of the issue.

Here are some common issues that can cause hangs:

- The network setup is incorrect, so the vLLM instance cannot get the correct IP address. You will find a log line such as ``DEBUG 06-10 21:32:17 parallel_state.py:88] world_size=8 rank=0 local_rank=0 distributed_init_method=tcp://xxx.xxx.xxx.xxx:54641 backend=nccl``. The IP address should be the correct one. If it is not, override it by setting the environment variable ``export VLLM_HOST_IP=your_ip_address``.
- The hardware/driver setup is incorrect, so GPU communication cannot be established. You can run the sanity check script below to see whether GPU communication is working correctly.

.. code-block:: python

    # Save this as `test.py` and run it with `NCCL_DEBUG=TRACE torchrun --nproc-per-node=8 test.py`,
    # adjusting `--nproc-per-node` to the number of GPUs you want to use.
    import torch
    import torch.distributed as dist

    dist.init_process_group(backend="nccl")
    data = torch.FloatTensor([1,] * 128).to(f"cuda:{dist.get_rank()}")
    dist.all_reduce(data, op=dist.ReduceOp.SUM)
    torch.cuda.synchronize()
    value = data.mean().item()
    assert value == dist.get_world_size()

If the problem persists, feel free to open an `issue <https://github.com/vllm-project/vllm/issues/new/choose>`_ on GitHub, with a detailed description of the issue, your environment, and the logs.

Benchmark comments (auto-generated by github-action-benchmark for commit bebf6d3, suite `bigger_is_better`):

VLLM Engine throughput - synthetic; model NousResearch/Llama-2-7b-chat-hf; max_model_len 4096; benchmark_throughput with input-len 256, output-len 128, num-prompts 1000, all available GPUs; GPU: NVIDIA L4 x 1; vLLM 0.5.0; torch 2.3.0+cu121.

Python version | request_throughput (prompts/s) | token_throughput (tokens/s)
---|---|---
3.8.17 | 2.4378652438292274 | 936.1402536304234
3.9.17 | 2.422068682094795 | 930.0743739244014
3.10.12 | 2.4218898495262287 | 930.0057022180717
3.11.4 | 2.467894055397205 (vs 2.463883529113595, ratio 1.00) | 947.6713172725267 (vs 946.1312751796206, ratio 1.00)