This repository has been archived by the owner on Oct 11, 2024. It is now read-only.

Commit db9ed90
Upstream sync 2024 06 08 (#288) - ties to v0.4.3 of vllm-upstream

SUMMARY:
* Merge commits from vllm-project@f68470e to vllm-project@1197e02
* Our GCP test instances do not have `gcc` or `clang` installed. All of
the Triton kernels rely on `gcc` or `clang` to JIT-compile, so I have
disabled those tests for now, but we need to get a C compiler installed (cc
@andy-neuma). All are marked with:

```python
@pytest.mark.skip("C compiler not installed in NM automation. "
                  "This codepath follows a triton pathway, which "
                  "JITs using clang or gcc. Since neither are installed "
                  "in our test instances, we need to skip this for now.")
```
* Cherry-picked the changes associated with the FP8 weight format from
@mgoin (see the usage sketch below)
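
For context, a minimal sketch of how a checkpoint in that FP8 weight format might be served once these changes land. This is an illustrative assumption rather than part of the merge itself: the model name is a placeholder, and it assumes the FP8 path is reachable through the `quantization="fp8"` option of the `LLM` entrypoint.

```python
from vllm import LLM, SamplingParams

# Hypothetical usage: the model name below is a placeholder for a checkpoint
# whose weights are stored in the FP8 format cherry-picked above.
llm = LLM(model="org/llama-2-7b-fp8", quantization="fp8")

params = SamplingParams(temperature=0.0, max_tokens=32)
outputs = llm.generate(["Hello, my name is"], params)
print(outputs[0].outputs[0].text)
```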

Note that vllm-project@f68470e is NOT included in this merge.

COMPARE vs UPSTREAM:
* https://github.com/neuralmagic/nm-vllm/compare/upstream-sync-2024-06-08..vllm-project:vllm:v0.4.3

---------

Signed-off-by: kerthcet <[email protected]>
Signed-off-by: Muralidhar Andoorveedu <[email protected]>
Signed-off-by: pandyamarut <[email protected]>
Co-authored-by: Alexander Matveev <[email protected]>
Co-authored-by: Woosuk Kwon <[email protected]>
Co-authored-by: Cyrus Leung <[email protected]>
Co-authored-by: Wenwei Zhang <[email protected]>
Co-authored-by: Alexei-V-Ivanov-AMD <[email protected]>
Co-authored-by: Alexey Kondratiev <[email protected]>
Co-authored-by: Mor Zusman <[email protected]>
Co-authored-by: Mor Zusman <[email protected]>
Co-authored-by: Aurick Qiao <[email protected]>
Co-authored-by: Kuntai Du <[email protected]>
Co-authored-by: Antoni Baum <[email protected]>
Co-authored-by: HUANG Fei <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Co-authored-by: Simon Mo <[email protected]>
Co-authored-by: Michael Goin <[email protected]>
Co-authored-by: Kante Yin <[email protected]>
Co-authored-by: sasha0552 <[email protected]>
Co-authored-by: SangBin Cho <[email protected]>
Co-authored-by: Tyler Michael Smith <[email protected]>
Co-authored-by: Cody Yu <[email protected]>
Co-authored-by: raywanb <[email protected]>
Co-authored-by: Nick Hill <[email protected]>
Co-authored-by: Philipp Moritz <[email protected]>
Co-authored-by: Letian Li <[email protected]>
Co-authored-by: Murali Andoorveedu <[email protected]>
Co-authored-by: Dipika Sikka <[email protected]>
Co-authored-by: Varun Sundar Rabindranath <[email protected]>
Co-authored-by: Varun Sundar Rabindranath <[email protected]>
Co-authored-by: Elisei Smirnov <[email protected]>
Co-authored-by: Elisei Smirnov <[email protected]>
Co-authored-by: youkaichao <[email protected]>
Co-authored-by: leiwen83 <[email protected]>
Co-authored-by: Lei Wen <[email protected]>
Co-authored-by: Eric Xihui Lin <[email protected]>
Co-authored-by: beagleski <[email protected]>
Co-authored-by: bapatra <[email protected]>
Co-authored-by: Barun Patra <[email protected]>
Co-authored-by: Lily Liu <[email protected]>
Co-authored-by: Roger Wang <[email protected]>
Co-authored-by: Zhuohan Li <[email protected]>
Co-authored-by: Isotr0py <[email protected]>
Co-authored-by: Michał Moskal <[email protected]>
Co-authored-by: Ruth Evans <[email protected]>
Co-authored-by: Divakar Verma <[email protected]>
Co-authored-by: Roger Wang <[email protected]>
Co-authored-by: Junichi Sato <[email protected]>
Co-authored-by: Marut Pandya <[email protected]>
Co-authored-by: afeldman-nm <[email protected]>
Co-authored-by: Ronen Schaffer <[email protected]>
Co-authored-by: Itay Etelis <[email protected]>
Co-authored-by: omkar kakarparthi <[email protected]>
Co-authored-by: Alexei V. Ivanov <[email protected]>
Co-authored-by: Breno Faria <[email protected]>
Co-authored-by: Breno Faria <[email protected]>
Co-authored-by: Hyunsung Lee <[email protected]>
Co-authored-by: Chansung Park <[email protected]>
Co-authored-by: SnowDist <[email protected]>
Co-authored-by: functionxu123 <[email protected]>
Co-authored-by: xuhao <[email protected]>
Showing 290 changed files with 17,573 additions and 9,003 deletions.
2 changes: 1 addition & 1 deletion .buildkite/check-wheel-size.py
@@ -1,7 +1,7 @@
import os
import zipfile

-MAX_SIZE_MB = 150
+MAX_SIZE_MB = 200


def print_top_10_largest_files(zip_file):
28 changes: 28 additions & 0 deletions .buildkite/run-amd-test.sh
@@ -5,6 +5,34 @@ set -ex
echo "--- ROCm info"
rocminfo

# cleanup older docker images
cleanup_docker() {
# Get Docker's root directory
docker_root=$(docker info -f '{{.DockerRootDir}}')
if [ -z "$docker_root" ]; then
echo "Failed to determine Docker root directory."
exit 1
fi
echo "Docker root directory: $docker_root"
# Check disk usage of the filesystem where Docker's root directory is located
disk_usage=$(df "$docker_root" | tail -1 | awk '{print $5}' | sed 's/%//')
# Define the threshold
threshold=70
if [ "$disk_usage" -gt "$threshold" ]; then
echo "Disk usage is above $threshold%. Cleaning up Docker images and volumes..."
# Remove dangling images (those that are not tagged and not used by any container)
docker image prune -f
# Remove unused volumes
docker volume prune -f
echo "Docker images and volumes cleanup completed."
else
echo "Disk usage is below $threshold%. No cleanup needed."
fi
}

# Call the cleanup docker function
cleanup_docker

echo "--- Resetting GPUs"

echo "reset" > /opt/amdgpu/etc/gpu_state
2 changes: 1 addition & 1 deletion .buildkite/run-cpu-test.sh
@@ -11,4 +11,4 @@ trap remove_docker_container EXIT
remove_docker_container

# Run the image and launch offline inference
-docker run --network host --env VLLM_CPU_KVCACHE_SPACE=1 --name cpu-test cpu-test python3 examples/offline_inference.py
+docker run --network host --env VLLM_CPU_KVCACHE_SPACE=1 --name cpu-test cpu-test python3 vllm/examples/offline_inference.py
13 changes: 8 additions & 5 deletions .buildkite/test-pipeline.yaml
@@ -37,7 +37,6 @@ steps:
working_dir: "/vllm-workspace/tests"
num_gpus: 2
commands:
- pytest -v -s distributed/test_pynccl_library.py
- TEST_DIST_MODEL=facebook/opt-125m DISTRIBUTED_EXECUTOR_BACKEND=ray pytest -v -s distributed/test_basic_distributed_correctness.py
- TEST_DIST_MODEL=meta-llama/Llama-2-7b-hf DISTRIBUTED_EXECUTOR_BACKEND=ray pytest -v -s distributed/test_basic_distributed_correctness.py
- TEST_DIST_MODEL=facebook/opt-125m DISTRIBUTED_EXECUTOR_BACKEND=ray pytest -v -s distributed/test_chunked_prefill_distributed.py
@@ -60,11 +59,12 @@ steps:
command: pytest -v -s engine tokenization test_sequence.py test_config.py test_logger.py

- label: Entrypoints Test
#mirror_hardwares: [amd]
mirror_hardwares: [amd]

commands:
# these tests have to be separated, because each one will allocate all posible GPU memory
- pytest -v -s entrypoints --ignore=entrypoints/test_server_oot_registration.py
- pytest -v -s entrypoints/test_server_oot_registration.py
- pytest -v -s test_inputs.py
- pytest -v -s entrypoints -m llm
- pytest -v -s entrypoints -m openai

- label: Examples Test
working_dir: "/vllm-workspace/examples"
@@ -109,6 +109,9 @@ steps:
mirror_hardwares: [amd]
command: pytest -v -s test_logits_processor.py

- label: Utils Test
command: pytest -v -s test_utils.py

- label: Worker Test
mirror_hardwares: [amd]
command: pytest -v -s worker
26 changes: 26 additions & 0 deletions .clang-format
@@ -0,0 +1,26 @@
BasedOnStyle: Google
UseTab: Never
IndentWidth: 2
ColumnLimit: 80

# Force pointers to the type for C++.
DerivePointerAlignment: false
PointerAlignment: Left

# Reordering #include statements can (and currently will) introduce errors
SortIncludes: false

# Style choices
AlignConsecutiveAssignments: false
AlignConsecutiveDeclarations: false
IndentPPDirectives: BeforeHash

IncludeCategories:
- Regex: '^<'
Priority: 4
- Regex: '^"(llvm|llvm-c|clang|clang-c|mlir|mlir-c)/'
Priority: 3
- Regex: '^"(qoda|\.\.)/'
Priority: 2
- Regex: '.*'
Priority: 1
2 changes: 2 additions & 0 deletions .github/ISSUE_TEMPLATE/400-bug report.yml
@@ -59,6 +59,8 @@ body:
Please also paste or describe the results you observe instead of the expected results. If you observe an error, please paste the error message including the **full** traceback of the exception. It may be relevant to wrap error messages in ```` ```triple quotes blocks``` ````.
Please set the environment variable `export VLLM_LOGGING_LEVEL=DEBUG` to turn on more logging to help debugging potential issues.
If you experienced crashes or hangs, it would be helpful to run vllm with `export VLLM_TRACE_FUNCTION=1` . All the function calls in vllm will be recorded. Inspect these log files, and tell which function crashes or hangs.
placeholder: |
A clear and concise description of what the bug is.
42 changes: 42 additions & 0 deletions .github/workflows/clang-format.yml
@@ -0,0 +1,42 @@
name: clang-format

on:
# Trigger the workflow on push or pull request,
# but only for the main branch
push:
branches:
- main
pull_request:
branches:
- main

jobs:
clang-format:
runs-on: ubuntu-latest
strategy:
matrix:
python-version: ["3.11"]
steps:
- uses: actions/checkout@v2
- name: Set up Python ${{ matrix.python-version }}
uses: actions/setup-python@v2
with:
python-version: ${{ matrix.python-version }}
- name: Install dependencies
run: |
python -m pip install --upgrade pip
pip install clang-format==18.1.5
- name: Running clang-format
run: |
EXCLUDES=(
'csrc/moe/topk_softmax_kernels.cu'
'csrc/punica/bgmv/bgmv_bf16_bf16_bf16.cu'
'csrc/punica/bgmv/bgmv_config.h'
'csrc/punica/bgmv/bgmv_impl.cuh'
'csrc/punica/bgmv/vec_dtypes.cuh'
'csrc/punica/punica_ops.cu'
'csrc/punica/type_convert.h'
)
find csrc/ \( -name '*.h' -o -name '*.cpp' -o -name '*.cu' -o -name '*.cuh' \) -print \
| grep -vFf <(printf "%s\n" "${EXCLUDES[@]}") \
| xargs clang-format --dry-run --Werror
15 changes: 9 additions & 6 deletions CMakeLists.txt
@@ -167,6 +167,7 @@ set(VLLM_EXT_SRC
"csrc/layernorm_kernels.cu"
"csrc/quantization/squeezellm/quant_cuda_kernel.cu"
"csrc/quantization/gptq/q_gemm.cu"
"csrc/quantization/compressed_tensors/int8_quant_kernels.cu"
"csrc/quantization/fp8/common.cu"
"csrc/cuda_utils_kernels.cu"
"csrc/moe_align_block_size_kernels.cu"
@@ -176,7 +177,7 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
include(FetchContent)
SET(CUTLASS_ENABLE_HEADERS_ONLY=ON)
FetchContent_Declare(
cutlass
cutlass
GIT_REPOSITORY https://github.com/nvidia/cutlass.git
# CUTLASS 3.5.0
GIT_TAG 7d49e6c7e2f8896c47f586706e67e1fb215529dc
@@ -199,11 +200,13 @@ if(VLLM_GPU_LANG STREQUAL "CUDA")
# The CUTLASS kernels for Hopper require sm90a to be enabled.
# This is done via the below gencode option, BUT that creates kernels for both sm90 and sm90a.
# That adds an extra 17MB to compiled binary, so instead we selectively enable it.
-set_source_files_properties(
-"csrc/quantization/cutlass_w8a8/scaled_mm_dq_c3x.cu"
-PROPERTIES
-COMPILE_FLAGS
-"-gencode arch=compute_90a,code=sm_90a")
+if(${CMAKE_CUDA_COMPILER_VERSION} VERSION_GREATER 12.0)
+set_source_files_properties(
+"csrc/quantization/cutlass_w8a8/scaled_mm_dq_c3x.cu"
+PROPERTIES
+COMPILE_FLAGS
+"-gencode arch=compute_90a,code=sm_90a")
+endif()

endif()

2 changes: 2 additions & 0 deletions Dockerfile.cpu
@@ -17,4 +17,6 @@ RUN pip install -v -r requirements-cpu.txt --extra-index-url https://download.py

RUN VLLM_TARGET_DEVICE=cpu python3 setup.py install

WORKDIR /workspace/

CMD ["/bin/bash"]
8 changes: 6 additions & 2 deletions Dockerfile.rocm
@@ -92,19 +92,23 @@ RUN if [ "$BUILD_TRITON" = "1" ]; then \
WORKDIR /vllm-workspace
COPY . .

#RUN python3 -m pip install pynvml # to be removed eventually
RUN python3 -m pip install --upgrade pip numba

# make sure punica kernels are built (for LoRA)
ENV VLLM_INSTALL_PUNICA_KERNELS=1
# Workaround for ray >= 2.10.0
ENV RAY_EXPERIMENTAL_NOSET_ROCR_VISIBLE_DEVICES=1

ENV VLLM_NCCL_SO_PATH=/opt/rocm/lib/librccl.so

RUN --mount=type=cache,target=/root/.cache/pip \
pip install -U -r requirements-rocm.txt \
&& patch /opt/rocm/include/hip/amd_detail/amd_hip_bf16.h ./rocm_patch/rocm_bf16.patch \
&& python3 setup.py install \
&& cp build/lib.linux-x86_64-cpython-39/vllm/_C.cpython-39-x86_64-linux-gnu.so vllm/ \
&& cp build/lib.linux-x86_64-cpython-39/vllm/_punica_C.cpython-39-x86_64-linux-gnu.so vllm/ \
&& cd ..

RUN python3 -m pip install --upgrade pip
RUN python3 -m pip install --no-cache-dir ray[all]==2.9.3

CMD ["/bin/bash"]
10 changes: 6 additions & 4 deletions benchmarks/backend_request_func.py
Original file line number Diff line number Diff line change
@@ -1,7 +1,3 @@
# flake8: noqa
# UPSTREAM SYNC: noqa is required for passing ruff run on nm-automation
# This file has been modified by Neural Magic

import json
import os
import sys
@@ -93,6 +89,9 @@ async def async_request_tgi(
output.latency = most_recent_timestamp - st
output.success = True
output.generated_text = data["generated_text"]
else:
output.error = response.reason or ""
output.success = False
except Exception:
output.success = False
exc_info = sys.exc_info()
@@ -280,6 +279,9 @@ async def async_request_openai_completions(
output.generated_text = generated_text
output.success = True
output.latency = latency
else:
output.error = response.reason or ""
output.success = False
except Exception:
output.success = False
exc_info = sys.exc_info()
34 changes: 21 additions & 13 deletions benchmarks/benchmark_latency.py
@@ -3,13 +3,14 @@
import json
import time
from pathlib import Path
-from typing import Optional
+from typing import List, Optional

import numpy as np
import torch
from tqdm import tqdm

from vllm import LLM, SamplingParams
from vllm.inputs import PromptStrictInputs
from vllm.model_executor.layers.quantization import QUANTIZATION_METHODS


@@ -34,7 +35,8 @@ def main(args: argparse.Namespace):
use_v2_block_manager=args.use_v2_block_manager,
enable_chunked_prefill=args.enable_chunked_prefill,
download_dir=args.download_dir,
-block_size=args.block_size)
+block_size=args.block_size,
+gpu_memory_utilization=args.gpu_memory_utilization)

sampling_params = SamplingParams(
n=args.n,
@@ -48,7 +50,9 @@ def main(args: argparse.Namespace):
dummy_prompt_token_ids = np.random.randint(10000,
size=(args.batch_size,
args.input_len))
-dummy_prompt_token_ids = dummy_prompt_token_ids.tolist()
+dummy_inputs: List[PromptStrictInputs] = [{
+    "prompt_token_ids": batch
+} for batch in dummy_prompt_token_ids.tolist()]

def run_to_completion(profile_dir: Optional[str] = None):
if profile_dir:
@@ -59,13 +63,13 @@ def run_to_completion(profile_dir: Optional[str] = None):
],
on_trace_ready=torch.profiler.tensorboard_trace_handler(
str(profile_dir))) as p:
-llm.generate(prompt_token_ids=dummy_prompt_token_ids,
+llm.generate(dummy_inputs,
sampling_params=sampling_params,
use_tqdm=False)
print(p.key_averages())
else:
start_time = time.perf_counter()
-llm.generate(prompt_token_ids=dummy_prompt_token_ids,
+llm.generate(dummy_inputs,
sampling_params=sampling_params,
use_tqdm=False)
end_time = time.perf_counter()
@@ -153,15 +157,13 @@ def run_to_completion(profile_dir: Optional[str] = None):
action='store_true',
help='enforce eager mode and disable CUDA graph')
parser.add_argument(
-"--kv-cache-dtype",
+'--kv-cache-dtype',
type=str,
-choices=['auto', 'fp8'],
-default='auto',
-help=
-'Data type for kv cache storage. If "auto", will use model data type. '
-'FP8_E5M2 (without scaling) is only supported on cuda version greater '
-'than 11.8. On ROCm (AMD GPU), FP8_E4M3 is '
-'instead supported for common inference criteria.')
+choices=['auto', 'fp8', 'fp8_e5m2', 'fp8_e4m3'],
+default="auto",
+help='Data type for kv cache storage. If "auto", will use model '
+'data type. CUDA 11.8+ supports fp8 (=fp8_e4m3) and fp8_e5m2. '
+'ROCm (AMD GPU) supports fp8 (=fp8_e4m3)')
parser.add_argument(
'--quantization-param-path',
type=str,
@@ -213,5 +215,11 @@ def run_to_completion(profile_dir: Optional[str] = None):
type=str,
default=None,
help='Path to save the latency results in JSON format.')
parser.add_argument('--gpu-memory-utilization',
type=float,
default=0.9,
help='the fraction of GPU memory to be used for '
'the model executor, which can range from 0 to 1.'
'If unspecified, will use the default value of 0.9.')
args = parser.parse_args()
main(args)
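
The `benchmark_latency.py` changes above track the new prompt-inputs API in this sync: `LLM.generate` now takes a list of input dicts instead of a `prompt_token_ids=` keyword argument. A minimal sketch of the new call pattern, with placeholder token IDs and the `facebook/opt-125m` model reused from the test pipeline above:

```python
from typing import List

from vllm import LLM, SamplingParams
from vllm.inputs import PromptStrictInputs

# Small model used elsewhere in the test pipeline; any supported model works.
llm = LLM(model="facebook/opt-125m")
params = SamplingParams(max_tokens=16)

# New-style inputs: one dict per request, each carrying its prompt token IDs.
# The token IDs here are placeholders, not meaningful prompts.
inputs: List[PromptStrictInputs] = [
    {"prompt_token_ids": [1, 2, 3, 4]},
    {"prompt_token_ids": [5, 6, 7, 8]},
]

outputs = llm.generate(inputs, sampling_params=params)
for out in outputs:
    print(out.outputs[0].text)
```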