This repository has been archived by the owner on Oct 11, 2024. It is now read-only.

[Rel Eng] Upstream sync 2024 06 11 #298

Merged
merged 93 commits into from
Jun 11, 2024
Changes from all commits
Commits (93)
4b41095
[CI/Build] CMakeLists: build all extensions' cmake targets at the sam…
dtrifiro Jun 1, 2024
045812f
[Kernel] Refactor CUTLASS kernels to always take scales that reside o…
tlrmchlsmth Jun 1, 2024
db09745
[Kernel] Update Cutlass fp8 configs (#5144)
varun-sundar-rabindranath Jun 1, 2024
46b6b26
[Minor] Fix the path typo in loader.py: save_sharded_states.py -> sav…
dashanji Jun 1, 2024
5b5c2b9
[Bugfix] Fix call to init_logger in openai server (#4765)
NadavShmayo Jun 1, 2024
cb6b7a0
[Feature][Kernel] Support bitsandbytes quantization and QLoRA (#4776)
chenqianfzh Jun 1, 2024
9c2a759
[Bugfix] Remove deprecated @abstractproperty (#5174)
zhuohan123 Jun 1, 2024
fd82eff
[Bugfix]: Fix issues related to prefix caching example (#5177) (#5180)
Delviet Jun 1, 2024
5b6b8ed
[BugFix] Prevent `LLM.encode` for non-generation Models (#5184)
robertgshaw2-neuralmagic Jun 1, 2024
15650a3
Update test_ignore_eos (#4898)
simon-mo Jun 2, 2024
dc64b07
[Frontend][OpenAI] Support for returning max_model_len on /v1/models …
Avinash-Raj Jun 2, 2024
bfc6bc7
[Kernel][ROCm][AMD] enable fused topk_softmax kernel for moe layer (#…
divakar-amd Jun 2, 2024
5008643
[Misc] Simplify code and fix type annotations in `conftest.py` (#5118)
DarkLight1337 Jun 2, 2024
c070e44
[Core] Support image processor (#4197)
DarkLight1337 Jun 3, 2024
314398c
[Core] Remove unnecessary copies in flash attn backend (#5138)
Yard1 Jun 3, 2024
1ebb772
[Kernel] Pass a device pointer into the quantize kernel for the scale…
tlrmchlsmth Jun 3, 2024
48e8e3f
[CI/BUILD] enable intel queue for longer CPU tests (#4113)
zhouyuan Jun 3, 2024
a6f0725
[Misc]: Implement CPU/GPU swapping in BlockManagerV2 (#3834)
Kaiyang-Chen Jun 3, 2024
198d784
New CI template on AWS stack (#5110)
khluu Jun 3, 2024
1923dcb
[FRONTEND] OpenAI `tools` support named functions (#5032)
br3no Jun 3, 2024
fa0bba2
[Bugfix] Support `prompt_logprobs==0` (#5217)
toslunar Jun 4, 2024
d8b71e3
[Bugfix] Add warmup for prefix caching example (#5235)
zhuohan123 Jun 4, 2024
1d88071
[Kernel] Enhance MoE benchmarking & tuning script (#4921)
WoosukKwon Jun 4, 2024
7899055
[Bugfix]: During testing, use pytest monkeypatch for safely overridin…
afeldman-nm Jun 4, 2024
0e8a84d
[Bugfix] Fix torch.compile() error when using MultiprocessingGPUExecu…
zifeitong Jun 4, 2024
88368d3
[CI/Build] Add inputs tests (#5215)
DarkLight1337 Jun 4, 2024
756340a
[Bugfix] Fix a bug caused by pip install setuptools>=49.4.0 for CPU b…
DamonFool Jun 4, 2024
789553f
[Kernel] Add back batch size 1536 and 3072 to MoE tuning (#5242)
WoosukKwon Jun 4, 2024
c57b71e
[CI/Build] Simplify model loading for `HfRunner` (#5251)
DarkLight1337 Jun 4, 2024
14ec8df
[CI/Build] Reducing CPU CI execution time (#5241)
bigPYJ1151 Jun 4, 2024
3b6f9d6
[CI] mark AMD test as softfail to prevent blockage (#5256)
simon-mo Jun 4, 2024
06bcc97
[Misc] Add transformers version to collect_env.py (#5259)
mgoin Jun 4, 2024
c3a46dd
[Misc] update collect env (#5261)
youkaichao Jun 4, 2024
c6bcf66
[Bugfix] Fix prompt_logprobs when SamplingParams.detokenize is set to…
zifeitong Jun 5, 2024
f5d9197
[Misc] Add CustomOp interface for device portability (#5255)
WoosukKwon Jun 5, 2024
bbfee0c
[Misc] Fix docstring of get_attn_backend (#5271)
WoosukKwon Jun 5, 2024
47c1256
[Frontend] OpenAI API server: Add `add_special_tokens` to ChatComplet…
tomeras91 Jun 5, 2024
d619bd9
[CI] Add nightly benchmarks (#5260)
simon-mo Jun 5, 2024
2cf5911
[misc] benchmark_serving.py -- add ITL results and tweak TPOT results…
tlrmchlsmth Jun 5, 2024
8f5fafa
[Kernel] Add GPU architecture guards to the CUTLASS w8a8 kernels to r…
tlrmchlsmth Jun 5, 2024
0770930
[Model] Correct Mixtral FP8 checkpoint loading (#5231)
comaniac Jun 5, 2024
8310e34
[BugFix] Apply get_cached_tokenizer to the tokenizer setter of LLM (#…
DriverSong Jun 5, 2024
6e32dd4
[Kernel] Re-tune Mixtral MoE configurations for FP8 on H100 (#5238)
pcmoritz Jun 5, 2024
c2c62c8
[Docs] Add Sequoia as sponsors (#5287)
simon-mo Jun 5, 2024
ee3104b
[Speculative Decoding] Add `ProposerWorkerBase` abstract class (#5252)
njhill Jun 5, 2024
1680d99
[BugFix] Fix log message about default max model length (#5284)
njhill Jun 5, 2024
efb32e1
[Bugfix] Make EngineArgs use named arguments for config construction …
mgoin Jun 5, 2024
9a28c64
[Bugfix][Frontend/Core] Don't log exception when AsyncLLMEngine grace…
wuisawesome Jun 5, 2024
2b27f72
[Misc] Skip for logits_scale == 1.0 (#5291)
WoosukKwon Jun 5, 2024
54d2690
[Docs] Add Ray Summit CFP (#5295)
simon-mo Jun 5, 2024
cc2aaba
[CI] Disable flash_attn backend for spec decode (#5286)
simon-mo Jun 5, 2024
d72ae5b
[Frontend][Core] Update Outlines Integration from `FSM` to `Guide` (#…
br3no Jun 5, 2024
08fd788
[CI/Build] Update vision tests (#5307)
DarkLight1337 Jun 6, 2024
cbfd3d9
Bugfix: fix broken of download models from modelscope (#5233)
liuyhwangyh Jun 6, 2024
7bb7e9b
[Kernel] Retune Mixtral 8x22b configs for FP8 on H100 (#5294)
pcmoritz Jun 6, 2024
fbd60f3
[Frontend] enable passing multiple LoRA adapters at once to generate(…
mgoldey Jun 6, 2024
14a49c2
[Core] Avoid copying prompt/output tokens if no penalties are used (#…
Yard1 Jun 7, 2024
a60515d
[Core] Change LoRA embedding sharding to support loading methods (#5038)
Yard1 Jun 7, 2024
653a080
[Misc] Missing error message for custom ops import (#5282)
DamonFool Jun 7, 2024
219a385
[Feature][Frontend]: Add support for `stream_options` in `ChatComplet…
Etelis Jun 7, 2024
bd66622
[Misc][Utils] allow get_open_port to be called for multiple times (#5…
youkaichao Jun 7, 2024
ed99ec9
[Kernel] Switch fp8 layers to use the CUTLASS kernels (#5183)
tlrmchlsmth Jun 7, 2024
50520b4
Remove Ray health check (#4693)
Yard1 Jun 7, 2024
98744f9
Addition of lacked ignored_seq_groups in _schedule_chunked_prefill (#…
JamesLim-sy Jun 7, 2024
334e0a7
[Kernel] Dynamic Per-Token Activation Quantization (#5037)
dsikka Jun 7, 2024
17984a7
[Frontend] Add OpenAI Vision API Support (#5237)
ywang96 Jun 7, 2024
3da0119
[Misc] Remove unused cuda_utils.h in CPU backend (#5345)
DamonFool Jun 7, 2024
d65c3ab
fix DbrxFusedNormAttention missing cache_config (#5340)
Calvinnncy97 Jun 7, 2024
e349c2d
[Bug Fix] Fix the support check for FP8 CUTLASS (#5352)
cli99 Jun 8, 2024
4d5b699
[Misc] Add args for selecting distributed executor to benchmarks (#5335)
BKitor Jun 8, 2024
f12b636
[ROCm][AMD] Use pytorch sdpa math backend to do naive attention (#4965)
hongxiayang Jun 8, 2024
842974c
[CI/Test] improve robustness of test (hf_runner) (#5347)
youkaichao Jun 8, 2024
2a16c03
[CI/Test] improve robustness of test (vllm_runner) (#5357)
youkaichao Jun 8, 2024
f8fe956
[Misc][Breaking] Change FP8 checkpoint format from act_scale -> input…
mgoin Jun 8, 2024
550ed83
[Core][CUDA Graph] add output buffer for cudagraph (#5074)
youkaichao Jun 9, 2024
52a90dd
[mis][ci/test] fix flaky test in test_sharded_state_loader.py (#5361)
youkaichao Jun 9, 2024
d20586a
[Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custo…
bnellnm Jun 9, 2024
27e68e9
[Bugfix] Fix KeyError: 1 When Using LoRA adapters (#5164)
BlackBird-Coding Jun 9, 2024
8f865f6
[Misc] Update to comply with the new `compressed-tensors` config (#5350)
dsikka Jun 10, 2024
d3bd135
[Frontend][Misc] Enforce Pixel Values as Input Type for VLMs in API S…
ywang96 Jun 10, 2024
b21be06
[misc][typo] fix typo (#5372)
youkaichao Jun 10, 2024
1b41d11
[Misc] Improve error message when LoRA parsing fails (#5194)
DarkLight1337 Jun 10, 2024
f932e32
[Model] Initial support for LLaVA-NeXT (#4199)
DarkLight1337 Jun 10, 2024
e3f0b32
[Feature][Frontend]: Continued `stream_options` implementation also …
Etelis Jun 10, 2024
f8392d6
[Bugfix] Fix LLaVA-NeXT (#5380)
DarkLight1337 Jun 10, 2024
9d82433
[ci] Use small_cpu_queue for doc build (#5331)
khluu Jun 10, 2024
a9bd95b
[ci] Mount buildkite agent on Docker container to upload benchmark re…
khluu Jun 10, 2024
6823d9e
[Docs] Add Docs on Limitations of VLM Support (#5383)
ywang96 Jun 10, 2024
ca0ae3c
[Docs] Alphabetically sort sponsors (#5386)
WoosukKwon Jun 10, 2024
16be761
Bump version to v0.5.0 (#5384)
simon-mo Jun 10, 2024
1444822
format
robertgshaw2-neuralmagic Jun 11, 2024
2df326f
updated test model logprobs
robertgshaw2-neuralmagic Jun 11, 2024
446a144
format
robertgshaw2-neuralmagic Jun 11, 2024
26 changes: 26 additions & 0 deletions .buildkite/nightly-benchmarks/kickoff-pipeline.sh
@@ -0,0 +1,26 @@
#!/usr/bin/env bash

set -euo pipefail

# Install system packages
apt update
apt install -y curl jq

# Install minijinja for templating
curl -sSfL https://github.com/mitsuhiko/minijinja/releases/latest/download/minijinja-cli-installer.sh | sh
source $HOME/.cargo/env

# If BUILDKITE_PULL_REQUEST != "false", then we check the PR labels using curl and jq
if [ "$BUILDKITE_PULL_REQUEST" != "false" ]; then
PR_LABELS=$(curl -s "https://api.github.com/repos/vllm-project/vllm/pulls/$BUILDKITE_PULL_REQUEST" | jq -r '.labels[].name')

if [[ $PR_LABELS == *"perf-benchmarks"* ]]; then
echo "This PR has the 'perf-benchmarks' label. Proceeding with the nightly benchmarks."
else
echo "This PR does not have the 'perf-benchmarks' label. Skipping the nightly benchmarks."
exit 0
fi
fi

# Upload sample.yaml
buildkite-agent pipeline upload .buildkite/nightly-benchmarks/sample.yaml
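
The script above gates the nightly benchmarks on the perf-benchmarks PR label before uploading the Buildkite pipeline. A minimal sketch of exercising that label check locally, assuming curl and jq are installed and using an illustrative PR number:

# Hypothetical local dry-run of the label gate used in kickoff-pipeline.sh.
# The PR number is illustrative only.
PR_NUMBER=1234
PR_LABELS=$(curl -s "https://api.github.com/repos/vllm-project/vllm/pulls/$PR_NUMBER" | jq -r '.labels[].name')
if [[ $PR_LABELS == *"perf-benchmarks"* ]]; then
    echo "Label present: the nightly benchmarks would run."
else
    echo "Label absent: the nightly benchmarks would be skipped."
fi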
39 changes: 39 additions & 0 deletions .buildkite/nightly-benchmarks/sample.yaml
@@ -0,0 +1,39 @@
steps:
# NOTE(simon): You can create separate blocks for different jobs
- label: "A100: NVIDIA SMI"
agents:
queue: A100
plugins:
- kubernetes:
podSpec:
containers:
# - image: us-central1-docker.pkg.dev/vllm-405802/vllm-ci-test-repo/vllm-test:$BUILDKITE_COMMIT
# TODO(simon): check latest main branch or use the PR image.
- image: us-central1-docker.pkg.dev/vllm-405802/vllm-ci-test-repo/vllm-test:45c35f0d58f4508bf43bd6af1d3d0d0ec0c915e6
command:
- bash -c 'nvidia-smi && nvidia-smi topo -m && pwd && ls'
resources:
limits:
nvidia.com/gpu: 8
volumeMounts:
- name: devshm
mountPath: /dev/shm
nodeSelector:
nvidia.com/gpu.product: NVIDIA-A100-SXM4-80GB
volumes:
- name: devshm
emptyDir:
medium: Memory
# TODO(simon): bring H100 online
# - label: "H100: NVIDIA SMI"
# agents:
# queue: H100
# plugins:
# - docker#v5.11.0:
# image: us-central1-docker.pkg.dev/vllm-405802/vllm-ci-test-repo/vllm-test:45c35f0d58f4508bf43bd6af1d3d0d0ec0c915e6
# command:
# - bash -c 'nvidia-smi && nvidia-smi topo -m'
# propagate-environment: true
# ipc: host
# gpus: all

8 changes: 4 additions & 4 deletions .buildkite/run-benchmarks.sh
@@ -50,16 +50,16 @@ echo "### Serving Benchmarks" >> benchmark_results.md
sed -n '1p' benchmark_serving.txt >> benchmark_results.md # first line
echo "" >> benchmark_results.md
echo '```' >> benchmark_results.md
tail -n 20 benchmark_serving.txt >> benchmark_results.md # last 20 lines
tail -n 24 benchmark_serving.txt >> benchmark_results.md # last 24 lines
echo '```' >> benchmark_results.md

# if the agent binary is not found, skip uploading the results, exit 0
if [ ! -f /workspace/buildkite-agent ]; then
if [ ! -f buildkite-agent ]; then
exit 0
fi

# upload the results to buildkite
/workspace/buildkite-agent annotate --style "info" --context "benchmark-results" < benchmark_results.md
buildkite-agent annotate --style "info" --context "benchmark-results" < benchmark_results.md

# exit with the exit code of the benchmarks
if [ $bench_latency_exit_code -ne 0 ]; then
@@ -75,4 +75,4 @@ if [ $bench_serving_exit_code -ne 0 ]; then
fi

rm ShareGPT_V3_unfiltered_cleaned_split.json
/workspace/buildkite-agent artifact upload "*.json"
buildkite-agent artifact upload "*.json"
14 changes: 12 additions & 2 deletions .buildkite/run-cpu-test.sh
@@ -10,5 +10,15 @@ remove_docker_container() { docker rm -f cpu-test || true; }
trap remove_docker_container EXIT
remove_docker_container

# Run the image and launch offline inference
docker run --network host --env VLLM_CPU_KVCACHE_SPACE=1 --name cpu-test cpu-test python3 vllm/examples/offline_inference.py
# Run the image
docker run -itd -v ~/.cache/huggingface:/root/.cache/huggingface --cpuset-cpus=48-95 --cpuset-mems=1 --network host -e HF_TOKEN --env VLLM_CPU_KVCACHE_SPACE=4 --name cpu-test cpu-test

# offline inference
docker exec cpu-test bash -c "python3 examples/offline_inference.py"

# Run basic model test
docker exec cpu-test bash -c "cd tests;
pip install pytest Pillow protobuf
bash ../.buildkite/download-images.sh
cd ../
pytest -v -s tests/models --ignore=tests/models/test_llava.py --ignore=tests/models/test_embedding.py --ignore=tests/models/test_registry.py"
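
The updated run above pins the test container to cores 48-95 on NUMA node 1 and gives it the host's Hugging Face cache and token. A quick, hedged way to confirm the pinning took effect (commands are illustrative; exact output depends on the CI host):

# Spot-check the CPU/NUMA pinning of the running cpu-test container.
docker inspect --format '{{.HostConfig.CpusetCpus}} {{.HostConfig.CpusetMems}}' cpu-test
# expected: 48-95 1
docker exec cpu-test nproc
# expected: 48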
30 changes: 16 additions & 14 deletions .buildkite/test-pipeline.yaml
@@ -45,7 +45,8 @@ steps:
- TEST_DIST_MODEL=meta-llama/Llama-2-7b-hf DISTRIBUTED_EXECUTOR_BACKEND=mp pytest -v -s distributed/test_basic_distributed_correctness.py
- TEST_DIST_MODEL=facebook/opt-125m DISTRIBUTED_EXECUTOR_BACKEND=mp pytest -v -s distributed/test_chunked_prefill_distributed.py
- TEST_DIST_MODEL=meta-llama/Llama-2-7b-hf DISTRIBUTED_EXECUTOR_BACKEND=mp pytest -v -s distributed/test_chunked_prefill_distributed.py
- pytest -v -s spec_decode/e2e/test_integration_dist.py
- pytest -v -s spec_decode/e2e/test_integration_dist.py
- CUDA_VISIBLE_DEVICES=0,1 pytest -v -s test_sharded_state_loader.py

- label: Distributed Tests (Multiple Groups)
#mirror_hardwares: [amd]
@@ -62,7 +63,6 @@ steps:
mirror_hardwares: [amd]

commands:
- pytest -v -s test_inputs.py
- pytest -v -s entrypoints -m llm
- pytest -v -s entrypoints -m openai

@@ -79,6 +79,13 @@ steps:
- python3 llava_example.py
- python3 tensorize_vllm_model.py --model facebook/opt-125m serialize --serialized-directory /tmp/ --suffix v1 && python3 tensorize_vllm_model.py --model facebook/opt-125m deserialize --path-to-tensors /tmp/vllm/facebook/opt-125m/v1/model.tensors

- label: Inputs Test
#mirror_hardwares: [amd]
commands:
- bash ../.buildkite/download-images.sh
- pytest -v -s test_inputs.py
- pytest -v -s multimodal

- label: Kernels Test %N
#mirror_hardwares: [amd]
command: pytest -v -s kernels --shard-id=$$BUILDKITE_PARALLEL_JOB --num-shards=$$BUILDKITE_PARALLEL_JOB_COUNT
@@ -87,14 +94,13 @@ steps:
- label: Models Test
#mirror_hardwares: [amd]
commands:
- bash ../.buildkite/download-images.sh
- pytest -v -s models --ignore=models/test_llava.py
- pytest -v -s models -m \"not llava\"

- label: Llava Test
mirror_hardwares: [amd]
commands:
- bash ../.buildkite/download-images.sh
- pytest -v -s models/test_llava.py
- pytest -v -s models -m llava

- label: Prefix Caching Test
mirror_hardwares: [amd]
@@ -118,7 +124,10 @@ steps:

- label: Speculative decoding tests
#mirror_hardwares: [amd]
command: pytest -v -s spec_decode
commands:
# See https://github.com/vllm-project/vllm/issues/5152
- export VLLM_ATTENTION_BACKEND=XFORMERS
- pytest -v -s spec_decode

- label: LoRA Test %N
#mirror_hardwares: [amd]
@@ -130,14 +139,7 @@ steps:
num_gpus: 4
# This test runs llama 13B, so it is required to run on 4 GPUs.
commands:
# Temporarily run this way because we cannot clean up GPU mem usage
# for multi GPU tests.
# TODO(sang): Fix it.
- pytest -v -s lora/test_long_context.py::test_rotary_emb_replaced
- pytest -v -s lora/test_long_context.py::test_batched_rope_kernel
- pytest -v -s lora/test_long_context.py::test_self_consistency
- pytest -v -s lora/test_long_context.py::test_quality
- pytest -v -s lora/test_long_context.py::test_max_len
- pytest -v -s -x lora/test_long_context.py

- label: Tensorizer Test
#mirror_hardwares: [amd]
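
One note on the pipeline changes above: the speculative decoding step now exports VLLM_ATTENTION_BACKEND=XFORMERS before invoking pytest, per the flash_attn issue linked in the step (vllm-project/vllm#5152). A hedged sketch of reproducing that step locally, assuming a GPU machine with vLLM installed and a checkout's tests/ directory as the working directory:

# Illustrative local reproduction of the "Speculative decoding tests" step.
export VLLM_ATTENTION_BACKEND=XFORMERS
pytest -v -s spec_decode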
64 changes: 64 additions & 0 deletions .buildkite/test-template-aws.j2
@@ -0,0 +1,64 @@
{% set docker_image = "public.ecr.aws/q9t5s3a7/vllm-ci-test-repo:$BUILDKITE_COMMIT" %}
{% set default_working_dir = "/vllm-workspace/tests" %}

steps:
- label: ":docker: build image"
agents:
queue: cpu_queue
commands:
- "aws ecr-public get-login-password --region us-east-1 | docker login --username AWS --password-stdin public.ecr.aws/q9t5s3a7"
- "docker build --build-arg max_jobs=16 --tag {{ docker_image }} --target test --progress plain ."
- "docker push {{ docker_image }}"
env:
DOCKER_BUILDKIT: "1"
retry:
automatic:
- exit_status: -1 # Agent was lost
limit: 5
- exit_status: -10 # Agent was lost
limit: 5
- wait

{% for step in steps %}
- label: "{{ step.label }}"
agents:
{% if step.label == "Documentation Build" %}
queue: small_cpu_queue
{% elif step.no_gpu %}
queue: cpu_queue
{% elif step.num_gpus == 2 or step.num_gpus == 4 %}
queue: gpu_4_queue
{% else %}
queue: gpu_1_queue
{% endif %}
soft_fail: true
{% if step.parallelism %}
parallelism: {{ step.parallelism }}
{% endif %}
retry:
automatic:
- exit_status: -1 # Agent was lost
limit: 5
- exit_status: -10 # Agent was lost
limit: 5
plugins:
- docker#v5.2.0:
image: {{ docker_image }}
always-pull: true
propagate-environment: true
{% if not step.no_gpu %}
gpus: all
{% endif %}
{% if step.label == "Benchmarks" %}
mount-buildkite-agent: true
{% endif %}
command: ["bash", "-c", "cd {{ (step.working_dir or default_working_dir) | safe }} && {{ step.command or (step.commands | join(' && ')) | safe }}"]
environment:
- VLLM_USAGE_SOURCE=ci-test
- HF_TOKEN
{% if step.label == "Speculative decoding tests" %}
- VLLM_ATTENTION_BACKEND=XFORMERS
{% endif %}
volumes:
- /dev/shm:/dev/shm
{% endfor %}
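
The template above consumes the steps list from .buildkite/test-pipeline.yaml and emits the final Buildkite pipeline, routing each step to small_cpu_queue (doc build), cpu_queue (no-GPU steps), gpu_4_queue (2- or 4-GPU steps), or gpu_1_queue otherwise. A hedged sketch of rendering and uploading it by hand; the exact invocation is an assumption, with minijinja-cli being the templating tool installed by kickoff-pipeline.sh:

# Illustrative manual render of the AWS template, using the step definitions
# in test-pipeline.yaml as the data file, followed by a pipeline upload.
minijinja-cli .buildkite/test-template-aws.j2 .buildkite/test-pipeline.yaml \
    > /tmp/pipeline.yaml
buildkite-agent pipeline upload /tmp/pipeline.yaml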
7 changes: 5 additions & 2 deletions .buildkite/test-template.j2
@@ -4,7 +4,7 @@

steps:
- label: ":docker: build image"
commands:
commands:
- "docker build --build-arg max_jobs=16 --tag {{ docker_image }} --target test --progress plain ."
- "docker push {{ docker_image }}"
env:
@@ -28,6 +28,7 @@ steps:
command: bash .buildkite/run-amd-test.sh "cd {{ (step.working_dir or default_working_dir) | safe }} ; {{ step.command or (step.commands | join(" ; ")) | safe }}"
env:
DOCKER_BUILDKIT: "1"
soft_fail: true
{% endif %}
{% endfor %}

@@ -36,10 +37,12 @@ steps:
agents:
queue: neuron
command: bash .buildkite/run-neuron-test.sh
soft_fail: true
soft_fail: false

- label: "Intel Test"
depends_on: ~
agents:
queue: intel
command: bash .buildkite/run-cpu-test.sh

{% for step in steps %}
1 change: 1 addition & 0 deletions .github/workflows/mypy.yaml
@@ -37,6 +37,7 @@ jobs:
mypy vllm/distributed --config-file pyproject.toml
mypy vllm/entrypoints --config-file pyproject.toml
mypy vllm/executor --config-file pyproject.toml
mypy vllm/multimodal --config-file pyproject.toml
mypy vllm/usage --config-file pyproject.toml
mypy vllm/*.py --config-file pyproject.toml
mypy vllm/transformers_utils --config-file pyproject.toml
30 changes: 9 additions & 21 deletions CMakeLists.txt
@@ -66,19 +66,6 @@ endif()
#
find_package(Torch REQUIRED)

#
# Normally `torch.utils.cpp_extension.CUDAExtension` would add
# `libtorch_python.so` for linking against an extension. Torch's cmake
# configuration does not include this library (presumably since the cmake
# config is used for standalone C++ binaries that link against torch).
# The `libtorch_python.so` library defines some of the glue code between
# torch/python via pybind and is required by VLLM extensions for this
# reason. So, add it by manually with `find_library` using torch's
# installed library path.
#
find_library(torch_python_LIBRARY torch_python PATHS
"${TORCH_INSTALL_PREFIX}/lib")

#
# Forward the non-CUDA device extensions to external CMake scripts.
#
@@ -171,7 +158,7 @@ set(VLLM_EXT_SRC
"csrc/quantization/fp8/common.cu"
"csrc/cuda_utils_kernels.cu"
"csrc/moe_align_block_size_kernels.cu"
"csrc/pybind.cpp")
"csrc/torch_bindings.cpp")

if(VLLM_GPU_LANG STREQUAL "CUDA")
include(FetchContent)
@@ -218,14 +205,15 @@ define_gpu_extension_target(
COMPILE_FLAGS ${VLLM_GPU_FLAGS}
ARCHITECTURES ${VLLM_GPU_ARCHES}
INCLUDE_DIRECTORIES ${CUTLASS_INCLUDE_DIR};${CUTLASS_TOOLS_UTIL_INCLUDE_DIR}
USE_SABI 3
WITH_SOABI)

#
# _moe_C extension
#

set(VLLM_MOE_EXT_SRC
"csrc/moe/moe_ops.cpp"
"csrc/moe/torch_bindings.cpp"
"csrc/moe/topk_softmax_kernels.cu")

define_gpu_extension_target(
@@ -235,6 +223,7 @@ define_gpu_extension_target(
SOURCES ${VLLM_MOE_EXT_SRC}
COMPILE_FLAGS ${VLLM_GPU_FLAGS}
ARCHITECTURES ${VLLM_GPU_ARCHES}
USE_SABI 3
WITH_SOABI)

#
Expand All @@ -249,7 +238,7 @@ set(VLLM_PUNICA_EXT_SRC
"csrc/punica/bgmv/bgmv_fp32_bf16_bf16.cu"
"csrc/punica/bgmv/bgmv_fp32_fp16_fp16.cu"
"csrc/punica/punica_ops.cu"
"csrc/punica/punica_pybind.cpp")
"csrc/punica/torch_bindings.cpp")

#
# Copy GPU compilation flags+update for punica
@@ -286,6 +275,7 @@ if (VLLM_PUNICA_GPU_ARCHES)
SOURCES ${VLLM_PUNICA_EXT_SRC}
COMPILE_FLAGS ${VLLM_PUNICA_GPU_FLAGS}
ARCHITECTURES ${VLLM_PUNICA_GPU_ARCHES}
USE_SABI 3
WITH_SOABI)
else()
message(WARNING "Unable to create _punica_C target because none of the "
@@ -311,6 +301,9 @@ if(VLLM_GPU_LANG STREQUAL "CUDA" OR VLLM_GPU_LANG STREQUAL "HIP")
message(STATUS "Enabling C extension.")
add_dependencies(default _C)

message(STATUS "Enabling moe extension.")
add_dependencies(default _moe_C)

# Enable punica if -DVLLM_INSTALL_PUNICA_KERNELS=ON or
# VLLM_INSTALL_PUNICA_KERNELS is set in the environment and
# there are supported target arches.
@@ -320,8 +313,3 @@ if(VLLM_GPU_LANG STREQUAL "CUDA" OR VLLM_GPU_LANG STREQUAL "HIP")
add_dependencies(default _punica_C)
endif()
endif()

if(VLLM_GPU_LANG STREQUAL "CUDA")
message(STATUS "Enabling moe extension.")
add_dependencies(default _moe_C)
endif()
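
Net effect of the CMake changes above: the extension sources move from pybind.cpp, moe_ops.cpp and punica_pybind.cpp to torch_bindings.cpp files built with USE_SABI 3, matching the TORCH_LIBRARY migration in commit d20586a, and _moe_C is now a dependency of the default target for both CUDA and HIP rather than CUDA only. A hedged sketch of driving the build directly with CMake; in practice setup.py supplies these options, and VLLM_PYTHON_EXECUTABLE plus the job count are assumptions here:

# Illustrative out-of-source configure and build of the `default` target,
# which now covers _C, _moe_C and, where supported, _punica_C.
cmake -S . -B build \
    -DVLLM_PYTHON_EXECUTABLE=$(which python3) \
    -DVLLM_INSTALL_PUNICA_KERNELS=ON
cmake --build build --target default -j 16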