[Rel Eng] Upstream sync 2024 06 11 #298

robertgshaw2-neuralmagic · 2024-06-11T01:38:31Z

Upstream sync 2024 06 11 (#288)

SUMMARY:

Merge commits from vllm-project@1197e02 to vllm-project@114332b.
Our GCP test instances do not have gcc or clang installed, and all of the Triton kernels rely on gcc or clang for JIT compilation. The tests that exercise these kernels are therefore still disabled (cc @andy-neuma). All are marked with:

@pytest.mark.skip("C compiler not installed in NM automation. "
                  "This codepath follows a triton pathway, which "
                  "JITs using clang or gcc. Since neither are installed "
                  "in our test instances, we need to skip this for now.")

Note that vllm-project@1197e02 is NOT included in this merge.

COMPARE vs UPSTREAM:

https://github.com/neuralmagic/nm-vllm/compare/upstream-sync-2024-06-11..vllm-project:vllm:v0.5.0

andy-neuma

cool

.buildkite/test-pipeline.yaml

andy-neuma

thanks.

dtrifiro and others added 30 commits June 11, 2024 01:17

[CI/Build] CMakeLists: build all extensions' cmake targets at the same time (vllm-project#5034)

4b41095

[Kernel] Refactor CUTLASS kernels to always take scales that reside on the GPU (vllm-project#5137)

045812f

[Kernel] Update Cutlass fp8 configs (vllm-project#5144)

db09745

Co-authored-by: Varun Sundar Rabindranath <[email protected]> Co-authored-by: Robert Shaw <[email protected]>

[Minor] Fix the path typo in loader.py: save_sharded_states.py -> save_sharded_state.py (vllm-project#5151)

46b6b26

Signed-off-by: Ye Cao <[email protected]>

[Bugfix] Fix call to init_logger in openai server (vllm-project#4765)

5b5c2b9

[Feature][Kernel] Support bitsandbytes quantization and QLoRA (vllm-project#4776)

cb6b7a0

[Bugfix] Remove deprecated @abstractproperty (vllm-project#5174)

9c2a759

[Bugfix]: Fix issues related to prefix caching example (vllm-project#5177) (vllm-project#5180)

fd82eff

[BugFix] Prevent LLM.encode for non-generation Models (vllm-project#5184)

5b6b8ed

Co-authored-by: mgoin <[email protected]>

Update test_ignore_eos (vllm-project#4898)

15650a3

[Frontend][OpenAI] Support for returning max_model_len on /v1/models response (vllm-project#4643)

dc64b07

[Kernel][ROCm][AMD] enable fused topk_softmax kernel for moe layer (vllm-project#4927)

bfc6bc7

This PR enables the fused topk_softmax kernel used in moe layer for HIP

[Misc] Simplify code and fix type annotations in conftest.py (vllm-project#5118)

5008643

[Core] Support image processor (vllm-project#4197)

c070e44

[Core] Remove unnecessary copies in flash attn backend (vllm-project#5138)

314398c

[Kernel] Pass a device pointer into the quantize kernel for the scales (vllm-project#5159)

1ebb772

[CI/BUILD] enable intel queue for longer CPU tests (vllm-project#4113)

48e8e3f

[Misc]: Implement CPU/GPU swapping in BlockManagerV2 (vllm-project#3834)

a6f0725

New CI template on AWS stack (vllm-project#5110)

198d784

Signed-off-by: kevin <[email protected]>

[FRONTEND] OpenAI tools support named functions (vllm-project#5032)

1923dcb

[Bugfix] Support prompt_logprobs==0 (vllm-project#5217)

fa0bba2

[Bugfix] Add warmup for prefix caching example (vllm-project#5235)

d8b71e3

[Kernel] Enhance MoE benchmarking & tuning script (vllm-project#4921)

1d88071

[Bugfix]: During testing, use pytest monkeypatch for safely overriding the env var that indicates the vLLM backend (vllm-project#5210)

7899055

[Bugfix] Fix torch.compile() error when using MultiprocessingGPUExecutor (vllm-project#5229)

0e8a84d

[CI/Build] Add inputs tests (vllm-project#5215)

88368d3

[Bugfix] Fix a bug caused by pip install setuptools>=49.4.0 for CPU backend (vllm-project#5249)

756340a

[Kernel] Add back batch size 1536 and 3072 to MoE tuning (vllm-project#5242)

789553f

[CI/Build] Simplify model loading for HfRunner (vllm-project#5251)

c57b71e

[CI/Build] Reducing CPU CI execution time (vllm-project#5241)

14ec8df

mgoin and others added 19 commits June 11, 2024 01:31

[Misc][Breaking] Change FP8 checkpoint format from act_scale -> input_scale (vllm-project#5353)

f8fe956

[Core][CUDA Graph] add output buffer for cudagraph (vllm-project#5074)

550ed83

[Core][CUDA Graph] add output buffer for cudagraph to reduce memory footprint (vllm-project#5074)

[mis][ci/test] fix flaky test in test_sharded_state_loader.py (vllm-project#5361)

52a90dd

[mis][ci/test] fix flaky test in tests/test_sharded_state_loader.py (vllm-project#5361)

[Kernel][Misc] Use TORCH_LIBRARY instead of PYBIND11_MODULE for custom ops (vllm-project#5047)

d20586a

[Bugfix] Fix KeyError: 1 When Using LoRA adapters (vllm-project#5164)

27e68e9

[Misc] Update to comply with the new compressed-tensors config (vllm-project#5350)

8f865f6

Co-authored-by: Michael Goin <[email protected]>

[Frontend][Misc] Enforce Pixel Values as Input Type for VLMs in API Server (vllm-project#5374)

d3bd135

[misc][typo] fix typo (vllm-project#5372)

b21be06

[Misc] Improve error message when LoRA parsing fails (vllm-project#5194)

1b41d11

[Model] Initial support for LLaVA-NeXT (vllm-project#4199)

f932e32

Co-authored-by: Roger Wang <[email protected]>

[Feature][Frontend]: Continued stream_options implementation also in CompletionRequest (vllm-project#5319)

e3f0b32

[Bugfix] Fix LLaVA-NeXT (vllm-project#5380)

f8392d6

[ci] Use small_cpu_queue for doc build (vllm-project#5331)

9d82433

Signed-off-by: kevin <[email protected]>

[ci] Mount buildkite agent on Docker container to upload benchmark results (vllm-project#5330)

a9bd95b

Signed-off-by: kevin <[email protected]>

[Docs] Add Docs on Limitations of VLM Support (vllm-project#5383)

6823d9e

[Docs] Alphabetically sort sponsors (vllm-project#5386)

ca0ae3c

Bump version to v0.5.0 (vllm-project#5384)

16be761

format

1444822

updated test model logprobs

2df326f

robertgshaw2-neuralmagic requested review from andy-neuma, mgoin and dhuangnm June 11, 2024 15:29

format

446a144

andy-neuma reviewed Jun 11, 2024

.buildkite/test-pipeline.yaml (resolved)

andy-neuma approved these changes Jun 11, 2024

robertgshaw2-neuralmagic merged commit b9fd1d5 into main Jun 11, 2024
37 checks passed

robertgshaw2-neuralmagic deleted the upstream-sync-2024-06-11 branch June 11, 2024 20:01
