Add sglang example #92

Open
wants to merge 8 commits into
base: main

phatvo9
Contributor

@phatvo9 phatvo9 commented Nov 14, 2024

No description provided.

@phatvo9 phatvo9 requested a review from luv-bansal November 14, 2024 10:13
Contributor

@luv-bansal luv-bansal left a comment

Mostly the same suggestions as on the lmdeploy PR.

Comment on lines 2 to 4
import subprocess
import sys
import threading
Contributor

These imports are not used in model.py and can be removed.

orjson
python-multipart

--extra-index-url https://flashinfer.ai/whl/cu121/torch2.4/
Contributor

In prod and dev we have CUDA 12.4. I'm not sure this cu121 wheel index works with it; this needs to be verified.

Contributor

@luv-bansal luv-bansal Nov 18, 2024

But I tested on q22, which also has CUDA 12.4, and prediction was successful, so I don't think this will be an issue.

Contributor

@luv-bansal luv-bansal Nov 18, 2024

I used the requirements below, with pinned dependency versions, to test locally and it worked. I think it's better to include the requirements with their versions here: earlier I was getting an error when I didn't specify dependency versions, though I don't know why.

torch==2.4.0
tokenizers==0.20.2
transformers==4.46.2
accelerate==0.34.2
scipy==1.10.1
optimum==1.23.3
xformers==0.0.27.post2
einops==0.8.0
requests==2.32.2
packaging
ninja
protobuf==3.20.0

sglang[all]==0.3.5.post2
orjson==3.10.11
python-multipart==0.0.17

--extra-index-url https://flashinfer.ai/whl/cu121/torch2.4/
flashinfer
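
One way to confirm that an environment actually matches the pinned versions above is to compare the pins against what is installed. The following is only a minimal sketch: `parse_pins` and `find_mismatches` are hypothetical helper names, not part of this PR, and the parser handles only plain `name==version` lines (it skips pip options and unpinned names such as `packaging`).

```python
from importlib import metadata


def parse_pins(requirements_text):
    """Return {package: version} for every `name==version` line."""
    pins = {}
    for raw in requirements_text.splitlines():
        line = raw.split("#", 1)[0].strip()
        # Skip blanks, pip options such as --extra-index-url, and unpinned names.
        if not line or line.startswith("-") or "==" not in line:
            continue
        name, version = line.split("==", 1)
        # Drop extras such as the "[all]" in "sglang[all]".
        pins[name.split("[", 1)[0].strip().lower()] = version.strip()
    return pins


def find_mismatches(pins, installed_version=metadata.version):
    """Return {package: (pinned, installed_or_None)} for packages that differ."""
    mismatches = {}
    for name, pinned in pins.items():
        try:
            installed = installed_version(name)
        except metadata.PackageNotFoundError:
            installed = None  # package is not installed at all
        if installed != pinned:
            mismatches[name] = (pinned, installed)
    return mismatches
```

Calling `find_mismatches(parse_pins(open("requirements.txt").read()))` inside the target environment would list any package whose installed version differs from its pin; the file name here is an assumption.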

@luv-bansal
Contributor

luv-bansal commented Nov 18, 2024

@phatvo9 I uploaded the model to prod. The upload succeeded, but predictions are failing. The prod logs show the following:

[2024-11-18 14:10:41 TP0] Traceback (most recent call last):
  File "/venv/lib/python3.10/site-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 171, in __init__
    self.capture()
  File "/venv/lib/python3.10/site-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 221, in capture
    ) = self.capture_one_batch_size(bs, forward)
  File "/venv/lib/python3.10/site-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 243, in capture_one_batch_size
    self.model_runner.attn_backend.init_forward_metadata_capture_cuda_graph(
  File "/venv/lib/python3.10/site-packages/sglang/srt/layers/attention/flashinfer_backend.py", line 187, in init_forward_metadata_capture_cuda_graph
    self.indices_updater_decode.update(
  File "/venv/lib/python3.10/site-packages/sglang/srt/layers/attention/flashinfer_backend.py", line 352, in update_single_wrapper
    self.call_begin_forward(
  File "/venv/lib/python3.10/site-packages/sglang/srt/layers/attention/flashinfer_backend.py", line 441, in call_begin_forward
    create_flashinfer_kv_indices_triton[(bs,)](
  File "/venv/lib/python3.10/site-packages/triton/runtime/jit.py", line 345, in <lambda>
    return lambda *args, **kwargs: self.run(grid=grid, warmup=False, *args, **kwargs)
  File "/venv/lib/python3.10/site-packages/triton/runtime/jit.py", line 607, in run
    device = driver.active.get_current_device()
  File "/venv/lib/python3.10/site-packages/triton/runtime/driver.py", line 23, in __getattr__
    self._initialize_obj()
  File "/venv/lib/python3.10/site-packages/triton/runtime/driver.py", line 20, in _initialize_obj
    self._obj = self._init_fn()
  File "/venv/lib/python3.10/site-packages/triton/runtime/driver.py", line 9, in _create_driver
    return actives[0]()
  File "/venv/lib/python3.10/site-packages/triton/backends/nvidia/driver.py", line 371, in __init__
    self.utils = CudaUtils()  # TODO: make static
  File "/venv/lib/python3.10/site-packages/triton/backends/nvidia/driver.py", line 80, in __init__
    mod = compile_module_from_src(Path(os.path.join(dirname, "driver.c")).read_text(), "cuda_utils")
  File "/venv/lib/python3.10/site-packages/triton/backends/nvidia/driver.py", line 57, in compile_module_from_src
    so = _build(name, src_path, tmpdir, library_dirs(), include_dir, libraries)
  File "/venv/lib/python3.10/site-packages/triton/runtime/build.py", line 32, in _build
    raise RuntimeError("Failed to find C compiler. Please specify via CC environment variable.")
RuntimeError: Failed to find C compiler. Please specify via CC environment variable.

During handling of the above exception, another exception occurred:

Traceback (most recent call last):
  File "/venv/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 1254, in run_scheduler_process
    scheduler = Scheduler(server_args, port_args, gpu_id, tp_rank, dp_rank)
  File "/venv/lib/python3.10/site-packages/sglang/srt/managers/scheduler.py", line 169, in __init__
    self.tp_worker = TpWorkerClass(
  File "/venv/lib/python3.10/site-packages/sglang/srt/managers/tp_worker.py", line 55, in __init__
    self.model_runner = ModelRunner(
  File "/venv/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 161, in __init__
    self.init_cuda_graphs()
  File "/venv/lib/python3.10/site-packages/sglang/srt/model_executor/model_runner.py", line 552, in init_cuda_graphs
    self.cuda_graph_runner = CudaGraphRunner(self)
  File "/venv/lib/python3.10/site-packages/sglang/srt/model_executor/cuda_graph_runner.py", line 173, in __init__
    raise Exception(
Exception: Capture cuda graph failed: Failed to find C compiler. Please specify via CC environment variable.
Possible solutions:
1. disable cuda graph by --disable-cuda-graph
2. set --mem-fraction-static to a smaller value (e.g., 0.8 or 0.7)
3. disable torch compile by not using --enable-torch-compile
Open an issue on GitHub https://github.com/sgl-project/sglang/issues/new/choose


W1118 14:10:41.581000 139900693243584 torch/_inductor/compile_worker/subproc_pool.py:126] SubprocPool unclean exit
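
The root error in the log above is Triton failing to find a C compiler, and the message itself points at the `CC` environment variable. A preflight check run before the server starts could surface this earlier and more clearly. The sketch below is an assumption-laden illustration: the helper names are hypothetical, and the candidate list (`gcc`, `clang`, `cc`) is a guess at what is reasonable to look for, not a copy of Triton's own lookup. The log also suggests `--disable-cuda-graph`, but since the failure comes from Triton compiling its driver helper, it is not certain that flag alone avoids the problem.

```python
import os
import shutil


def find_c_compiler(environ=None, which=shutil.which):
    """Return a C compiler path or name, or None if none is visible."""
    environ = os.environ if environ is None else environ
    # An explicit CC setting takes priority, as the error message suggests.
    cc = environ.get("CC")
    if cc:
        return cc
    # Assumed candidate names; adjust to whatever the image provides.
    for candidate in ("gcc", "clang", "cc"):
        path = which(candidate)
        if path:
            return path
    return None


def require_c_compiler(environ=None, which=shutil.which):
    """Fail fast with a clear message instead of deep inside CUDA graph capture."""
    cc = find_c_compiler(environ, which)
    if cc is None:
        raise RuntimeError(
            "No C compiler found: install gcc or clang in the image, or set CC."
        )
    return cc
```

Calling `require_c_compiler()` at the top of the model's startup code would make a missing compiler show up as a single readable error rather than a nested traceback.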


inference_compute_info:
  cpu_limit: "4"
  cpu_memory: "24Gi"
Contributor

cpu_memory needs to be reduced, because at most 16Gi is available.
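
A request like this can be checked mechanically before upload. The sketch below compares a requested memory quantity against an available limit; the function names are hypothetical, the 16Gi limit is taken from the comment above, and the parser is deliberately minimal (integer values with binary suffixes such as `Gi` only).

```python
_BINARY_SUFFIXES = {"Ki": 2**10, "Mi": 2**20, "Gi": 2**30, "Ti": 2**40}


def parse_memory(quantity):
    """Convert a string such as "24Gi" to bytes (binary suffixes only)."""
    quantity = quantity.strip()
    for suffix, factor in _BINARY_SUFFIXES.items():
        if quantity.endswith(suffix):
            return int(quantity[: -len(suffix)]) * factor
    return int(quantity)  # no suffix: treat as a plain byte count


def fits(requested, available):
    """True if the requested memory does not exceed what is available."""
    return parse_memory(requested) <= parse_memory(available)
```

With these helpers, `fits("24Gi", "16Gi")` is false and `fits("16Gi", "16Gi")` is true, which matches the reviewer's point that the configured value has to come down.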

2 participants