Fail to run server with prefix-caching option #599

prd-tuong-nguyen · 2024-09-11T02:28:50Z

System Info

ghcr.io/predibase/lorax:a8ca5cb
Ubuntu 20.04
GPU A10G

Information

Docker
The CLI directly

Tasks

An officially supported command
My own modifications

Reproduction

docker run --gpus 1 -v ./data:/data -p 8005:80 ghcr.io/predibase/lorax:a8ca5cb \
  --prefix-caching true \
  --port 80 \
  --model-id Open-Orca/Mistral-7B-OpenOrca \
  --cuda-memory-fraction 0.8 \
  --sharded false \
  --max-waiting-tokens 20 \
  --max-input-length 4096 \
  --max-total-tokens 8192 \
  --hostname 0.0.0.0 \
  --max-concurrent-requests 512 \
  --max-best-of 1 \
  --max-batch-prefill-tokens 4096 \
  --max-active-adapters 10 \
  --adapter-source local \
  --adapter-cycle-time-s 2 \
  --json-output \
  --disable-custom-kernels \
  --dtype float16

Expected behavior

The server starts successfully and prefix caching works.

prd-tuong-nguyen · 2024-10-14T09:11:41Z

@tgaddair Hi bro, any update on this?

prd-tuong-nguyen mentioned this issue Sep 11, 2024

Add prefix caching #581

Merged
