```shell
--port 80 \
--model-id microsoft/Phi-3-mini-128k-instruct \
--cuda-memory-fraction 0.8 \
--sharded false \
--max-waiting-tokens 20 \
--max-input-length 4096 \
--max-total-tokens 8192 \
--hostname 0.0.0.0 \
--max-concurrent-requests 512 \
--max-best-of 1 \
--max-batch-prefill-tokens $BATCH_TOKEN \
--max-active-adapters 10 \
--adapter-source local \
--adapter-cycle-time-s 2 \
--json-output \
--disable-custom-kernels \
--dtype float16
```
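To exercise the server, a request against the launcher's standard `/generate` endpoint is enough; a minimal sketch follows (the prompt and generation parameters here are placeholders, not the exact ~1000-token prompt from the report):

```python
# Minimal sketch of a request against the LoRAX server launched above.
# The prompt and max_new_tokens value are placeholders / assumptions.
import requests

LORAX_URL = "http://127.0.0.1:80/generate"  # matches --port 80 above

payload = {
    "inputs": "<~1000-token prompt goes here>",
    # Default decoding is greedy, which keeps outputs comparable across servers.
    "parameters": {"max_new_tokens": 256},
}

resp = requests.post(LORAX_URL, json=payload, timeout=120)
resp.raise_for_status()
print(resp.json()["generated_text"])
```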
### Expected behavior
When running LoRAX with the model microsoft/Phi-3-mini-128k-instruct, I encountered unexpected behavior with the following configurations:
Configuration A:
- max-input-length = 4096
- max-total-tokens = 8192
- Prompt Length: Approximately 1000 tokens
In this configuration, the generated response differs significantly from what vLLM produces for the same prompt (see the vLLM sketch after Configuration B).
Configuration B:
- max-input-length = 4090
- max-total-tokens = 4096
This configuration works well and produces expected results.
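For reference, the vLLM side of the comparison can be reproduced with a short script along these lines (a sketch: the sampling parameters are assumptions, and `trust_remote_code` may not be required on newer transformers/vLLM versions):

```python
# Rough vLLM baseline for the same prompt (sketch; sampling parameters
# are assumptions, not necessarily the exact settings used originally).
from vllm import LLM, SamplingParams

llm = LLM(
    model="microsoft/Phi-3-mini-128k-instruct",
    dtype="float16",
    trust_remote_code=True,  # early Phi-3 revisions shipped custom code
)

params = SamplingParams(temperature=0.0, max_tokens=256)  # greedy decoding
outputs = llm.generate(["<same ~1000-token prompt>"], params)
print(outputs[0].outputs[0].text)
```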
Additionally, I tested the model microsoft/Phi-3-mini-4k-instruct, and it also functioned correctly.
It seems there may be an issue with handling long contexts when using microsoft/Phi-3-mini-128k-instruct.
Could you please investigate this issue? I found a related discussion here: [Hugging Face Discussion](https://huggingface.co/microsoft/Phi-3-mini-128k-instruct/discussions/85). Thank you!
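Since the 4k checkpoint works and only the 128k one misbehaves, the difference most likely lies in the long-context (rope-scaling) path: the 128k variant carries a `rope_scaling` entry in its config that the 4k variant does not. A quick way to see the relevant settings, using only the public transformers config API (nothing LoRAX-specific):

```python
# Sketch: print the context/rope settings of the two Phi-3 checkpoints
# to highlight what differs between the working and failing models.
from transformers import AutoConfig

for name in (
    "microsoft/Phi-3-mini-4k-instruct",
    "microsoft/Phi-3-mini-128k-instruct",
):
    cfg = AutoConfig.from_pretrained(name, trust_remote_code=True)
    print(name)
    print("  max_position_embeddings:", cfg.max_position_embeddings)
    print("  rope_scaling:", getattr(cfg, "rope_scaling", None))
```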
### System Info

ghcr.io/predibase/lorax:f1ef0ee