adding example script for hosting an OpenAI API server for OLMo 2 on Modal.com #761
Conversation
Can you also add a section in the README describing how you use this?
# https://github.com/modal-labs/modal-examples/blob/ed89980d7288cd35c57f23861ba1b1c8d198f68d/06_gpu_and_ml/llm-serving/vllm_inference.py

import os
import modal
This is an extra dependency. Is there a place where you can explain what needs to be done to get this working?
Added setup information to the README
scripts/olmo2_modal_openai.py (Outdated)
.pip_install(
    "git+https://github.com/vllm-project/vllm.git@9db713a1dca7e1bc9b6ecf5303c63c7352c52a13",
)
.pip_install(
    "git+https://github.com/huggingface/transformers.git@9121ab8fe87a296e57f9846e70153b1a3c555d75",
)
Do we still need to depend on shas? I think both vllm and huggingface now have OLMo in main, if not in the latest released version.
huggingface/transformers has a tagged version that includes OLMo 2 as of 5 hours ago, but vLLM still requires a build. I'll try it with main instead of the specific SHA, which should work, but it'll take an hour or two of build time to confirm.
👍
After running this for a while, I'm going to swing back to advocating for pinning, either to a SHA or to a tagged version once that catches up. Otherwise we risk unacceptable delays rebuilding the image whenever we spin up new containers in response to a surge in traffic and a new version has come out since the last time we brought up the container.
With that in mind, do we want to go with the pinned SHA we've got, or wait until there's a tagged vLLM version with the OLMo 2 architecture, which I'm guessing is mid-month?
Let me know when I should look again!
…ge setup since no vllm build required, plus docs cleanup
OK, this should be cleaned up. It's using a version of vLLM pinned to a specific SHA per my discussion above, but installed from a prebuilt wheel per https://docs.vllm.ai/en/latest/getting_started/installation.html#install-the-latest-code instead of built from source, which is much faster and simplifies the image setup.
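For reference, a minimal sketch of what that style of image setup can look like in Modal, assuming a prebuilt per-commit wheel is available. The wheel URL and the transformers requirement below are illustrative placeholders, not the exact values from this PR's script; see the vLLM installation docs linked above for the current per-commit wheel location.

```python
import modal

# SHA discussed above; the wheel URL below is a hypothetical placeholder --
# check the vLLM "install the latest code" docs for the real per-commit wheel path.
VLLM_COMMIT = "9db713a1dca7e1bc9b6ecf5303c63c7352c52a13"

vllm_image = (
    modal.Image.debian_slim(python_version="3.12")
    .pip_install(
        # Installing from a prebuilt wheel avoids a long build-from-source step
        # every time a fresh container image is built.
        f"https://vllm-wheels.s3.us-west-2.amazonaws.com/{VLLM_COMMIT}/vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl",
    )
    # Assumed: a transformers release that already includes the OLMo 2 architecture.
    .pip_install("transformers>=4.47")
)
```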
@dirkgr Should be ready for you to look again! I expect the most controversial aspect remaining is the vLLM version, which I want to pin so it's not constantly pausing to pull minor new versions whenever they get released. But if you'd rather wait until there's an official tagged release that includes the OLMo 2 architecture rather than continuing to tie to a git hash, I'm open to that (especially if you think we're about due for a new official release).
Just some minor comments about the instructions
README.md (Outdated)
@@ -154,6 +154,48 @@ The quantized model is sensitive to input types and CUDA handling. To avoid pote

Additional tools for evaluating OLMo models are available at the [OLMo Eval](https://github.com/allenai/OLMo-eval) repo.

## Modal.com Hosting

An example script is provided for hosting an OLMo 2 model on Modal.com using a the OpenAI API in ./scripts/olmo2_modal_openai.py.
typo:
An example script is provided for hosting an OLMo 2 model on Modal.com using a the OpenAI API in ./scripts/olmo2_modal_openai.py.
An example script is provided for hosting an OLMo 2 model on Modal.com using the OpenAI API in ./scripts/olmo2_modal_openai.py.
Good catch!
README.md (Outdated)
An example script is provided for hosting an OLMo 2 model on Modal.com using a the OpenAI API in ./scripts/olmo2_modal_openai.py.
To run that:
<ol>
Why not just use markdown list syntax?
Good question; will change that.
scripts/olmo2_modal_openai.py (Outdated)
APP_NAME = "OLMo-2-1124-13B-Instruct-openai"
APP_LABEL = APP_NAME.lower()

MINUTES = 60  # seconds
Is it minutes or seconds?
The 60 is measured in seconds. The intent is to create a constant representing one minute (60 seconds), so that timeouts measured in seconds can later be written as "5 * MINUTES" for 300 seconds.
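A short sketch of the pattern being described; the app and function names below are hypothetical, not the ones from this script.

```python
import modal

MINUTES = 60  # one minute, expressed in seconds

app = modal.App("timeout-example")  # hypothetical app name

# Modal's timeout parameter is given in seconds, so 5 * MINUTES passes 300
# seconds to the API while reading as "5 minutes" at the call site.
@app.function(timeout=5 * MINUTES)
def long_running_task():
    ...
```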
Thanks @cnewell, LGTM
@dirkgr said he's cool with it based on Pete's approval
This is a sample script for Modal.com to stand up a standard OpenAI API server for OLMo 2, as a way for folks to get their own copy running quickly using Modal.
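Once deployed with `modal deploy scripts/olmo2_modal_openai.py`, the server can be queried with any OpenAI-compatible client. A rough sketch using the standard OpenAI Python client is below; the base URL, API key, and served model name are placeholders that depend on your Modal workspace and how the script is configured.

```python
from openai import OpenAI

# Placeholder values -- the real endpoint URL is printed when the app is
# deployed, and the key depends on how the server is configured.
client = OpenAI(
    base_url="https://<your-workspace>--olmo-2-1124-13b-instruct-openai-serve.modal.run/v1",
    api_key="<your-api-key>",
)

response = client.chat.completions.create(
    model="allenai/OLMo-2-1124-13B-Instruct",  # assumed served model name
    messages=[{"role": "user", "content": "Tell me about OLMo 2."}],
)
print(response.choices[0].message.content)
```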