
adding example script for hosting an OpenAI API server for OLMo 2 on Modal.com #761

Merged
merged 9 commits into main from chrisn/olmo2-modal-script on Dec 18, 2024

Conversation

cnewell (Contributor) commented Dec 5, 2024

This is a sample script for Modal.com to stand up a standard OpenAI API server for OLMo 2, as a way for folks to get their own copy running quickly using Modal.
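For readers skimming the thread: the shape of such a script is a Modal image with vLLM installed, plus a web endpoint that runs vLLM's OpenAI-compatible server. Below is a minimal sketch of that pattern, not the PR's actual script; the app name, GPU type, and timeouts are illustrative assumptions, and the Modal and vLLM APIs shown should be checked against their current docs.

import modal

app = modal.App("olmo2-openai-sketch")  # illustrative name, not the PR's

# Hypothetical image; the PR pins vLLM/transformers versions (see the discussion below).
image = modal.Image.debian_slim(python_version="3.11").pip_install("vllm")

@app.function(image=image, gpu="A100", timeout=20 * 60)
@modal.web_server(port=8000, startup_timeout=10 * 60)
def serve():
    import subprocess
    # vLLM ships an OpenAI-compatible HTTP server; "vllm serve" launches it.
    subprocess.Popen(
        ["vllm", "serve", "allenai/OLMo-2-1124-13B-Instruct",
         "--host", "0.0.0.0", "--port", "8000"]
    )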

@cnewell cnewell requested a review from dirkgr December 5, 2024 21:06
dirkgr (Member) previously requested changes on Dec 5, 2024, leaving a comment:

Can you also add a section in the README describing how you use this?

# Reference: Modal's vLLM inference example
# https://github.com/modal-labs/modal-examples/blob/ed89980d7288cd35c57f23861ba1b1c8d198f68d/06_gpu_and_ml/llm-serving/vllm_inference.py

import os
import modal
Member

This is an extra dependency. Is there a place where you can explain what needs to be done to get this working?

Contributor Author

Added setup information to the README

Comment on lines 77 to 82
.pip_install(
"git+https://github.com/vllm-project/vllm.git@9db713a1dca7e1bc9b6ecf5303c63c7352c52a13",
)
.pip_install(
"git+https://github.com/huggingface/transformers.git@9121ab8fe87a296e57f9846e70153b1a3c555d75",
)
Member

Do we still need to depend on shas? I think both vllm and huggingface now have OLMo in main, if not in the latest released version.

Contributor Author

huggingface/transformers is on a tagged version that includes olmo2 as of 5 hours ago, but vLLM still requires a build. I'll try it with main instead of the specific sha, which should work, but that'll be an hour or two for the build to confirm it.

Member

👍

Contributor Author

After running this for a while, I'm going to swing back to advocating for pinning, either to a SHA or to a tagged version once that catches up. Otherwise we face the likelihood of unacceptable delays building new versions if we spin up new containers in response to a surge in traffic and a new version has come out since the last time we brought up the container.

With that in mind, do we want to go with a pinned sha we've got, or wait until we get a tagged vllm version with the olmo2 architecture, which I'm guessing is mid-month?
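For concreteness, the two pinning options on the table look roughly like this in the Modal image definition. The SHA is the one already in the PR; the version number is a hypothetical stand-in for whichever tagged vLLM release first ships the OLMo 2 architecture.

import modal

# Option 1: pin to a known-good git SHA (what the PR currently does).
image_sha_pinned = modal.Image.debian_slim().pip_install(
    "git+https://github.com/vllm-project/vllm.git@9db713a1dca7e1bc9b6ecf5303c63c7352c52a13",
)

# Option 2: pin to a tagged release once one includes OLMo 2.
image_tag_pinned = modal.Image.debian_slim().pip_install(
    "vllm==0.6.5",  # placeholder version, not a confirmed OLMo 2-capable release
)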

dirkgr (Member) commented Dec 5, 2024

Let me know when I should look again!

cnewell (Contributor Author) commented Dec 9, 2024

OK, this should be cleaned up. It's using a version of vLLM that's pinned to a specific SHA per my discussion above, but it's using a prebuilt wheel per https://docs.vllm.ai/en/latest/getting_started/installation.html#install-the-latest-code instead of a build from source, which is much faster and simplifies the image setup.
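As a sketch of what that looks like (the wheel URL pattern follows vLLM's "install the latest code" docs of the time and should be verified against the current instructions; the SHA is the one pinned in this PR):

import modal

VLLM_COMMIT = "9db713a1dca7e1bc9b6ecf5303c63c7352c52a13"  # SHA pinned in this PR

# Install vLLM's prebuilt per-commit wheel instead of building from source.
# Bucket layout and wheel filename are assumptions taken from the docs linked above.
image = modal.Image.debian_slim(python_version="3.11").pip_install(
    f"https://vllm-wheels.s3.us-west-2.amazonaws.com/{VLLM_COMMIT}/"
    "vllm-1.0.0.dev-cp38-abi3-manylinux1_x86_64.whl"
)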

cnewell (Contributor Author) commented Dec 11, 2024

@dirkgr Should be ready for you to look again! I expect the most controversial remaining aspect is the vLLM version, which I want to pin so the build isn't constantly pausing to pull minor new versions whenever they get released. If you'd rather wait until there's an official tagged release that includes the OLMo 2 architecture, rather than continuing to tie to a git hash, I'm open to that (especially if you think we're about due for a new official release).

@epwalsh epwalsh (Member) left a comment

Just some minor comments about the instructions

README.md Outdated
@@ -154,6 +154,48 @@ The quantized model is sensitive to input types and CUDA handling. To avoid pote

Additional tools for evaluating OLMo models are available at the [OLMo Eval](https://github.com/allenai/OLMo-eval) repo.

## Modal.com Hosting

An example script is provided for hosting an OLMo 2 model on Modal.com using a the OpenAI API in ./scripts/olmo2_modal_openai.py.
Member

typo:

Suggested change
An example script is provided for hosting an OLMo 2 model on Modal.com using a the OpenAI API in ./scripts/olmo2_modal_openai.py.
An example script is provided for hosting an OLMo 2 model on Modal.com using the OpenAI API in ./scripts/olmo2_modal_openai.py.

Contributor Author

Good catch!

README.md Outdated

An example script is provided for hosting an OLMo 2 model on Modal.com using a the OpenAI API in ./scripts/olmo2_modal_openai.py.
To run that:
<ol>
Member

Why not just use markdown list syntax?

Contributor Author

Good question; will change that.

@epwalsh epwalsh requested a review from AkshitaB December 17, 2024 17:57
APP_NAME = "OLMo-2-1124-13B-Instruct-openai"
APP_LABEL = APP_NAME.lower()

MINUTES = 60 # seconds
Contributor

Is it minutes or seconds?

Contributor Author

The 60 is measured in seconds. The intent is to create a constant representing one minute (60 seconds), so that timeouts, which are specified in seconds, can later be written as 5 * MINUTES for 300 seconds.
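Concretely, a tiny self-contained illustration of the pattern:

MINUTES = 60  # one minute, expressed in seconds

# Timeout parameters take seconds, so 5 * MINUTES reads as "five minutes".
assert 5 * MINUTES == 300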

cnewell (Contributor Author) commented Dec 18, 2024

@epwalsh @AkshitaB Thank you for the feedback today! Updates in place, with typos fixed and hopefully slightly clearer.

@epwalsh epwalsh (Member) left a comment

Thanks @cnewell, LGTM

@cnewell cnewell dismissed dirkgr’s stale review December 18, 2024 18:59

@dirkgr said he's cool with it based on Pete's approval

@cnewell cnewell merged commit f11c935 into main Dec 18, 2024
11 checks passed
@cnewell cnewell deleted the chrisn/olmo2-modal-script branch December 18, 2024 19:00