-
There are ways to do this. DeepJavaLibrary (DJL) currently supports your use case.
Using this container with serving.properties and requirements.txt will work for your case. Tested on G5 and P4d instances.
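For readers without the original attachments, here is a minimal sketch of how a serving.properties for vLLM rolling batch could be generated. The key names are the DJL Serving options as I understand them, and the model ID and tensor parallel degree are placeholder assumptions, not the values from the post above; check them against the current LMI documentation before use.

```python
from pathlib import Path


def render_serving_properties(model_id: str, tensor_parallel_degree: int) -> str:
    """Return the text of a serving.properties file for vLLM rolling batch.

    The option names are assumptions to verify against the DJL Serving docs;
    the values passed in are placeholders chosen by the caller.
    """
    lines = [
        "engine=Python",
        f"option.model_id={model_id}",
        "option.rolling_batch=vllm",
        f"option.tensor_parallel_degree={tensor_parallel_degree}",
    ]
    return "\n".join(lines) + "\n"


def write_serving_properties(model_dir: str, text: str) -> Path:
    """Write the rendered text to <model_dir>/serving.properties."""
    target = Path(model_dir) / "serving.properties"
    target.parent.mkdir(parents=True, exist_ok=True)
    target.write_text(text)
    return target


if __name__ == "__main__":
    # Hypothetical values for illustration only
    text = render_serving_properties("mistralai/Mistral-7B-Instruct-v0.2", 4)
    print(write_serving_properties("models", text))
```

The resulting directory would then be packaged and uploaded alongside any requirements.txt your setup needs.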
-
Are we supposed to mention anything in the model.py?
-
From what I see, https://docs.djl.ai/docs/demos/aws/sagemaker/large-model-inference/sample-llm/vllm_deploy_llama_13b.html contains the tutorial for doing that.
-
Are there any options to use vLLM in the model.py file? When I try this I got
%%writefile models2/model.py
from djl_python import Input, Output
from vllm import LLM, SamplingParams

# Module-level handle to the vLLM engine; created lazily on the first request
client = None
# Sampling settings (illustrative values; tune for your workload)
sampling_params = SamplingParams(temperature=0.7, max_tokens=256)


# Model loader function
def load_model():
    global client
    client = LLM("mistralai/Mistral-7B-Instruct-v0.2", trust_remote_code=True, tensor_parallel_size=8)


# Handler function
def handle(input: Input):
    print('handler called', flush=True)
    # Check if input is empty
    if input.is_empty():
        return None
    payload = input.get_as_json()
    print("ron input", payload)
    input_prompt = str(payload.get('prompt', ''))
    if len(input_prompt) < 1:
        return None
    # Load the model
    if client is None:
        load_model()
    # generate() takes a list of prompts and returns one RequestOutput per prompt
    results = client.generate([input_prompt], sampling_params=sampling_params)
    output_words = results[0].outputs[0].text
    # Send result to output
    output = Output()
    output.add(output_words)
    return output
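Since the handler above reads a `prompt` field from a JSON body, a request to the deployed endpoint could look like the sketch below. It assumes boto3 is installed and AWS credentials are configured; the endpoint name is a hypothetical placeholder.

```python
import json


def build_request_body(prompt: str) -> bytes:
    """Serialize the JSON payload shape that the handler expects."""
    return json.dumps({"prompt": prompt}).encode("utf-8")


def invoke(endpoint_name: str, prompt: str) -> str:
    """Send one prompt to a SageMaker endpoint and return the raw response text."""
    import boto3  # imported lazily so the payload helper works without boto3

    runtime = boto3.client("sagemaker-runtime")
    response = runtime.invoke_endpoint(
        EndpointName=endpoint_name,
        ContentType="application/json",
        Body=build_request_body(prompt),
    )
    return response["Body"].read().decode("utf-8")


if __name__ == "__main__":
    # "my-vllm-endpoint" is a hypothetical name; replace it with your own
    print(invoke("my-vllm-endpoint", "What is Amazon SageMaker?"))
```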
-
Hey @lanking520, https://github.com/deepjavalibrary/djl-demo/blob/master/aws/sagemaker/large-model-inference/sample-llm/vllm_rollingbatch_deploy_customized_processing.ipynb
-
The sample that Qing has shared is quite old at this point. I recommend that you follow our guide here: https://github.com/deepjavalibrary/djl-serving/blob/master/serving/docs/lmi/deployment_guide/deploying-your-endpoint.md#option-2-configuration---environment-variables. Replace HF_MODEL_ID with the Hugging Face model ID you are trying to deploy. If you have a custom model, or artifacts stored in S3, we have some details on using SageMaker's support for uncompressed model artifacts here: https://github.com/deepjavalibrary/djl-serving/blob/master/serving/docs/lmi/deployment_guide/deploying-your-endpoint.md#option-2-configuration---environment-variables. Hope this helps.
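As a rough sketch of the environment-variable approach from that guide, the snippet below builds the container environment and deploys it with the SageMaker Python SDK. Only HF_MODEL_ID comes from the comment above; the other variable names, the instance type, and the function names are assumptions to verify against the guide, and the image URI and role ARN must be supplied by you.

```python
from typing import Dict, Optional


def build_lmi_env(model_id: str,
                  tensor_parallel_degree: str = "max",
                  rolling_batch: str = "vllm") -> Dict[str, str]:
    """Build the container environment for an LMI deployment.

    HF_MODEL_ID is named in the guide; the other two names are assumptions
    to check against the current LMI documentation.
    """
    return {
        "HF_MODEL_ID": model_id,
        "TENSOR_PARALLEL_DEGREE": tensor_parallel_degree,
        "OPTION_ROLLING_BATCH": rolling_batch,
    }


def deploy(image_uri: str, role_arn: str, env: Dict[str, str],
           instance_type: str = "ml.g5.12xlarge",
           endpoint_name: Optional[str] = None):
    """Create a SageMaker model from the container image and deploy it."""
    import sagemaker  # imported lazily so build_lmi_env works without the SDK

    model = sagemaker.Model(image_uri=image_uri, role=role_arn, env=env)
    model.deploy(
        initial_instance_count=1,
        instance_type=instance_type,
        endpoint_name=endpoint_name,
    )
    return model


if __name__ == "__main__":
    env = build_lmi_env("mistralai/Mistral-7B-Instruct-v0.2")
    print(env)
    # deploy("<your LMI container image URI>", "<your execution role ARN>", env)
```

With this approach no model.py or serving.properties is needed for a model pulled directly from the Hugging Face Hub.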
-
How can this be run without djl-serving? Can you run the vllm/vllm-openai:latest container on AWS SageMaker? If not, what needs to be changed to make it work?
-
When a model is deployed as a SageMaker endpoint using DJL + vLLM, is it deployed as an OpenAI-compatible server, or does it run offline inference within the endpoint?
-
Hi, I am trying to test the throughput of vLLM during inference. I am using Amazon SageMaker. My typical notebook example is this one: https://github.com/huggingface/notebooks/blob/5ef609e9078e6248d73f28106e60ddafa9359db1/sagemaker/24_train_bloom_peft_lora/sagemaker-notebook.ipynb. Are there any resources I can use as a reference for deploying an endpoint using vLLM on SageMaker?