Releases: openvinotoolkit/model_server
OpenVINO™ Model Server 2024.4
The 2024.4 release brings official support for OpenAI API text generation, which is now recommended for production use. It also comes with a set of added features and improvements.
Changes and improvements
- Significant performance improvements for the multinomial sampling algorithm
- finish_reason in the response now correctly reports reaching max_tokens (length) and completing the sequence (stop)
- Added automatic cancellation of text generation for disconnected clients
- Included the prefix caching feature, which speeds up text generation by caching the prompt evaluation
- Added an option to compress the KV cache to lower precision – it reduces memory consumption with minimal impact on accuracy
- Added support for the stop sampling parameter. It can define a sequence which stops text generation.
- Added support for the logprobs sampling parameter. It returns the probabilities of generated tokens. See the example after this list.
- Included generic metrics related to execution of MediaPipe graphs. The metric ovms_current_graphs can be used for autoscaling based on current load and the level of concurrency. Counters like ovms_requests_accepted and ovms_responses can track the activity of the server.
- Included a demo of text generation horizontal scalability
- Configurable handling of non-UTF-8 responses from the model – the detokenizer can now automatically change them to the Unicode replacement character
- Included support for Llama3.1 models
- Text generation is supported both on CPU and GPU – check the demo
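Below is a minimal sketch of a text generation request using the openai Python client, showing the stop and logprobs parameters mentioned above. The base URL, port and model name are assumptions that depend on your deployment; check the model server OpenAI API documentation for the exact endpoint.

```python
# Minimal sketch, assuming the OpenAI-compatible endpoint is exposed on localhost:8000
# and a model named "meta-llama/Meta-Llama-3.1-8B-Instruct" is configured.
from openai import OpenAI

client = OpenAI(
    base_url="http://localhost:8000/v3",  # adjust host, port and path to your deployment
    api_key="unused",                     # the model server does not verify the key
)

response = client.chat.completions.create(
    model="meta-llama/Meta-Llama-3.1-8B-Instruct",
    messages=[{"role": "user", "content": "List three uses of a model server."}],
    max_tokens=128,
    stop=["\n\n"],   # generation ends early when this sequence appears (finish_reason == "stop")
    logprobs=True,   # include log probabilities of the generated tokens
)

print(response.choices[0].message.content)
print(response.choices[0].finish_reason)  # "stop" or "length"
```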
Breaking changes
No breaking changes.
Bug fixes
- Security and stability improvements
- Fixed handling of model templates without bos_token
You can use the OpenVINO Model Server public Docker images based on Ubuntu via the following commands:
docker pull openvino/model_server:2024.4
- CPU device support with the image based on Ubuntu 22.04
docker pull openvino/model_server:2024.4-gpu
- CPU, GPU and NPU device support with the image based on Ubuntu 22.04
or use the provided binary packages.
The prebuilt image is also available on the RedHat Ecosystem Catalog.
OpenVINO™ Model Server 2024.3
The 2024.3 release focuses mostly on improvements in the OpenAI API text generation implementation.
Changes and improvements
A set of improvements in OpenAI API text generation:
- Significantly better performance thanks to numerous improvements in OpenVINO Runtime and sampling algorithms
- Added config parameters best_of_limit and max_tokens_limit to avoid memory overconsumption caused by invalid requests. Read more
- Added reporting of LLM metrics in the server logs. Read more
- Added extra sampling parameters diversity_penalty, length_penalty, repetition_penalty. Read more. See the sketch after this list.
Improvements in documentation and demos:
- Added RAG demo with OpenAI API
- Added K8S deployment demo for text generation scenarios
- Simplified model initialization for a set of demos with MediaPipe graphs using the pose_detection model. TFLite models don't require any conversions. Check demo
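The extra sampling parameters are not part of the standard OpenAI schema, so with the openai Python client they can be passed as additional JSON fields. This is an illustrative sketch only; the endpoint, port and model name are placeholders, and whether a given parameter is accepted depends on the server version and configuration.

```python
# Illustrative sketch: non-standard sampling parameters sent as extra JSON fields.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v3", api_key="unused")

response = client.chat.completions.create(
    model="my-llm",   # placeholder for a configured servable name
    messages=[{"role": "user", "content": "Write a haiku about inference servers."}],
    max_tokens=64,
    extra_body={                      # fields outside the standard OpenAI request schema
        "repetition_penalty": 1.1,
        "length_penalty": 1.0,
        "diversity_penalty": 1.0,
    },
)
print(response.choices[0].message.content)
```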
Breaking changes
No breaking changes.
Bug fixes
- Resolved issue with sporadic text generation hang via OpenAI API endpoints
- Fixed issue with the chat streamer when handling incomplete UTF-8 sequences
- Corrected format of the last streaming event in the completions endpoint
- Fixed issue with requests hanging when running out of available cache
You can use the OpenVINO Model Server public Docker images based on Ubuntu via the following commands:
docker pull openvino/model_server:2024.3
- CPU device support with the image based on Ubuntu 22.04
docker pull openvino/model_server:2024.3-gpu
- GPU and CPU device support with the image based on Ubuntu 22.04
or use the provided binary packages.
The prebuilt image is also available on the RedHat Ecosystem Catalog.
OpenVINO™ Model Server 2024.2
The major new functionality in 2024.2 is a preview of the OpenAI-compatible API for text generation, along with state-of-the-art techniques like continuous batching and PagedAttention that improve the efficiency of generative workloads.
Changes and improvements
- Updated OpenVINO Runtime backend to 2024.2
- OpenVINO Model Server can now be used for text generation use cases via an OpenAI-compatible API
- Added support for continuous batching and PagedAttention algorithms for fast and efficient text generation under high concurrency, especially on Intel Xeon processors. Learn more about it. A streaming example is sketched after this list.
- Added LLM text generation OpenAI API demo.
- Added notebook showcasing the RAG algorithm with online scope changes delegated to the model server. Link
- Enabled Python 3.12 for Python clients, samples and demos.
- Updated RedHat UBI base image to 8.10
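A rough sketch of a streaming request against the preview endpoint with the openai Python client follows; the base URL and model name are placeholders for a concrete deployment.

```python
# Sketch only: streams tokens from the OpenAI-compatible preview endpoint.
from openai import OpenAI

client = OpenAI(base_url="http://localhost:8000/v3", api_key="unused")

stream = client.chat.completions.create(
    model="my-llm",   # placeholder servable name
    messages=[{"role": "user", "content": "Explain continuous batching in one paragraph."}],
    max_tokens=200,
    stream=True,      # receive tokens as they are generated
)
for chunk in stream:
    delta = chunk.choices[0].delta.content
    if delta:
        print(delta, end="", flush=True)
```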
Breaking changes
No breaking changes.
You can use the OpenVINO Model Server public Docker images based on Ubuntu via the following commands:
docker pull openvino/model_server:2024.2
- CPU device support with the image based on Ubuntu 22.04
docker pull openvino/model_server:2024.2-gpu
- GPU and CPU device support with the image based on Ubuntu 22.04
or use the provided binary packages.
The prebuilt image is also available on the RedHat Ecosystem Catalog.
OpenVINO™ Model Server 2024.1
The 2024.1 release has a few improvements in the serving functionality, demo enhancements and bug fixes.
Changes and improvements
- Updated OpenVINO Runtime backend to 2024.1 Link
- Added support for OpenVINO models with string data type on output. Together with the features introduced in 2024.0, OVMS can now support models with both input and output of string type. That way you can take advantage of the tokenization built into the model as the first layer. You can also rely on any post-processing embedded into the model which returns just text. Check the universal sentence encoder demo and the image classification with string output demo. A request sketch follows this list.
- Updated MediaPipe python calculators to support relative paths for all related configuration and python code files. Now, the complete graph configuration folder can be deployed in an arbitrary path without any code changes. It is demonstrated in the updated text generation demo.
- Extended support for the KServe REST API for MediaPipe graph endpoints. Now you can send the data in a KServe JSON body. Check how it is used in the text generation use case.
- Added demo showcasing the full RAG algorithm entirely delegated to the model server Link
- Added RedHat UBI based Dockerfile for python demos, usage documented in python demos
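A rough sketch of a KServe REST request with a string input, assuming a servable named usem that exposes a single string tensor; the model name, input name and port are placeholders.

```python
# Sketch only: string input sent in a KServe v2 REST JSON body.
import requests

payload = {
    "inputs": [
        {
            "name": "inputs",      # input tensor name exposed by the servable
            "shape": [1],
            "datatype": "BYTES",   # strings are passed as BYTES elements
            "data": ["OpenVINO Model Server accepts string inputs"],
        }
    ]
}
resp = requests.post("http://localhost:8000/v2/models/usem/infer", json=payload)
print(resp.json())
```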
Breaking changes
No breaking changes.
Bug fixes
- Improvements in error handling for invalid requests and incorrect configuration
- Fixes in the demos and documentation
You can use the OpenVINO Model Server public Docker images based on Ubuntu via the following commands:
docker pull openvino/model_server:2024.1
- CPU device support with the image based on Ubuntu 22.04
docker pull openvino/model_server:2024.1-gpu
- GPU and CPU device support with the image based on Ubuntu 22.04
or use the provided binary packages.
The prebuilt image is also available on the RedHat Ecosystem Catalog.
OpenVINO™ Model Server 2024.0
The 2024.0 release includes a new version of the OpenVINO™ backend and several improvements in the serving functionality.
Changes and improvements
- Updated OpenVINO™ Runtime backend to 2024.0. Link
- Extended text generation demo to support multi batch size both with streaming and unary clients. Link to demo
- Added support for REST client for servables based on MediaPipe graphs including python pipeline nodes. Link to demo
- Added additional MediaPipe calculators which can be reused for multiple image analysis scenarios. Link to new calculators
- Added support for models with a string input data type, including a tokenization extension. Link to demo
- Security related updates in versions of included dependencies.
Deprecation notices
Batch Size AUTO and Shape AUTO are deprecated and will be removed.
Use Dynamic Model Shape feature instead.
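As an illustration of the recommended replacement, a model entry in config.json can declare a dynamic batch dimension explicitly instead of relying on batch size or shape AUTO. The model name and path below are placeholders.

```python
# Sketch only: writes a config.json entry that uses a dynamic shape (-1 = dynamic batch)
# instead of the deprecated batch_size/shape AUTO settings.
import json

config = {
    "model_config_list": [
        {
            "config": {
                "name": "resnet",                 # placeholder model name
                "base_path": "/models/resnet",    # placeholder model path
                "shape": "(-1,3,224,224)"         # -1 marks the dynamic batch dimension
            }
        }
    ]
}

with open("config.json", "w") as f:
    json.dump(config, f, indent=2)
```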
Breaking changes
No breaking changes.
Bug fixes
- Improvements in error handling for invalid requests and incorrect configuration
- Minor fixes in the demos and documentation
You can use the OpenVINO Model Server public Docker images based on Ubuntu via the following commands:
docker pull openvino/model_server:2024.0
- CPU device support with the image based on Ubuntu 22.04
docker pull openvino/model_server:2024.0-gpu
- GPU and CPU device support with the image based on Ubuntu 22.04
or use the provided binary packages.
The prebuilt image is also available on the RedHat Ecosystem Catalog.
OpenVINO™ Model Server 2023.3
The 2023.3 is a major release with a new feature and numerous improvements.
Changes and improvements
- Included a set of new demos using custom nodes implemented as Python code. They include LLM text generation, stable diffusion and seq2seq translation.
- Improvements in the demo highlighting video stream analysis. A simple client example can now process the video stream from a local camera, video file or RTSP stream. The data can be sent to the model server via unary gRPC calls or gRPC streaming.
- Changes in the public release artifacts – the base image of the public model server images is now updated to Ubuntu 22.04 and RHEL 8.8. Public docker images include support for python custom nodes but without custom python dependencies. The public binary distribution of the model server also targets Ubuntu 22.04 and RHEL 8.8 but without python support (it can be deployed on bare metal hosts without python installed). Check the building from source guide.
- Improvements in the documentation https://docs.openvino.ai/2023.3/ovms_what_is_openvino_model_server.html
New Features (Preview)
- Added support for serving MediaPipe graphs with custom nodes implemented as Python code. This greatly simplifies exposing GenAI algorithms based on Hugging Face and Optimum libraries. It can also be applied to arbitrary pre- and post-processing for AI solutions. Learn more about it
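For orientation, a python custom node is a Python class executed by the model server inside the graph. The minimal sketch below follows that pattern; the pyovms interface and the exact method signatures are assumptions that should be verified against the linked documentation, and the node logic here is purely illustrative.

```python
# Illustrative sketch of a python custom node; verify the exact interface in the docs.
from pyovms import Tensor

class OvmsPythonModel:
    def initialize(self, kwargs):
        # Typically a Hugging Face / Optimum pipeline would be loaded here.
        self.prefix = "echo: "

    def execute(self, inputs):
        # inputs is a list of tensors delivered by the graph; return a list of output tensors.
        text = bytes(inputs[0]).decode()
        return [Tensor("output", (self.prefix + text).encode())]
```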
Stable Feature
gRPC streaming support is out of preview and considered stable.
Breaking changes
No breaking changes.
Deprecation notices
Batch Size AUTO and Shape AUTO are deprecated and will be removed.
Use Dynamic Model Shape feature instead.
Bug fixes
- OVMS handles boolean parameters to plugin config now #2197
- Sporadic failures in the IrisTracking demo using gRPC stream are fixed #2161
- Fixed handling of incorrect MediaPipe graphs producing multiple outputs with the same name #2161
You can use the OpenVINO Model Server public Docker images based on Ubuntu via the following commands:
docker pull openvino/model_server:2023.3
- CPU device support with the image based on Ubuntu 22.04
docker pull openvino/model_server:2023.3-gpu
- GPU and CPU device support with the image based on Ubuntu 22.04
or use the provided binary packages.
The prebuilt image is also available on the RedHat Ecosystem Catalog.
OpenVINO™ Model Server 2023.2
The 2023.2 is a major release with several new features and improvements.
Changes
- Updated OpenVINO backend to version 2023.2.
- MediaPipe framework has been updated to the current latest version 0.10.3.
- Model API used in the OpenVINO Inference MediaPipe Calculator has been updated and included with all its features.
New Features
- Introduced an extension of the KServe gRPC API with a stream on input and output. That extension is enabled for servables with MediaPipe graphs. The MediaPipe graph is persistent in the scope of the user session. That improves processing performance and supports stateful graphs – for example tracking algorithms. It also enables the use of source calculators. Check more details.
- Added a demo showcasing gRPC streaming with a MediaPipe graph. Check more details.
- Added parameters for gRPC quota configuration and changed default gRPC channel arguments to add rate limits. This minimizes the risk of the service being impacted by an uncontrolled flow of requests. Check more details.
- Updated python client requirements to match a wide range of python versions, from 3.7 to 3.11
Breaking changes
No breaking changes.
Bug fixes
- Fixed handling of the situation when a MediaPipe graph is added with the same name as a previously loaded DAG.
- Fixed returned HTTP status code when MediaPipe graph/DAG is not loaded yet. (previously 404, now 503)
- Corrected error message returned via HTTP when using method other than GET for metadata endpoint - "Unsupported method".
You can use the OpenVINO Model Server public Docker images based on Ubuntu via the following commands:
docker pull openvino/model_server:2023.2
- CPU device support with the image based on Ubuntu 20.04
docker pull openvino/model_server:2023.2-gpu
- GPU and CPU device support with the image based on Ubuntu 22.04
or use the provided binary packages.
The prebuilt image is also available on the RedHat Ecosystem Catalog.
OpenVINO™ Model Server 2023.1
The 2023.1 is a major release with numerous improvements and changes.
New Features
- Improvements in Model Server with MediaPipe integration. In the previous version, the MediaPipe scheduler was included in OpenVINO Model Server as a preview. Now, the MediaPipe graph scheduler is added by default and officially supported. Check mediapipe in the model server documentation. This release includes the following improvements in handling request calls to the graphs:
  - GetModelMetadata implementation for MediaPipe graphs – the calls to model metadata return information about the expected input and output names from the graph, with a limitation on shape and datatype
  - Support for data serialization and deserialization to a range of types: ov::Tensor, mediapipe::Image, KServe ModelInfer Request/Response – those capabilities simplify adoption of existing graphs which might expect the input and output data in many different formats. Now the data submitted to the KServe endpoint can be automatically deserialized to the expected type. The deserialization function is determined based on the naming convention in the graph input and output tags in the graph config. Check more details.
  - OpenVINOInferenceCalculator support for a range of input formats from ov::Tensor to tensorflow::Tensor and TfLite::Tensor – the OpenVINOInferenceCalculator has been created as a replacement for TensorFlow calculators. It can accept input data and return data in a range of possible formats. That simplifies swapping inference-related nodes in existing graphs without changing the rest of the graph. Learn more about the calculators
  - Added demos based on MediaPipe upstream graphs: holistic sensory analysis, object detection, iris detection
- Improvements in the C-API interface:
  - Added OVMS_ApiVersion call
  - Added support for C-API calls to DAG pipelines
  - Changed data type in API calls for data shape from uint64_t to int64_t and dimCount from uint32_t to size_t; this is a breaking change
  - Added a call to servable (model, DAG) metadata and state
  - Added a call to get ServerMetadata
- Improvements in error handling
- Improvements in gRPC and REST status codes – the error statuses now include more meaningful and accurate info about the culprit
- Support for models with scalars on input (empty shape) – the model server can be used with models even with an input shape represented by an empty list [] (scalar). See the example after this list.
- Support for inputs with zero-size dimensions – the model server can now accept requests to dynamic shape models even with a dimension of size 0, like [0,234]
- Added support for TFLite models – OpenVINO Model Server can now directly serve models with the .tflite extension
- Demo improvements:
  - Added video streaming demos – text detection and holistic pose tracking
  - Stable diffusion demo
  - MediaPipe demos
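A rough sketch of an inference request to a model with a scalar input (empty shape) via the KServe REST API; the model name, input name, datatype and port are placeholders.

```python
# Sketch only: a scalar (empty shape) input in a KServe v2 REST JSON body.
import requests

payload = {
    "inputs": [
        {
            "name": "input",       # placeholder input name
            "shape": [],           # empty shape: a scalar value
            "datatype": "FP32",
            "data": [3.14],
        }
    ]
}
resp = requests.post("http://localhost:8000/v2/models/scalar_model/infer", json=payload)
print(resp.json())
```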
Breaking changes
- Changed a few of the C-API function names. Check this commit
Bug fixes
- Fixed REST status code when an improper path is requested
- Metrics endpoint now returns a correct response even with unsupported parameters
You can use the OpenVINO Model Server public Docker images based on Ubuntu via the following commands:
docker pull openvino/model_server:2023.1
- CPU device support with the image based on Ubuntu 20.04
docker pull openvino/model_server:2023.1-gpu
- GPU and CPU device support with the image based on Ubuntu 22.04
or use the provided binary packages.
The prebuilt image is also available on the RedHat Ecosystem Catalog.
OpenVINO™ Model Server 2023.0
The 2023.0 is a major release with numerous improvements and changes.
New Features
- Added the option to submit inference requests in the form of strings and to read the response also as a string. This can currently be utilized via custom nodes and OpenVINO models with a CPU extension handling string data:
- Using a custom node in a DAG pipeline which can perform string tokenization before passing it to the OpenVINO model - that is beneficial for models without tokenization layer to fully delegate that preprocessing to the model server.
- Using a custom node in a DAG pipeline which can perform string detokenization of the model response to convert it to a string format - that can be beneficial for models without detokenization layer to fully delegate that postprocessing to the model server.
- Both options above are demonstrated with a GPT model for text generation demo.
- For models with a tokenization layer like universal-sentence-encoder, there is an added CPU extension which implements the sentencepiece_tokenization layer. Users can pass a string to the model, which is automatically converted to the format needed by the CPU extension.
- The option above is demonstrated in universal-sentence-encoder model usage demo.
- Added support for string input and output in the ovmsclient – the ovmsclient library can be used to send string data to the model server. Check the code snippets; see also the sketch after this list.
- Preview version of OVMS with MediaPipe framework - it is possible to make calls to OpenVINO Model Server to perform mediapipe graph processing. There are calculators performing OpenVINO inference via C-API calls from OpenVINO Model Server, and also calculators converting the OV::Tensor input format to mediapipe image format. That creates a foundation for creating arbitrary graphs. Check model server integration with mediapipe documentation.
- Extended C-API interface with ApiVersion and Metadata calls, C-API version is now 0.3.
- Added support for saved_model format. Check how to create models repository. An example of such use case is in universal-sentence-encoder demo.
- Added option to build the model server with NVIDIA plugin on UBI8 base image.
- Virtual plugins AUTO, HETERO and MULTI are now supported with NVIDIA plugin.
- In the DEBUG log_level, there is included a message about the actual execution device for each inference request for the AUTO target_device. Learn more about the AUTO plugin.
- Support for relative paths to the model files. The paths can now be relative to the config.json location. It simplifies deployments when the config.json is distributed together with the models repository.
- Updated OpenCL drivers for the GPU device to version 23.13 (with the Ubuntu 22.04 base image).
- Added option to build OVMS on the base OS Ubuntu:22.04. This is an addition to the supported base OSes Ubuntu:20.04 and UBI8.7.
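A minimal sketch of sending a string input with ovmsclient, assuming a model such as universal-sentence-encoder is served under the name usem on gRPC port 9000; the names and port are placeholders.

```python
# Sketch only: string input sent via the ovmsclient gRPC client.
from ovmsclient import make_grpc_client

client = make_grpc_client("localhost:9000")   # adjust address to your deployment
output = client.predict(
    inputs={"inputs": ["A quick brown fox jumps over the lazy dog"]},
    model_name="usem",                         # placeholder servable name
)
print(output)
```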
Breaking changes
- KServe API unification with the Triton implementation for handling string and encoded image formats: now every string or encoded image located in the binary extension (REST) or raw_input_contents (gRPC) needs to be preceded by 4 bytes (little endian) containing its size. The updated code snippets and samples. A framing sketch follows this list.
- Changed default performance hint from THROUGHPUT to LATENCY. With the new default settings, the model server will be adjusted for optimal execution and minimal latency with low concurrency. The default setting will also minimize memory consumption. When using a model with high concurrency, it is recommended to adjust NUM_STREAMS or set the performance hint to THROUGHPUT explicitly. Read more in the performance tuning guide.
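As an illustration of the new framing, each binary element simply gets a 4-byte little-endian size prefix before concatenation. The helper below is a hypothetical sketch, not part of any client library; the file names are placeholders.

```python
# Hypothetical helper showing the 4-byte little-endian size prefix framing.
import struct

def pack_binary_inputs(blobs):
    packed = b""
    for blob in blobs:
        packed += struct.pack("<I", len(blob))  # 4-byte little-endian length
        packed += blob                           # followed by the element bytes
    return packed

# Example: framing two encoded JPEG images for raw_input_contents / the binary extension.
with open("cat.jpeg", "rb") as f1, open("dog.jpeg", "rb") as f2:
    raw_input_contents = pack_binary_inputs([f1.read(), f2.read()])
```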
Bug fixes
- The AUTO plugin starts serving models on CPU and switches to the GPU device after the model is compiled – it reduces the startup time for the model.
- Fixed image building error on MacOS and Ubuntu 22.
- The ovmsclient python library is compatible with tensorflow in the same environment – ovmsclient was created to avoid the requirement of installing the tensorflow package, to keep the python environment smaller. Now the tensorflow package will not conflict, so it is fully optional.
- Improved memory handling after unloading the models – the model server will now force releasing the memory after model unloading. Memory consumption reported by the model server process will be smaller in use cases where models are frequently changed.
You can use the OpenVINO Model Server public Docker images based on Ubuntu via the following commands:
docker pull openvino/model_server:2023.0
- CPU device support with the image based on Ubuntu 20.04
docker pull openvino/model_server:2023.0-gpu
- GPU and CPU device support with the image based on Ubuntu 22.04
or use the provided binary packages.
The prebuilt image is also available on the RedHat Ecosystem Catalog.
OpenVINO™ Model Server 2022.3.0.1
The 2022.3.0.1 version is a patch release for the OpenVINO Model Server. It includes a few bug fixes and enhancements in the C-API.
New Features
- Added support for DAG pipelines to the inference execution method OVMS_Inference in the C API. The servableName parameter can be either the model name or the pipeline name
- Added debug log in the AUTO plugin execution to report which physical device is used - AUTO plugin allocates the best available device for the model execution. For troubleshooting purposes, in the debug log level, the model server will report which device is used for each inference execution
- Allowed enabling metrics collection via CLI parameters while using the configuration file. Metrics collection can be configured in CLI parameters or in the configuration file. Enabling the metrics in the CLI no longer blocks the usage of a configuration file to define multiple models for serving.
- Added client sample in Java to demonstrate KServe API usage.
- Added client sample in Go to demonstrate KServe API usage.
- Added client samples demonstrating asynchronous calls via KServe API.
- Added a demo showcasing OVMS with GPT-J-6b model from Hugging Face.
Bug fixes
- Fixed model server image building with NVIDIA plugin on a host with NVIDIA Container Toolkit installed.
- Fixed KServe API response to include the DAG pipeline name for calls to a DAG – based on the API definition, the response includes the servable name. In case of DAG processing, it will now return the pipeline name instead of an empty value.
- Default number of gRPC and REST workers will be calculated correctly based on allocated CPU cores – when the model server is started in the docker container with constrained CPU allocation, the default number of the frontend threads will be set more efficiently.
- Corrected reporting the number of streams in the metrics while using non-CPU plugins – before fixing that bug, a zero value was returned. That metric suggests the optimal number of active parallel inferences calls for the best throughput performance.
- Fixed handling model mapping with model reloads.
- Fixed handling model mapping with dynamic shape/batch size.
- ovmsclient is not causing conflicts with tensorflow-serving-api package installation in the same python environment.
- Fixed debug image building.
- Fixed C-API demo building.
- Added security fixes.
Other changes:
- Updated OpenCV version to 4.7 – OpenCV is an included dependency for image transformation in the custom nodes and for JPEG/PNG input decoding.
- Lengthened the request waiting timeout during DAG reloads. On slower machines, the timeout was sporadically reached during DAG configuration reloads, ending in unsuccessful requests.
- ovmsclient has more relaxed requirements related to numpy version.
- Improved unit tests stability.
- Improved documentation.
You can use the OpenVINO Model Server public Docker images based on Ubuntu via the following commands:
docker pull openvino/model_server:2022.3.0.1
docker pull openvino/model_server:2022.3.0.1-gpu
or use the provided binary packages.