Commit
Update mediapipe docs and demo (#2422)
michalkulakowski authored Apr 25, 2024
1 parent 9030876 commit d576c6c
Showing 9 changed files with 73 additions and 52 deletions.
4 changes: 3 additions & 1 deletion demos/image_classification_with_string_output/README.md
@@ -8,6 +8,8 @@ The script below is downloading a public MobileNet model trained on the ImageNet
This is a very handy feature because it allows us to export the model with the pre/post-processing functions included as model layers. The client simply receives string data with the label name for the classified image.

```bash
git clone https://github.com/openvinotoolkit/model_server.git
cd model_server/demos/image_classification_with_string_output
pip install -r requirements.txt
python3 download_model.py
rm model/1/fingerprint.pb
@@ -31,7 +33,7 @@ docker run -d -u $(id -u):$(id -g) -v $(pwd):/workspace -p 8000:8000 openvino/mo
## Send request
Use the example client to send requests containing images via the KServe REST API:
```bash
python3 image_classification_with_string_output.py
python3 image_classification_with_string_output.py --http_port 8000
```
Requests may also be sent using other APIs (KServe gRPC, TFS). In the sections linked below you can find short code samples showing how to do this:
- [TensorFlow Serving API](../../docs/clients_tfs.md)
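For reference, below is a minimal sketch of the same request sent over the KServe gRPC API with `tritonclient`. The service address, model name, image path, and input/output tensor names are placeholders, not values taken from this demo — check the demo client and the docker command above for the actual ones.

```python
# Hedged sketch of a gRPC variant of this demo's client.
# Port, model name, input/output names, and image path are assumptions.
import numpy as np
import tritonclient.grpc as grpcclient
from PIL import Image

client = grpcclient.InferenceServerClient("localhost:9000")  # assumed gRPC port

# Load an image as an HWC uint8 array; the exported model includes its own pre-processing layers.
image = np.array(Image.open("zebra.jpeg").convert("RGB").resize((224, 224)), dtype=np.uint8)

infer_input = grpcclient.InferInput("image", list(image.shape), "UINT8")
infer_input.set_data_from_numpy(image)

results = client.infer("mobilenet", [infer_input])
# The output is a string (BYTES) tensor holding the label name.
print(results.as_numpy("label"))
```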
31 changes: 31 additions & 0 deletions demos/python_demos/llm_text_generation/README.md
@@ -177,6 +177,37 @@ Time per generated token 20.0 ms
Total time 6822 ms
```
### Use KServe REST API with curl
Run OVMS:
```bash
docker run -d --rm -p 8000:8000 -v ${PWD}/servable_unary:/workspace -v ${PWD}/${SELECTED_MODEL}:/model \
-e SELECTED_MODEL=${SELECTED_MODEL} openvino/model_server:py --config_path /workspace/config.json --rest_port 8000
```
Send a request using curl:
```bash
curl --header "Content-Type: application/json" --data '{"inputs":[{"name" : "pre_prompt", "shape" : [1], "datatype" : "BYTES", "data" : ["What is the theory of relativity?"]}]}' localhost:8000/v2/models/python_model/infer
```
Example output:
```bash
{
"model_name": "python_model",
"outputs": [{
"name": "token_count",
"shape": [1],
"datatype": "INT32",
"data": [249]
}, {
"name": "completion",
"shape": [1],
"datatype": "BYTES",
"data": ["The theory of relativity is a long-standing field of physics which states that the behavior of matter and energy in relation to space and time is influenced by the principles of special theory of relativity and general theory of relativity. It proposes that gravity is a purely mathematical construct (as opposed to a physical reality), which affects distant masses on superluminal speeds just as they would alter objects on Earth moving at light speed. According to the theory, space and time are more fluid than we perceive them to be, with phenomena like lensing causing distortions that cannot be explained through more traditional laws of physics. Since its introduction in 1905, it has revolutionized the way we understand the world and has shed fresh light on important concepts in modern scientific thought, such as causality, time dilation, and the nature of space-time. The theory was proposed by Albert Einstein in an article published in the British journal 'Philosophical Transactions of the Royal Society A' in 1915, although his findings were first formulated in his 1907 book 'Einstein: Photography & Poetry,' where he introduced the concept of equivalence principle."]
}]
}
```
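The same REST call can also be made from Python. Below is a minimal sketch using the `requests` package, with the payload and endpoint taken from the curl example above:

```python
import requests

payload = {
    "inputs": [{
        "name": "pre_prompt",
        "shape": [1],
        "datatype": "BYTES",
        "data": ["What is the theory of relativity?"]
    }]
}

response = requests.post("http://localhost:8000/v2/models/python_model/infer", json=payload)
result = response.json()

# Look up outputs by name instead of relying on their order in the response.
outputs = {output["name"]: output["data"] for output in result["outputs"]}
print(outputs["completion"][0])
print("token_count:", outputs["token_count"][0])
```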
## Run a client with gRPC streaming
### Deploy OpenVINO Model Server with the Python Calculator
15 changes: 5 additions & 10 deletions demos/python_demos/llm_text_generation/client_stream.py
@@ -61,16 +61,11 @@ def callback(result, error):
elif result.as_numpy('token_count') is not None:
token_count[0] = result.as_numpy('token_count')[0]
elif result.as_numpy('completion') is not None:
if len(prompts) == 1:
# For single batch, partial response is represented as single buffer of bytes
print(result.as_numpy('completion').tobytes().decode(), flush=True, end='')
else:
# For multi batch, responses are packed in 4byte len tritonclient format
os.system('clear')
for i, completion in enumerate(deserialize_bytes_tensor(result._result.raw_output_contents[0])):
completions[i] += completion.decode()
print(completions[i])
print()
os.system('cls' if os.name=='nt' else 'clear')
for i, completion in enumerate(deserialize_bytes_tensor(result._result.raw_output_contents[0])):
completions[i] += completion.decode()
print(completions[i])
print()
duration = int((endtime - start_time).total_seconds() * 1000)
processing_times = np.append(processing_times, duration)
start_time = datetime.datetime.now()
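For context on the change above: the partial responses are now always packed in the tritonclient length-prefixed BYTES format, regardless of batch size. Below is a rough illustration of that layout as a manual re-implementation for explanation only — the demo itself uses `deserialize_bytes_tensor` from `tritonclient.utils`, and the 4-byte little-endian length prefix is an assumption of this sketch.

```python
import struct

def unpack_bytes_tensor(raw: bytes) -> list:
    """Each element is a 4-byte (assumed little-endian) length followed by that many bytes."""
    elements = []
    offset = 0
    while offset < len(raw):
        (length,) = struct.unpack_from("<I", raw, offset)
        offset += 4
        elements.append(raw[offset:offset + length])
        offset += length
    return elements

# Example: two partial completions packed back-to-back.
packed = struct.pack("<I", 5) + b"Hello" + struct.pack("<I", 6) + b" world"
print([element.decode() for element in unpack_bytes_tensor(packed)])  # ['Hello', ' world']
```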
13 changes: 5 additions & 8 deletions demos/python_demos/llm_text_generation/client_unary.py
@@ -39,14 +39,11 @@
start_time = datetime.datetime.now()
results = client.infer("python_model", [infer_input], client_timeout=10*60) # 10 minutes
endtime = datetime.datetime.now()
if len(args['prompt']) == 1:
print(f"Question:\n{args['prompt'][0]}\n\nCompletion:\n{results.as_numpy('completion').tobytes().decode()}\n")
else:
for i, arr in enumerate(deserialize_bytes_tensor(results.as_numpy("completion"))):
if i < len(args['prompt']):
print(f"==== Prompt: {args['prompt'][i]} ====")
print(arr.decode())
print()
for i, arr in enumerate(results.as_numpy("completion")):
if i < len(args['prompt']):
print(f"==== Prompt: {args['prompt'][i]} ====")
print(arr.decode())
print()
print("Number of tokens ", results.as_numpy("token_count")[0])
print("Generated tokens per second ", round(results.as_numpy("token_count")[0] / int((endtime - start_time).total_seconds()), 2))
print("Time per generated token ", round(int((endtime - start_time).total_seconds()) / results.as_numpy("token_count")[0] * 1000, 2), "ms")
@@ -122,17 +122,13 @@ def convert_history_to_text(history):


def deserialize_prompts(batch_size, input_tensor):
if batch_size == 1:
return [bytes(input_tensor).decode()]
np_arr = deserialize_bytes_tensor(bytes(input_tensor))
return [arr.decode() for arr in np_arr]


def serialize_completions(batch_size, result):
if batch_size == 1:
return [Tensor("completion", result.encode())]
return [Tensor("completion", serialize_byte_tensor(
np.array(result, dtype=np.object_)).item())]
np.array(result, dtype=np.object_)).item(), shape=[batch_size], datatype="BYTES")]


class OvmsPythonModel:
@@ -114,18 +114,13 @@ def convert_history_to_text(history):


def deserialize_prompts(batch_size, input_tensor):
if batch_size == 1:
return [bytes(input_tensor).decode()]
np_arr = deserialize_bytes_tensor(bytes(input_tensor))
return [arr.decode() for arr in np_arr]


def serialize_completions(batch_size, result, token_count):
if batch_size == 1:
return [Tensor("completion", result[0].encode()), Tensor("token_count", np.array(token_count, dtype=np.int32))]
return [Tensor("completion", serialize_byte_tensor(
np.array(result, dtype=np.object_)).item()), Tensor("token_count", np.array(token_count, dtype=np.int32))]

np.array(result, dtype=np.object_)).item(), shape=[batch_size], datatype="BYTES"), Tensor("token_count", np.array(token_count, dtype=np.int32))]

class OvmsPythonModel:
def initialize(self, kwargs: dict):
5 changes: 0 additions & 5 deletions demos/python_demos/llm_text_generation/utils.py
@@ -20,11 +20,6 @@

def serialize_prompts(prompts):
infer_input = grpcclient.InferInput("pre_prompt", [len(prompts)], "BYTES")
if len(prompts) == 1:
# Single batch serialized directly as bytes
infer_input._raw_content = prompts[0].encode()
return infer_input
# Multi batch serialized in tritonclient 4byte len format
infer_input._raw_content = serialize_byte_tensor(
np.array(prompts, dtype=np.object_)).item()
return infer_input
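For reference, a short sketch of how the batch packed by `serialize_prompts` can be round-tripped with the same tritonclient helpers the demo imports; the prompt strings are arbitrary examples.

```python
import numpy as np
from tritonclient.utils import serialize_byte_tensor, deserialize_bytes_tensor

prompts = ["What is OpenVINO?", "What is the theory of relativity?"]

# Pack the whole batch in the tritonclient length-prefixed BYTES format,
# as serialize_prompts now does for every batch size.
packed = serialize_byte_tensor(np.array(prompts, dtype=np.object_)).item()

# Unpacking restores the original strings; this is what the servable does on the server side.
restored = [arr.decode() for arr in deserialize_bytes_tensor(packed)]
assert restored == prompts
```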
14 changes: 7 additions & 7 deletions docs/mediapipe.md
@@ -12,7 +12,7 @@ MediaPipe is an open-source framework for building pipelines to perform inferenc

Thanks to the integration between MediaPipe and OpenVINO Model Server, the graphs can be exposed over the network and the complete load can be delegated to a remote host or a microservice.
We support the following scenarios:
- stateless execution via unary to unary gRPC calls
- stateless execution via unary to unary gRPC/REST calls
- stateful graph execution via [gRPC streaming sessions](./streaming_endpoints.md).

With the introduction of the OpenVINO calculator it is possible to optimize inference execution in the OpenVINO Runtime backend. This calculator can be used both in graphs deployed inside the Model Server and in standalone applications using the MediaPipe framework.
@@ -79,10 +79,10 @@ The required data layout for the MediaPipe `IMAGE` conversion is HWC and the sup
|UINT16|1,3,4|
|INT16|1,3,4|

> **Note**: Input serialization to MediaPipe ImageFrame format, requires the data in the KServe request to be encapsulated in `raw_input_contents` field based on [KServe API](https://github.com/kserve/kserve/blob/master/docs/predict-api/v2/grpc_predict_v2.proto). That is the default behavior in the client libs like `triton-client`.
> **Note**: Input serialization to the MediaPipe ImageFrame format requires the data in the KServe request to be encapsulated in the `raw_input_contents` field based on the [KServe gRPC API](https://github.com/kserve/kserve/blob/master/docs/predict-api/v2/grpc_predict_v2.proto) or in the binary extension based on the [KServe REST API](./binary_input_kfs.md#http). That is the default behavior in client libraries like `triton-client`.
When the client is sending in the gRPC request the input as an numpy array, it will be deserialized on the Model Server side to the format specified in the graph.
For example when the graph has the input type IMAGE, the gRPC client could send the input data with the shape `(300, 300, 3)` and precision INT8. It would not be allowed to send the data in the shape for example `(1,300,300,1)` as that would be incorrect layout and the number of dimensions.
When the client sends the input in the gRPC/REST request as a numpy array, it will be deserialized on the Model Server side to the format specified in the graph.
For example, when the graph has an input of type IMAGE, the gRPC/REST client could send the input data with the shape `(300, 300, 3)` and precision INT8. It would not be allowed to send the data with, for example, the shape `(1,300,300,1)`, as that would be an incorrect layout and number of dimensions.

When the graph input is set as `OVTENSOR`, any input shape and precision is allowed. The data will be converted to an `ov::Tensor` object and passed to the graph. For example, the input can have shape `(1,3,300,300)` and precision `FP32`. If the passed tensor is not accepted by the model, the calculator and graph will return an error.
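A minimal sketch of sending such an input with the gRPC client follows; the graph name (`myGraph`) and the input/output stream names (`input`, `output`) are placeholders for whatever the deployed graph configuration defines.

```python
# Placeholder graph and stream names; adjust them to the deployed MediaPipe graph.
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient("localhost:9000")

# For an OVTENSOR input any shape and precision is accepted, e.g. NCHW FP32.
data = np.zeros((1, 3, 300, 300), dtype=np.float32)
infer_input = grpcclient.InferInput("input", list(data.shape), "FP32")
infer_input.set_data_from_numpy(data)

results = client.infer("myGraph", [infer_input])
print(results.as_numpy("output").shape)
```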

@@ -94,7 +94,7 @@ There is also an option to avoid any data conversions in the serialization and d

### Side packets
Side packets are special parameters which can be passed to the calculators at the beginning of graph initialization. They can tune the behavior of a calculator, for example by setting the object detection threshold or the number of objects to process.
With KServe gRPC API you are also able to push side input packets into graph. They are to be passed as KServe request parameters. They can be of type `string`, `int64` or `boolean`.
With the KServe API you are also able to push side input packets into the graph. They are passed as KServe request parameters and can be of type `string`, `int64` or `boolean`.
Note that with the gRPC stream connection, only the first request in the stream can include the side packet parameters. On the client side, the snippet below illustrates how it can be defined:
```python
client.async_stream_infer(
@@ -207,7 +207,7 @@ It can generate the load to gRPC stream and the mediapipe graph based on the con

## Using MediaPipe graphs from the remote client

MediaPipe graphs can use the same gRPC KServe Inference API both for the unary calls and the streaming.
MediaPipe graphs can use the same gRPC/REST KServe Inference API both for the unary calls and the streaming.
The same client libraries with KServe API support can be used in both cases, although the client code for unary and streaming calls is different.
Check the [code snippets](https://docs.openvino.ai/2024/ovms_docs_clients_kfs.html)

@@ -261,7 +261,7 @@ in the conditions:default section of the deps property:


## Current limitations
- MediaPipe graphs are supported only for gRPC KServe API.
- Inputs of type string are supported only for inputs tagged as OVMS_PY_TENSOR.

- KServe ModelMetadata call response contains only input and output names. In the response, shapes will be empty and datatypes will be `"INVALID"`.
