Commit
Update mediapipe docs and demo (#2422)
michalkulakowski authored Apr 25, 2024
1 parent 9030876 commit d576c6c
Showing 9 changed files with 73 additions and 52 deletions.
4 changes: 3 additions & 1 deletion demos/image_classification_with_string_output/README.md
@@ -8,6 +8,8 @@ The script below is downloading a public MobileNet model trained on the ImageNet
This is a very handy feature because it allows us to export the model with the pre/post-processing functions included as model layers. The client simply receives string data with the label name for the classified image.

```bash
git clone https://github.com/openvinotoolkit/model_server.git
cd model_server/demos/image_classification_with_string_output
pip install -r requirements.txt
python3 download_model.py
rm model/1/fingerprint.pb
@@ -31,7 +33,7 @@ docker run -d -u $(id -u):$(id -g) -v $(pwd):/workspace -p 8000:8000 openvino/mo
## Send request
Use the example client to send requests containing images via the KServe REST API:
```bash
python3 image_classification_with_string_output.py
python3 image_classification_with_string_output.py --http_port 8000
```
Requests may also be sent using other APIs (KServe gRPC, TFS). In the sections linked below you can find short code samples showing how to do this:
- [TensorFlow Serving API](../../docs/clients_tfs.md)
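For reference, below is a minimal sketch of the same request sent over the KServe gRPC API with `tritonclient`. The service address, model name, image path, and input/output tensor names are placeholders, not values taken from this demo — check the demo client and the docker command above for the actual ones.

```python
# Hedged sketch of a gRPC variant of this demo's client.
# Port, model name, input/output names, and image path are assumptions.
import numpy as np
import tritonclient.grpc as grpcclient
from PIL import Image

client = grpcclient.InferenceServerClient("localhost:9000")  # assumed gRPC port

# Load an image as an HWC uint8 array; the exported model includes its own pre-processing layers.
image = np.array(Image.open("zebra.jpeg").convert("RGB").resize((224, 224)), dtype=np.uint8)

infer_input = grpcclient.InferInput("image", list(image.shape), "UINT8")
infer_input.set_data_from_numpy(image)

results = client.infer("mobilenet", [infer_input])
# The output is a string (BYTES) tensor holding the label name.
print(results.as_numpy("label"))
```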
31 changes: 31 additions & 0 deletions demos/python_demos/llm_text_generation/README.md
@@ -177,6 +177,37 @@ Time per generated token 20.0 ms
Total time 6822 ms
```
### Use KServe REST API with curl
Run OVMS:
```bash
docker run -d --rm -p 8000:8000 -v ${PWD}/servable_unary:/workspace -v ${PWD}/${SELECTED_MODEL}:/model \
-e SELECTED_MODEL=${SELECTED_MODEL} openvino/model_server:py --config_path /workspace/config.json --rest_port 8000
```
Send a request using curl:
```bash
curl --header "Content-Type: application/json" --data '{"inputs":[{"name" : "pre_prompt", "shape" : [1], "datatype" : "BYTES", "data" : ["What is the theory of relativity?"]}]}' localhost:8000/v2/models/python_model/infer
```
Example output:
```bash
{
"model_name": "python_model",
"outputs": [{
"name": "token_count",
"shape": [1],
"datatype": "INT32",
"data": [249]
}, {
"name": "completion",
"shape": [1],
"datatype": "BYTES",
"data": ["The theory of relativity is a long-standing field of physics which states that the behavior of matter and energy in relation to space and time is influenced by the principles of special theory of relativity and general theory of relativity. It proposes that gravity is a purely mathematical construct (as opposed to a physical reality), which affects distant masses on superluminal speeds just as they would alter objects on Earth moving at light speed. According to the theory, space and time are more fluid than we perceive them to be, with phenomena like lensing causing distortions that cannot be explained through more traditional laws of physics. Since its introduction in 1905, it has revolutionized the way we understand the world and has shed fresh light on important concepts in modern scientific thought, such as causality, time dilation, and the nature of space-time. The theory was proposed by Albert Einstein in an article published in the British journal 'Philosophical Transactions of the Royal Society A' in 1915, although his findings were first formulated in his 1907 book 'Einstein: Photography & Poetry,' where he introduced the concept of equivalence principle."]
}]
}
```
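The same REST call can also be made from Python. Below is a minimal sketch using the `requests` package, with the payload and endpoint taken from the curl example above:

```python
import requests

payload = {
    "inputs": [{
        "name": "pre_prompt",
        "shape": [1],
        "datatype": "BYTES",
        "data": ["What is the theory of relativity?"]
    }]
}

response = requests.post("http://localhost:8000/v2/models/python_model/infer", json=payload)
result = response.json()

# Look up outputs by name instead of relying on their order in the response.
outputs = {output["name"]: output["data"] for output in result["outputs"]}
print(outputs["completion"][0])
print("token_count:", outputs["token_count"][0])
```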
## Run a client with gRPC streaming
### Deploy OpenVINO Model Server with the Python Calculator
15 changes: 5 additions & 10 deletions demos/python_demos/llm_text_generation/client_stream.py
@@ -61,16 +61,11 @@ def callback(result, error):
elif result.as_numpy('token_count') is not None:
token_count[0] = result.as_numpy('token_count')[0]
elif result.as_numpy('completion') is not None:
if len(prompts) == 1:
# For single batch, partial response is represented as single buffer of bytes
print(result.as_numpy('completion').tobytes().decode(), flush=True, end='')
else:
# For multi batch, responses are packed in 4byte len tritonclient format
os.system('clear')
for i, completion in enumerate(deserialize_bytes_tensor(result._result.raw_output_contents[0])):
completions[i] += completion.decode()
print(completions[i])
print()
os.system('cls' if os.name=='nt' else 'clear')
for i, completion in enumerate(deserialize_bytes_tensor(result._result.raw_output_contents[0])):
completions[i] += completion.decode()
print(completions[i])
print()
duration = int((endtime - start_time).total_seconds() * 1000)
processing_times = np.append(processing_times, duration)
start_time = datetime.datetime.now()
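For context on the change above: the partial responses are now always packed in the tritonclient length-prefixed BYTES format, regardless of batch size. Below is a rough illustration of that layout as a manual re-implementation for explanation only — the demo itself uses `deserialize_bytes_tensor` from `tritonclient.utils`, and the 4-byte little-endian length prefix is an assumption of this sketch.

```python
import struct

def unpack_bytes_tensor(raw: bytes) -> list:
    """Each element is a 4-byte (assumed little-endian) length followed by that many bytes."""
    elements = []
    offset = 0
    while offset < len(raw):
        (length,) = struct.unpack_from("<I", raw, offset)
        offset += 4
        elements.append(raw[offset:offset + length])
        offset += length
    return elements

# Example: two partial completions packed back-to-back.
packed = struct.pack("<I", 5) + b"Hello" + struct.pack("<I", 6) + b" world"
print([element.decode() for element in unpack_bytes_tensor(packed)])  # ['Hello', ' world']
```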
13 changes: 5 additions & 8 deletions demos/python_demos/llm_text_generation/client_unary.py
@@ -39,14 +39,11 @@
start_time = datetime.datetime.now()
results = client.infer("python_model", [infer_input], client_timeout=10*60) # 10 minutes
endtime = datetime.datetime.now()
if len(args['prompt']) == 1:
print(f"Question:\n{args['prompt'][0]}\n\nCompletion:\n{results.as_numpy('completion').tobytes().decode()}\n")
else:
for i, arr in enumerate(deserialize_bytes_tensor(results.as_numpy("completion"))):
if i < len(args['prompt']):
print(f"==== Prompt: {args['prompt'][i]} ====")
print(arr.decode())
print()
for i, arr in enumerate(results.as_numpy("completion")):
if i < len(args['prompt']):
print(f"==== Prompt: {args['prompt'][i]} ====")
print(arr.decode())
print()
print("Number of tokens ", results.as_numpy("token_count")[0])
print("Generated tokens per second ", round(results.as_numpy("token_count")[0] / int((endtime - start_time).total_seconds()), 2))
print("Time per generated token ", round(int((endtime - start_time).total_seconds()) / results.as_numpy("token_count")[0] * 1000, 2), "ms")
@@ -122,17 +122,13 @@ def convert_history_to_text(history):


def deserialize_prompts(batch_size, input_tensor):
if batch_size == 1:
return [bytes(input_tensor).decode()]
np_arr = deserialize_bytes_tensor(bytes(input_tensor))
return [arr.decode() for arr in np_arr]


def serialize_completions(batch_size, result):
if batch_size == 1:
return [Tensor("completion", result.encode())]
return [Tensor("completion", serialize_byte_tensor(
np.array(result, dtype=np.object_)).item())]
np.array(result, dtype=np.object_)).item(), shape=[batch_size], datatype="BYTES")]


class OvmsPythonModel:
@@ -114,18 +114,13 @@ def convert_history_to_text(history):


def deserialize_prompts(batch_size, input_tensor):
if batch_size == 1:
return [bytes(input_tensor).decode()]
np_arr = deserialize_bytes_tensor(bytes(input_tensor))
return [arr.decode() for arr in np_arr]


def serialize_completions(batch_size, result, token_count):
if batch_size == 1:
return [Tensor("completion", result[0].encode()), Tensor("token_count", np.array(token_count, dtype=np.int32))]
return [Tensor("completion", serialize_byte_tensor(
np.array(result, dtype=np.object_)).item()), Tensor("token_count", np.array(token_count, dtype=np.int32))]

np.array(result, dtype=np.object_)).item(), shape=[batch_size], datatype="BYTES"), Tensor("token_count", np.array(token_count, dtype=np.int32))]

class OvmsPythonModel:
def initialize(self, kwargs: dict):
5 changes: 0 additions & 5 deletions demos/python_demos/llm_text_generation/utils.py
@@ -20,11 +20,6 @@

def serialize_prompts(prompts):
infer_input = grpcclient.InferInput("pre_prompt", [len(prompts)], "BYTES")
if len(prompts) == 1:
# Single batch serialized directly as bytes
infer_input._raw_content = prompts[0].encode()
return infer_input
# Multi batch serialized in tritonclient 4byte len format
infer_input._raw_content = serialize_byte_tensor(
np.array(prompts, dtype=np.object_)).item()
return infer_input
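For reference, a short sketch of how the batch packed by `serialize_prompts` can be round-tripped with the same tritonclient helpers the demo imports; the prompt strings are arbitrary examples.

```python
import numpy as np
from tritonclient.utils import serialize_byte_tensor, deserialize_bytes_tensor

prompts = ["What is OpenVINO?", "What is the theory of relativity?"]

# Pack the whole batch in the tritonclient length-prefixed BYTES format,
# as serialize_prompts now does for every batch size.
packed = serialize_byte_tensor(np.array(prompts, dtype=np.object_)).item()

# Unpacking restores the original strings; this is what the servable does on the server side.
restored = [arr.decode() for arr in deserialize_bytes_tensor(packed)]
assert restored == prompts
```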
14 changes: 7 additions & 7 deletions docs/mediapipe.md
@@ -12,7 +12,7 @@ MediaPipe is an open-source framework for building pipelines to perform inferenc

Thanks to the integration between MediaPipe and OpenVINO Model Server, the graphs can be exposed over the network and the complete load can be delegated to a remote host or a microservice.
We support the following scenarios:
- stateless execution via unary to unary gRPC calls
- stateless execution via unary to unary gRPC/REST calls
- stateful graph execution via [gRPC streaming sessions](./streaming_endpoints.md).

With the introduction of the OpenVINO calculator it is possible to optimize inference execution in the OpenVINO Runtime backend. This calculator can be used both in graphs deployed inside the Model Server and in standalone applications using the MediaPipe framework.
@@ -79,10 +79,10 @@ The required data layout for the MediaPipe `IMAGE` conversion is HWC and the sup
|UINT16|1,3,4|
|INT16|1,3,4|

> **Note**: Input serialization to MediaPipe ImageFrame format, requires the data in the KServe request to be encapsulated in `raw_input_contents` field based on [KServe API](https://github.com/kserve/kserve/blob/master/docs/predict-api/v2/grpc_predict_v2.proto). That is the default behavior in the client libs like `triton-client`.
> **Note**: Input serialization to the MediaPipe ImageFrame format requires the data in the KServe request to be encapsulated in the `raw_input_contents` field based on the [KServe gRPC API](https://github.com/kserve/kserve/blob/master/docs/predict-api/v2/grpc_predict_v2.proto) or in the binary extension based on the [KServe REST API](./binary_input_kfs.md#http). That is the default behavior in client libraries like `triton-client`.
When the client is sending in the gRPC request the input as an numpy array, it will be deserialized on the Model Server side to the format specified in the graph.
For example when the graph has the input type IMAGE, the gRPC client could send the input data with the shape `(300, 300, 3)` and precision INT8. It would not be allowed to send the data in the shape for example `(1,300,300,1)` as that would be incorrect layout and the number of dimensions.
When the client sends the input in the gRPC/REST request as a numpy array, it will be deserialized on the Model Server side to the format specified in the graph.
For example, when the graph has an input of type IMAGE, the gRPC/REST client could send the input data with the shape `(300, 300, 3)` and precision INT8. It would not be allowed to send the data with, for example, the shape `(1,300,300,1)`, as that would be an incorrect layout and number of dimensions.

When the graph input is set as `OVTENSOR`, any input shape and precision is allowed. The data will be converted to an `ov::Tensor` object and passed to the graph. For example, the input can have shape `(1,3,300,300)` and precision `FP32`. If the passed tensor is not accepted by the model, the calculator and graph will return an error.
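A minimal sketch of sending such an input with the gRPC client follows; the graph name (`myGraph`) and the input/output stream names (`input`, `output`) are placeholders for whatever the deployed graph configuration defines.

```python
# Placeholder graph and stream names; adjust them to the deployed MediaPipe graph.
import numpy as np
import tritonclient.grpc as grpcclient

client = grpcclient.InferenceServerClient("localhost:9000")

# For an OVTENSOR input any shape and precision is accepted, e.g. NCHW FP32.
data = np.zeros((1, 3, 300, 300), dtype=np.float32)
infer_input = grpcclient.InferInput("input", list(data.shape), "FP32")
infer_input.set_data_from_numpy(data)

results = client.infer("myGraph", [infer_input])
print(results.as_numpy("output").shape)
```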

@@ -94,7 +94,7 @@ There is also an option to avoid any data conversions in the serialization and d

### Side packets
Side packets are special parameters which can be passed to the calculators at the beginning of graph initialization. They can tune the behavior of a calculator, for example by setting the object detection threshold or the number of objects to process.
With KServe gRPC API you are also able to push side input packets into graph. They are to be passed as KServe request parameters. They can be of type `string`, `int64` or `boolean`.
With the KServe API you are also able to push side input packets into the graph. They are passed as KServe request parameters and can be of type `string`, `int64` or `boolean`.
Note that with the gRPC stream connection, only the first request in the stream can include the side packet parameters. On the client side, the snippet below illustrates how it can be defined:
```python
client.async_stream_infer(
@@ -207,7 +207,7 @@ It can generate the load to gRPC stream and the mediapipe graph based on the con

## Using MediaPipe graphs from the remote client

MediaPipe graphs can use the same gRPC KServe Inference API both for the unary calls and the streaming.
MediaPipe graphs can use the same gRPC/REST KServe Inference API both for the unary calls and the streaming.
The same client libraries with KServe API support can be used in both cases, although the client code for unary and streaming calls is different.
Check the [code snippets](https://docs.openvino.ai/2024/ovms_docs_clients_kfs.html)

@@ -261,7 +261,7 @@ in the conditions:default section of the deps property:


## Current limitations
- MediaPipe graphs are supported only for gRPC KServe API.
- Inputs of type string are supported only for inputs tagged as OVMS_PY_TENSOR.

- KServe ModelMetadata call response contains only input and output names. In the response, shapes will be empty and datatypes will be `"INVALID"`.
