HTTP/REST

The HTTP/REST API uses JSON because it is widely supported and language independent. In all JSON schemas shown in this document, $number, $string, $boolean, $object and $array refer to the fundamental JSON types, and #optional indicates an optional JSON field. For sample requests and responses, see Inference Request Examples.

See also: The HTTP/REST endpoints are defined in open_inference_rest.yaml

| API | Verb | Path | Request Payload | Response Payload |
| --- | --- | --- | --- | --- |
| Inference | POST | v2/models/<model_name>[/versions/<model_version>]/infer | $inference_request | $inference_response |
| Model Metadata | GET | v2/models/<model_name>[/versions/<model_version>] | | $metadata_model_response |
| Server Ready | GET | v2/health/ready | | $ready_server_response |
| Server Live | GET | v2/health/live | | $live_server_response |
| Server Metadata | GET | v2 | | $metadata_server_response |
| Model Ready | GET | v2/models/<model_name>[/versions/<model_version>]/ready | | $ready_model_response |

** path contents in [] are optional

For more information regarding payload contents, see Payload Contents.

The versions portion of the path (shown in []) is optional so that implementations that don't support versioning can omit it, and so that a user can leave the version unspecified, in which case the server chooses a version based on its own policies. For example, if a model does not implement versioning, the Model Metadata request path could look like v2/models/my_model. If the model has been configured with versions, the request path could look like v2/models/my_model/versions/v10, where the version of the model is v10.
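The path rules above can be sketched as a small helper. This is only an illustration: the function name `model_path` and its `action` argument are hypothetical, and percent-encoding each segment is an assumption of this sketch rather than something the protocol text requires.

```python
from typing import Optional
from urllib.parse import quote


def model_path(model_name: str, model_version: Optional[str] = None, action: str = "") -> str:
    """Build a v2 model path. `action` is "" (metadata), "ready", or "infer"."""
    if action not in ("", "ready", "infer"):
        raise ValueError(f"unsupported action: {action!r}")
    # Percent-encode each segment so a name containing "/" cannot change the path.
    path = "v2/models/" + quote(model_name, safe="")
    # The /versions/<model_version> portion is optional.
    if model_version is not None:
        path += "/versions/" + quote(model_version, safe="")
    if action:
        path += "/" + action
    return path
```

Calling `model_path("my_model")` gives the unversioned metadata path, and `model_path("my_model", "v10", "infer")` gives the versioned inference path.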

API Definitions

| API | Definition |
| --- | --- |
| Inference | The /infer endpoint performs inference on a model. The response is the prediction result. |
| Model Metadata | The "model metadata" API is a per-model endpoint that returns details about the model passed in the path. |
| Server Ready | The "server ready" health API indicates if all the models are ready for inferencing. The "server ready" health API can be used directly to implement the Kubernetes readinessProbe. |
| Server Live | The "server live" health API indicates if the inference server is able to receive and respond to metadata and inference requests. The "server live" API can be used directly to implement the Kubernetes livenessProbe. |
| Server Metadata | The "server metadata" API returns details describing the server. |
| Model Ready | The "model ready" health API indicates if a specific model is ready for inferencing. The model name and (optionally) version must be available in the URL. |

Health/Readiness/Liveness Probes

The Model Readiness probe answers the question "Was the model successfully downloaded and loaded onto the server so that it can run inference requests?" and responds with the available model name(s). The Server Readiness/Liveness probes answer the question "Is my service and its infrastructure running, healthy, and able to receive and process requests?"

To read more about liveness and readiness probe concepts, visit the Configure Liveness, Readiness and Startup Probes Kubernetes documentation.

Payload Contents

Model Ready

The model ready endpoint returns the readiness probe response for a specific model along with the name of that model.

Model Ready Response JSON Object

$ready_model_response =
{
  "name" : $string,
  "ready" : $boolean
}

Server Ready

The server ready endpoint returns the readiness probe response for the server.

Server Ready Response JSON Object

$ready_server_response =
{
  "ready" : $boolean
}

Server Live

The server live endpoint returns the liveness probe response for the server.

Server Live Response JSON Object

$live_server_response =
{
  "live" : $boolean
}

Server Metadata

The server metadata endpoint provides information about the server. A server metadata request is made with an HTTP GET to a server metadata endpoint. In the corresponding response the HTTP body contains the Server Metadata Response JSON Object or the Server Metadata Response JSON Error Object.

Server Metadata Response JSON Object

A successful server metadata request is indicated by a 200 HTTP status code. The server metadata response object, identified as $metadata_server_response, is returned in the HTTP body.

$metadata_server_response =
{
  "name" : $string,
  "version" : $string,
  "extensions" : [ $string, ... ]
}
  • "name" : A descriptive name for the server.
  • "version" : The server version.
  • "extensions" : The extensions supported by the server. Currently, no standard extensions are defined. Individual inference servers may define and document their own extensions.

Server Metadata Response JSON Error Object

A failed server metadata request must be indicated by an HTTP error status (typically 400). The HTTP body must contain the $metadata_server_error_response object.

$metadata_server_error_response =
{
  "error": $string
}
  • "error" : The descriptive message for the error.
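The success and error cases above can be handled in one place on the client. The following is a minimal sketch, assuming the caller already has the HTTP status code and body as a string. The names `InferenceServerError` and `parse_json_response` are hypothetical and not part of the protocol.

```python
import json
from typing import Any, Dict


class InferenceServerError(Exception):
    """Raised when the server reports a failure (hypothetical helper type)."""


def parse_json_response(status: int, body: str) -> Dict[str, Any]:
    """Return the JSON object on HTTP 200, otherwise raise with the error message."""
    try:
        payload = json.loads(body)
    except ValueError:
        # The body was not valid JSON; keep the raw text for the error message.
        payload = None
    if status == 200 and isinstance(payload, dict):
        return payload
    # Failed requests are expected to carry {"error": <message>} in the body.
    message = payload.get("error") if isinstance(payload, dict) else None
    raise InferenceServerError(f"HTTP {status}: {message or body}")
```

Because the error objects for server metadata, model metadata, and inference share the same single "error" field, one helper of this kind can serve all three endpoints.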

Model Metadata

The per-model metadata endpoint provides information about a model. A model metadata request is made with an HTTP GET to a model metadata endpoint. In the corresponding response the HTTP body contains the Model Metadata Response JSON Object or the Model Metadata Response JSON Error Object. The model name and (optionally) version must be available in the URL. If a version is not provided the server may choose a version based on its own policies or return an error.

Model Metadata Response JSON Object

A successful model metadata request is indicated by a 200 HTTP status code. The metadata response object, identified as $metadata_model_response, is returned in the HTTP body for every successful model metadata request.

$metadata_model_response =
{
  "name" : $string,
  "versions" : [ $string, ... ] #optional,
  "platform" : $string,
  "inputs" : [ $metadata_tensor, ... ],
  "outputs" : [ $metadata_tensor, ... ]
}
  • "name" : The name of the model.
  • "versions" : The model versions that may be explicitly requested via the appropriate endpoint. Optional for servers that don't support versions. Optional for models that don't allow a version to be explicitly requested.
  • "platform" : The framework/backend for the model. See Platforms.
  • "inputs" : The inputs required by the model.
  • "outputs" : The outputs produced by the model.

The metadata for each model input and output tensor is described with a $metadata_tensor object.

$metadata_tensor =
{
  "name" : $string,
  "datatype" : $string,
  "shape" : [ $number, ... ]
}
  • "name" : The name of the tensor.
  • "datatype" : The data-type of the tensor elements as defined in Tensor Data Types.
  • "shape" : The shape of the tensor. Variable-size dimensions are specified as -1.
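As an illustration of how a client might use the "shape" field, the sketch below checks a concrete tensor shape against a metadata shape in which -1 marks a variable-size dimension. The function name `shape_matches` is hypothetical, and treating a rank mismatch as a failure is an assumption of this sketch.

```python
from typing import Sequence


def shape_matches(metadata_shape: Sequence[int], actual_shape: Sequence[int]) -> bool:
    """Return True if actual_shape is compatible with the metadata shape."""
    # Assumption: both shapes must have the same number of dimensions.
    if len(metadata_shape) != len(actual_shape):
        return False
    # A -1 in the metadata accepts any size; other dimensions must match exactly.
    return all(m == -1 or m == a for m, a in zip(metadata_shape, actual_shape))
```

For example, a metadata shape of [ -1, 3 ] accepts a batch shaped [ 8, 3 ] but rejects [ 8, 4 ].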

Model Metadata Response JSON Error Object

A failed model metadata request must be indicated by an HTTP error status (typically 400). The HTTP body must contain the $metadata_model_error_response object.

$metadata_model_error_response =
{
  "error": $string
}
  • "error" : The descriptive message for the error.

Platforms

A platform is a string indicating a DL/ML framework or backend. Platform is returned as part of the response to a Model Metadata request but is informational only. The proposed inference APIs are generic relative to the DL/ML framework used by a model, so a client does not need to know the platform of a given model to use the API. Platform names use the format "<project>_<format>". The following platform names are allowed:

  • tensorrt_plan : A TensorRT model encoded as a serialized engine or “plan”.
  • tensorflow_graphdef : A TensorFlow model encoded as a GraphDef.
  • tensorflow_savedmodel : A TensorFlow model encoded as a SavedModel.
  • onnx_onnxv1 : An ONNX model encoded for ONNX Runtime.
  • pytorch_torchscript : A PyTorch model encoded as TorchScript.
  • mxnet_mxnet : An MXNet model.
  • caffe2_netdef : A Caffe2 model encoded as a NetDef.

Inference

An inference request is made with an HTTP POST to an inference endpoint. In the request the HTTP body contains the Inference Request JSON Object. In the corresponding response the HTTP body contains the Inference Response JSON Object or Inference Response JSON Error Object. See Inference Request Examples for some example HTTP/REST requests and responses.

Inference Request JSON Object

The inference request object, identified as $inference_request, is required in the HTTP body of the POST request. The model name and (optionally) version must be available in the URL. If a version is not provided the server may choose a version based on its own policies or return an error.

$inference_request =
{
  "id" : $string #optional,
  "parameters" : $parameters #optional,
  "inputs" : [ $request_input, ... ],
  "outputs" : [ $request_output, ... ] #optional
}
  • "id" : An identifier for this request. Optional, but if specified this identifier must be returned in the response.
  • "parameters" : An object containing zero or more parameters for this inference request expressed as key/value pairs. See Parameters for more information.
  • "inputs" : The input tensors. Each input is described using the $request_input schema defined in Request Input.
  • "outputs" : The output tensors requested for this inference. Each requested output is described using the $request_output schema defined in Request Output. Optional; if not specified, all outputs produced by the model will be returned using default $request_output settings.

Request Input

The $request_input JSON describes an input to the model. If the input is batched, the shape and data must represent the full shape and contents of the entire batch.

$request_input =
{
  "name" : $string,
  "shape" : [ $number, ... ],
  "datatype"  : $string,
  "parameters" : $parameters #optional,
  "data" : $tensor_data
}
  • "name" : The name of the input tensor.
  • "shape" : The shape of the input tensor. Each dimension must be an integer representable as an unsigned 64-bit integer value.
  • "datatype" : The data-type of the input tensor elements as defined in Tensor Data Types.
  • "parameters" : An object containing zero or more parameters for this input expressed as key/value pairs. See Parameters for more information.
  • "data" : The contents of the tensor. See Tensor Data for more information.

Request Output

The $request_output JSON is used to request which output tensors should be returned from the model.

$request_output =
{
  "name" : $string,
  "parameters" : $parameters #optional
}
  • "name" : The name of the output tensor.
  • "parameters" : An object containing zero or more parameters for this output expressed as key/value pairs. See Parameters for more information.
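The request, input, and output schemas above can be assembled on the client roughly as follows. This is a sketch under stated assumptions: the helper name `build_inference_request` and its arguments are hypothetical, and the tensor values are illustrative only.

```python
import json
from typing import Any, Dict, Iterable, Optional

# Fields that every request input must carry according to the schema above.
REQUIRED_INPUT_FIELDS = ("name", "shape", "datatype", "data")


def build_inference_request(
    inputs: Iterable[Dict[str, Any]],
    output_names: Optional[Iterable[str]] = None,
    request_id: Optional[str] = None,
    parameters: Optional[Dict[str, Any]] = None,
) -> Dict[str, Any]:
    """Assemble an inference request dict, omitting optional fields left unset."""
    request: Dict[str, Any] = {"inputs": []}
    for tensor in inputs:
        missing = [f for f in REQUIRED_INPUT_FIELDS if f not in tensor]
        if missing:
            raise ValueError(f"input is missing required fields: {missing}")
        request["inputs"].append(dict(tensor))
    if request_id is not None:
        request["id"] = request_id
    if parameters is not None:
        request["parameters"] = dict(parameters)
    if output_names is not None:
        # Each requested output needs only a name; per-output parameters are optional.
        request["outputs"] = [{"name": name} for name in output_names]
    return request


example = build_inference_request(
    inputs=[{"name": "input0", "shape": [2, 2], "datatype": "UINT32", "data": [1, 2, 3, 4]}],
    output_names=["output0"],
    request_id="42",
)
body = json.dumps(example)  # JSON text to send as the HTTP POST body
```

Leaving `output_names` unset omits the "outputs" field entirely, which asks the server to return every output the model produces.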

Inference Response JSON Object

A successful inference request is indicated by a 200 HTTP status code. The inference response object, identified as $inference_response, is returned in the HTTP body.

$inference_response =
{
  "model_name" : $string,
  "model_version" : $string #optional,
  "id" : $string,
  "parameters" : $parameters #optional,
  "outputs" : [ $response_output, ... ]
}
  • "model_name" : The name of the model used for inference.
  • "model_version" : The specific model version used for inference. Inference servers that do not implement versioning should not provide this field in the response.
  • "id" : The "id" identifier given in the request, if any.
  • "parameters" : An object containing zero or more parameters for this response expressed as key/value pairs. See Parameters for more information.
  • "outputs" : The output tensors. Each output is described using the $response_output schema defined in Response Output.

Response Output

The $response_output JSON describes an output from the model. If the output is batched, the shape and data represent the full shape and contents of the entire batch.

$response_output =
{
  "name" : $string,
  "shape" : [ $number, ... ],
  "datatype"  : $string,
  "parameters" : $parameters #optional,
  "data" : $tensor_data
}
  • "name" : The name of the output tensor.
  • "shape" : The shape of the output tensor. Each dimension must be an integer representable as an unsigned 64-bit integer value.
  • "datatype" : The data-type of the output tensor elements as defined in Tensor Data Types.
  • "parameters" : An object containing zero or more parameters for this output expressed as key/value pairs. See Parameters for more information.
  • "data" : The contents of the tensor. See Tensor Data for more information.

Inference Response JSON Error Object

A failed inference request must be indicated by an HTTP error status (typically 400). The HTTP body must contain the $inference_error_response object.

$inference_error_response =
{
  "error": $string
}
  • "error" : The descriptive message for the error.

Parameters

The $parameters JSON describes zero or more "name"/"value" pairs, where the "name" is the name of the parameter and the "value" is a $string, $number, or $boolean.

$parameters =
{
  $parameter, ...
}

$parameter = $string : $string | $number | $boolean

Currently no parameters are defined. As required, a future proposal may define one or more standard parameters to allow portable functionality across different inference servers. A server can implement server-specific parameters to provide non-standard capabilities.
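A client or server may want to check a parameters object against the value types listed above before using it. The sketch below is one way to do that in Python. The function name `validate_parameters` is hypothetical, and the parameter names in the usage note are invented for illustration, since the document defines no standard parameters.

```python
from typing import Any, Dict


def validate_parameters(parameters: Dict[Any, Any]) -> None:
    """Raise TypeError unless every key is a string and every value is a
    JSON string, number, or boolean."""
    for name, value in parameters.items():
        if not isinstance(name, str):
            raise TypeError(f"parameter name must be a string: {name!r}")
        # bool is a subclass of int in Python, so it is covered by the int check;
        # None, lists, and nested objects are rejected.
        if not isinstance(value, (str, int, float)):
            raise TypeError(f"parameter {name!r} has unsupported value type")
```

For example, `validate_parameters({"priority": 1, "trace": True})` passes, while a nested object such as `{"options": {"a": 1}}` raises `TypeError`.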

Tensor Data

Tensor data must be presented in row-major order of the tensor elements. Element values must be given in "linear" order without any stride or padding between elements. Tensor elements may be presented in their natural multi-dimensional representation, or as a flattened one-dimensional representation.

Tensor data given explicitly is provided in a JSON array. Each element of the array may be an integer, floating-point number, string or boolean value. The server can decide to coerce each element to the required type or return an error if an unexpected value is received. Note that fp16 and bf16 are problematic to communicate explicitly since there is not a standard fp16/bf16 representation across backends nor typically the programmatic support to create the fp16/bf16 representation for a JSON number.

For example, the 2-dimensional matrix:

[ 1 2
  4 5 ]

Can be represented in its natural format as:

"data" : [ [ 1, 2 ], [ 4, 5 ] ]

Or in a flattened one-dimensional representation:

"data" : [ 1, 2, 4, 5 ]

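The equivalence of the two representations can be sketched in code. The helpers below flatten a nested "data" array in row-major order and compare the element count against a shape. The names `flatten_tensor` and `element_count_matches` are hypothetical.

```python
from math import prod
from typing import Any, List, Sequence


def flatten_tensor(data: Any) -> List[Any]:
    """Flatten nested lists depth-first, which yields row-major element order."""
    if not isinstance(data, list):
        return [data]
    flat: List[Any] = []
    for item in data:
        flat.extend(flatten_tensor(item))
    return flat


def element_count_matches(data: Any, shape: Sequence[int]) -> bool:
    """Return True if data holds exactly as many elements as the shape implies."""
    return len(flatten_tensor(data)) == prod(shape)
```

Both `[ [ 1, 2 ], [ 4, 5 ] ]` and `[ 1, 2, 4, 5 ]` flatten to the same four elements, so either form is consistent with a shape of [ 2, 2 ].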
Tensor Data Types

Tensor data types are shown in the following table along with the size of each type, in bytes.

| Data Type | Size (bytes) |
| --- | --- |
| BOOL | 1 |
| UINT8 | 1 |
| UINT16 | 2 |
| UINT32 | 4 |
| UINT64 | 8 |
| INT8 | 1 |
| INT16 | 2 |
| INT32 | 4 |
| INT64 | 8 |
| FP16 | 2 |
| FP32 | 4 |
| FP64 | 8 |
| BYTES | Variable (max 2^32) |
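The sizes in the table can be used to estimate how many bytes a fixed-size tensor occupies. The following is a minimal sketch. The names `DATATYPE_SIZES` and `tensor_byte_size` are hypothetical, and returning `None` for BYTES is a choice made here because that type has no fixed element size.

```python
from math import prod
from typing import Optional, Sequence

# Element sizes in bytes, copied from the table above (BYTES is variable-size).
DATATYPE_SIZES = {
    "BOOL": 1,
    "UINT8": 1, "UINT16": 2, "UINT32": 4, "UINT64": 8,
    "INT8": 1, "INT16": 2, "INT32": 4, "INT64": 8,
    "FP16": 2, "FP32": 4, "FP64": 8,
}


def tensor_byte_size(datatype: str, shape: Sequence[int]) -> Optional[int]:
    """Return the tensor size in bytes, or None for the variable-size BYTES type."""
    if datatype == "BYTES":
        return None
    if datatype not in DATATYPE_SIZES:
        raise ValueError(f"unknown datatype: {datatype!r}")
    # Total bytes = bytes per element * number of elements.
    return DATATYPE_SIZES[datatype] * prod(shape)
```

For instance, an FP32 tensor of shape [ 3, 2 ] has 6 elements of 4 bytes each, so `tensor_byte_size("FP32", [3, 2])` returns 24.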

Inference Request Examples

The following example shows an inference request to a model with two inputs and one output. The HTTP Content-Length header gives the size of the JSON object.

POST /v2/models/mymodel/infer HTTP/1.1
Host: localhost:8000
Content-Type: application/json
Content-Length: <xx>

{
  "id" : "42",
  "inputs" : [
    {
      "name" : "input0",
      "shape" : [ 2, 2 ],
      "datatype" : "UINT32",
      "data" : [ 1, 2, 3, 4 ]
    },
    {
      "name" : "input1",
      "shape" : [ 1 ],
      "datatype" : "BOOL",
      "data" : [ true ]
    }
  ],
  "outputs" : [
    {
      "name" : "output0"
    }
  ]
}

For the above request the inference server must return the "output0" output tensor. Assuming the model returns a [ 3, 2 ] tensor of data type FP32, the following response would be returned.

HTTP/1.1 200 OK
Content-Type: application/json
Content-Length: <yy>

{
  "model_name" : "mymodel",
  "id" : "42",
  "outputs" : [
    {
      "name" : "output0",
      "shape" : [ 3, 2 ],
      "datatype"  : "FP32",
      "data" : [ 1.0, 1.1, 2.0, 2.1, 3.0, 3.1 ]
    }
  ]
}
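On the client side, a flat "data" array from a response like the one above can be turned back into its multi-dimensional form using "shape". The sketch below assumes the server returned the data flattened in row-major order. The helper name `reshape_row_major` is hypothetical, and the body string simply mirrors the example output tensor.

```python
import json
from math import prod
from typing import Any, List, Sequence


def reshape_row_major(flat: Sequence[Any], shape: Sequence[int]) -> List[Any]:
    """Rebuild nested lists from a flat row-major sequence."""
    if prod(shape) != len(flat):
        raise ValueError(f"shape {list(shape)} does not match {len(flat)} elements")
    if len(shape) <= 1:
        return list(flat)
    # Each slice along the first dimension holds prod(shape[1:]) elements.
    step = prod(shape[1:])
    return [
        reshape_row_major(flat[i * step:(i + 1) * step], shape[1:])
        for i in range(shape[0])
    ]


body = (
    '{"id": "42", "outputs": [{"name": "output0", "shape": [3, 2], '
    '"datatype": "FP32", "data": [1.0, 1.1, 2.0, 2.1, 3.0, 3.1]}]}'
)
response = json.loads(body)
output = response["outputs"][0]
matrix = reshape_row_major(output["data"], output["shape"])
```

Here `matrix` holds three rows of two values each, matching the [ 3, 2 ] shape reported for "output0".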