
Prepack weights when model is loaded #214

Open
robertknight opened this issue May 27, 2024 · 0 comments
Labels
performance Issues that affect model inference or loading performance

Comments

robertknight (Owner) commented May 27, 2024

Other ML runtimes, such as ONNX Runtime and TensorFlow Lite, prepack weights for the selected matrix multiplication kernel when a model is loaded. This reduces inference latency when a model is run multiple times in a session, at the cost of a longer load time.

RTen implements weight prepacking to amortize packing overhead when MatMul or Conv operators are applied to a batch of inputs. However, it doesn't prepack weights when the model is loaded, so packing costs are incurred on every inference run.
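For illustration, here is a minimal sketch of what load-time weight prepacking looks like: a row-major weights matrix is reordered once into the column-panel layout a GEMM kernel reads contiguously, so each inference can skip the reorder. The names (`pack_b`, `NR`) and the panel width are assumptions for this sketch, not RTen's actual API or layout.

```rust
// Hypothetical panel width a GEMM microkernel might use (assumed, not RTen's).
const NR: usize = 4;

/// Pack a row-major K x N weights matrix into panels of NR columns,
/// stored panel-by-panel so the kernel's inner loop reads contiguously.
/// Short columns in the final panel are zero-padded.
fn pack_b(b: &[f32], k: usize, n: usize) -> Vec<f32> {
    let n_panels = (n + NR - 1) / NR;
    let mut packed = vec![0.0f32; n_panels * k * NR];
    for panel in 0..n_panels {
        for row in 0..k {
            for j in 0..NR {
                let col = panel * NR + j;
                if col < n {
                    packed[panel * k * NR + row * NR + j] = b[row * n + col];
                }
            }
        }
    }
    packed
}

fn main() {
    // A 2x5 weights matrix; packing happens once, at "model load" time.
    let b = [1., 2., 3., 4., 5., 6., 7., 8., 9., 10.];
    let packed = pack_b(&b, 2, 5);
    // Panel 0 holds columns 0..4 of both rows, row-major within the panel:
    println!("{:?}", &packed[..8]); // [1.0, 2.0, 3.0, 4.0, 6.0, 7.0, 8.0, 9.0]
}
```

The point of doing this at load time rather than per-operator-call is that the packed buffer can be cached alongside the model and reused by every subsequent MatMul/Conv invocation, trading a one-time load cost for lower steady-state latency.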

References:

@robertknight robertknight added the performance Issues that affect model inference or loading performance label May 27, 2024