A NN-Based Cost Model for VPU Devices. For additional information about model setup and training, please refer to this paper.
If you find this work useful, please cite the following paper:
@article{DBLP:journals/corr/abs-2205-04586,
doi = {10.48550/ARXIV.2205.04586},
url = {https://arxiv.org/abs/2205.04586},
author = {Hunter, Ian Frederick Vigogne Goodbody and Palla, Alessandro and Nagy, Sebastian Eusebiu and Richmond, Richard and McAdoo, Kyle},
title = {Towards Optimal VPU Compiler Cost Modeling by using Neural Networks to Infer Hardware Performances},
publisher = {arXiv},
year = {2022},
copyright = {arXiv.org perpetual, non-exclusive license}
}
GCC version should be > 9. You can check your GCC version by running gcc --version and g++ --version.
If you do not set the CC and CXX environment variables, which gcc and which g++ are used by default.
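For example, to build with a specific GCC installation you can export CC and CXX before configuring; the gcc-9/g++-9 binary names below are placeholders for whatever GCC >= 9 toolchain is installed on your system:
export CC=$(which gcc-9)
export CXX=$(which g++-9)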
Compile the library by typing cmake -H. -Bbuild && cmake --build build
@TODO: environment compatible with newer compiler versions (gcc >= 10, clang > 10)
Install the oneAPI Base Toolkit (instructions). oneAPI is massive, so feel free to install only the Math Kernel Library (oneMKL).
If you have trouble with a proxy, please export no_proxy=127.0.0.1 in order to bypass any existing no_proxy environment variable set for *.intel.com URLs.
To enable MKL you need to source /opt/intel/oneapi/setvars.sh to set the appropriate environment variables. Look here for how to get started with VSC.
You can select which BLAS library to use (assuming you have MKL installed) and the threading mode by using the following CMake variables (an example invocation follows the list):
- -DCBLAS_LIB=<value> (options: mkl for oneMKL and openblas for OpenBLAS)
- -DMKL_THREADING=<value> (options: tbb for oneAPI Threading Building Blocks and sequential for no threading)
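For example, a configure-and-build command that selects oneMKL with TBB threading (assuming oneAPI is installed and setvars.sh has been sourced) might look like this:
cmake -H. -Bbuild -DCBLAS_LIB=mkl -DMKL_THREADING=tbb && cmake --build build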
Using the VPUNN cost model in a CMake project is quite simple. An example of a CMakeLists.txt file is shown below:
include_directories(${CMAKE_BINARY_DIR}/include)
include_directories(${FLATBUFFERS_SRC_DIR}/include)
...
target_link_libraries(<your exe or lib> inference)
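A minimal CMakeLists.txt built around that snippet might look like the sketch below. The project name, the add_subdirectory path and the executable name are placeholders; only the include directories and the inference target come from the snippet above, and the exact way you add the VPUNN sources to your build may differ.
cmake_minimum_required(VERSION 3.14)
project(vpunn_example CXX)

# Placeholder path: point this at your checkout of the cost model sources
add_subdirectory(third_party/vpunn)

include_directories(${CMAKE_BINARY_DIR}/include)
include_directories(${FLATBUFFERS_SRC_DIR}/include)

add_executable(cost_query main.cpp)
# Link against the VPUNN inference library
target_link_libraries(cost_query inference)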
The following example code shows how to instantiate the cost model and how to run a simple query for a 3x3s1 convolution:
#include "vpu_cost_model.h"
auto model = VPUNN::VPUCostModel(model_path);
auto dpu_cycles = model.DPU({VPUNN::VPUDevice::VPU_2_7,
VPUNN::Operation::CONVOLUTION,
{VPUNN::VPUTensor(56, 56, 16, 1, VPUNN::DataType::UINT8)}, // input dimensions
{VPUNN::VPUTensor(56, 56, 16, 1, VPUNN::DataType::UINT8)}, // output dimensions
{3, 3}, //kernels
{1, 1}, //strides
{1, 1}, //padding
VPUNN::ExecutionMode::CUBOID_16x16} // execution mode
);
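The same query wrapped into a minimal standalone program is sketched below; the model path is a placeholder, point it at the .vpunn model file you want to load:
#include <iostream>
#include <string>
#include "vpu_cost_model.h"

int main() {
    // Placeholder path: replace with the actual .vpunn model file
    const std::string model_path = "vpu_2_7.vpunn";
    auto model = VPUNN::VPUCostModel(model_path);

    auto dpu_cycles = model.DPU({VPUNN::VPUDevice::VPU_2_7,
                                 VPUNN::Operation::CONVOLUTION,
                                 {VPUNN::VPUTensor(56, 56, 16, 1, VPUNN::DataType::UINT8)}, // input dimensions
                                 {VPUNN::VPUTensor(56, 56, 16, 1, VPUNN::DataType::UINT8)}, // output dimensions
                                 {3, 3},  // kernels
                                 {1, 1},  // strides
                                 {1, 1},  // padding
                                 VPUNN::ExecutionMode::CUBOID_16x16}); // execution mode

    std::cout << "Estimated DPU cycles: " << dpu_cycles << std::endl;
    return 0;
}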
The example folder contains a few examples of how to build and use the cost model in a C++ project. The following list of supported examples is a work in progress:
workload_mode_selection:
- Selecting the optimal MPE mode for a VPU_2_0 workload
- Choosing the optimal workload split strategy among multiple ones
You can install the library by typing pip install .
Do this in a Python virtual environment.
Run the vpu_cost_model script to evaluate workloads from the command line:
usage: vpu_cost_model [-h] --model MODEL [-t {cycles,power,utilization}] {VPU_2_7,VPU_4_0} ...
VPU cost model
positional arguments:
{VPU_2_7,VPU_4_0}
options:
-h, --help show this help message and exit
--model MODEL, -m MODEL
Model path
There are two possible VPU versions; each version has a DPU and a DMA model. It is possible to bring up the help menu in the following ways:
vpu_cost_model VPU_2_7 DPU -h
vpu_cost_model VPU_2_7 DMA -h
vpu_cost_model VPU_4_0 DPU -h
vpu_cost_model VPU_4_0 DMA -h
minimal example usage:
vpu_cost_model VPU_2_7 DPU -o CONVOLUTION --inch 64 --outch 64 --height 16 --width 16 --kh 3 --kw 3 --indt UINT8 --outdt UINT8 --mpe-mode CUBOID_16x16
vpu_cost_model VPU_2_7 DMA -l 1024 --sw 1024 --dw 1024 -d DDR2CMX
vpu_cost_model VPU_4_0 DPU -o CONVOLUTION --inch 64 --outch 64 --height 16 --width 16 --kh 3 --kw 3 --indt UINT8 --outdt UINT8 --mpe-mode CUBOID_16x16
vpu_cost_model VPU_4_0 DMA -l 1024 --sw 1024 --dw 1024 -d DDR2CMX
Generate a VPUNN model from a TensorFlow one:
optional arguments:
-h, --help show this help message and exit
--name NAME Model name
--output OUTPUT Output model (default model.vpunn)
Convert a VPUNN model into JSON for debugging purposes:
usage: vpunn_to_json [-h] file
positional arguments:
file graphFile to deserialize to json OR an already deserialized json
optional arguments:
-h, --help show this help message and exit
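For example, assuming model.vpunn is a serialized model file in the current directory:
vpunn_to_json model.vpunn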
To compile the Web Assembly (WASM) version of the library, follow the steps below:
- Install Emscripten (link here)
- Configure Emscripten with cmake by typing
emmake cmake ..
- Build the Javascript interface
emmake make vpunn_js -j
The build command produces an npm package that can later be installed in any JS project by running npm install <path to build folder>/dist/vpunn-*.tgz
All developers should install the git hooks that are tracked in the .githooks directory. We use the pre-commit framework for hook management. The recommended way of installing it is using pip:
pip install pre-commit
The hooks can then be installed into your local clone using:
pre-commit install --allow-missing-config
--allow-missing-config is an optional argument that allows users to have the hooks installed and functional even when using an older branch that does not track them. A warning will be displayed in such cases when the hooks are run.
If you want to manually run all pre-commit hooks on a repository, run pre-commit run --all-files. To run individual hooks, use pre-commit run <hook_id>.
Uninstalling the hooks can be done using
pre-commit uninstall
Tests use the Google Test suite for test automation.
To run the test suite: ctest --test-dir build/tests/cpp/
Example: running only the cost model integration test: ./tests/cpp/test_cost_model
To run the Python end-to-end tests: pytest tests/python/test_e2e.py -v
Assuming you built the VPUNN WASM library in build_wasm, install VPUNN locally with all its dependencies:
npm install --prefix tests/js
npm install --save-optional build_wasm/dist/vpunn-*.tgz --prefix tests/js
Start testing by running
npm run test --prefix=tests/js
To generate a code coverage report you need to enable it in CMake:
cmake -DCMAKE_BUILD_TYPE=Coverage .. && make coverage -j
These commands generate a coverage folder inside the build folder with all the coverage information.
Dependencies:
- Gcov-9 and Gcovr tools are needed in order to generate the report
- Only GCC is supported (no WASM/Visual Studio)
Not Available
- ISI=CLUSTERING + OWT=2: replaced at runtime with SOK. The runtime should be the same; no input halo is used.
- Elementwise + ISI=SOK: replaced at runtime with clustering + OWT=1. The time is slightly undervalued, but it is the best approximation available.
- CM_CONV (compress convolution) + InputChannels=1.
- SOH (HALO) split with kernel=1 has probably not been part of training; it does not make sense to have kernel=1 with an input halo, and the NN predictions are problematic. Replaced at runtime with Clustering.
- SOH Halo split, at least when H is small and K is small, produces much bigger results than SOH Overlapped. This is not realistic and might be a NN limitation. See VPULayerCostModelTest.Unet_perf_SOH_SOK_after_SOHO.
- Output write tiles is limited to 2, e.g. also when used as a mock for NPU4.0, where more than 2 tiles are present and used for splits.
- NPU2.7 splits by H with halo were trained to the NN using the memory tensor instead of the general rule for the compute tensor (the memory tensor is in general smaller by half a kernel). Calling the NN with the compute tensor introduces errors by reporting smaller values. To get corrected values (closer to ground truth) when generating the descriptor for NNs with interface 11 and the SOH ISI strategy, we use not the input tensor but a computed memory input tensor that mimics the one used at training.
Reusing: when using the 2.7 trained version as a mock, please read the NPU2.7 section above.
- DW_CONV (depthwise convolution) with kernel 3x3 is optimized in NPU4.0, but not in NPU2.7. The NN-reported runtime is adjusted with a factor depending on datatype, channels and kernel size.
Trained NN for 4.0:
- WIP
- NPU2.7: the NN was not trained to discriminate the sporadic high runtime for swizzling. EISXW-98656 is not solved (elementwise add with a big profiled CLUSTERING time but a small SOH one). Test: RuntimeELT_CONV_SOH_SOK_EISXW_98656. Elementwise accepts (at NN run) swizzling ON or OFF, but it has to be the same for all in/out/wts; combinations other than all 0 (OFF) or all 5 (ON) are not trained. To consider: training the NN with swizzling combinations (profiling shows the runtime is different).
The SHAVE version 1 interface (the old one) will be deleted in the near future; do not use it. The SHAVE v2 interface is active.
Details of any operator can be obtained by calling the ShaveOpExecutor::toString() method.
For the most up-to-date list of operators and their details, see also the unit tests: TestSHAVE.SHAVE_v2_ListOfOperators, TestSHAVE.SHAVE_v2_ListOfOperatorsDetails_27, ... .
For information about the profiled operators and extra parameters you can consult this document.