The programming examples are a set of sample designs that further illustrate many of the unique features of AI Engines and the NPU array in Ryzen™ AI.
The passthrough example is the simplest "getting started" example. It copies 4096 bytes from the input to the output using vectorized loads and stores. The design shows a typical project organization that is easy to reproduce in other examples. There are really only four important files here:
- `aie2.py`: The AIE structural design, which includes the shim tile connected to the external memory and a single AIE core that performs the copy. It also shows a simple use of the ObjectFIFOs described in section 2.
- `passthrough.cc`: A C++ kernel that performs the vectorized copy operation (a sketch of one possible kernel follows this list).
- `test.cpp` or `test.py`: A C++ or Python main application for exercising the design and comparing against a CPU reference.
- `Makefile`: A Makefile documenting (and implementing) the build process for the various artifacts.
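To make the kernel side concrete, here is a minimal sketch of what a vectorized copy can look like with the AIE vector API. The function name, signature, and vector width are illustrative assumptions, not the actual contents of `passthrough.cc`.

```cpp
// Minimal sketch of a vectorized byte copy (illustrative, not the actual passthrough.cc).
// Assumes the AIE API headers shipped with the AMD AIE toolchain; the vector width of
// 64 bytes (one 512-bit vector per iteration) is an assumed choice.
#include <aie_api/aie.hpp>
#include <cstdint>

extern "C" void passthrough_sketch(const uint8_t *in, uint8_t *out, int32_t nbytes) {
  constexpr unsigned vec = 64; // bytes copied per loop iteration
  for (int32_t i = 0; i < nbytes; i += vec) {
    aie::vector<uint8_t, vec> v = aie::load_v<vec>(in + i); // vector load from the input buffer
    aie::store_v(out + i, v);                               // vector store to the output buffer
  }
}
```

For the 4096-byte copy in this example, such a loop would run 64 times; the real kernel may choose a different vector width or add loop pragmas for pipelining.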
The passthrough DMAs example shows an alternate way of performing the copy without involving the cores, performing a loopback instead.
The basic examples below each exercise a single compute or data-movement pattern:

| Design name | Data type | Description |
|---|---|---|
| Vector Scalar Add | i32 | Adds 1 to every element in a vector |
| Vector Scalar Mul | i32 | Returns a vector multiplied by a scale factor |
| Vector Vector Add | i32 | Returns a vector summed with another vector |
| Vector Vector Modulo | i32 | Returns vector % vector |
| Vector Vector Multiply | i32 | Returns a vector multiplied by a vector |
| Vector Reduce Add | bfloat16 | Returns the sum of all elements in a vector |
| Vector Reduce Max | bfloat16 | Returns the maximum of all elements in a vector |
| Vector Reduce Min | bfloat16 | Returns the minimum of all elements in a vector |
| Vector Exp | bfloat16 | Returns a vector representing e^x of the inputs |
| DMA Transpose | i32 | Transposes a matrix with the Shim DMA using `npu_dma_memcpy_nd` |
| Matrix Scalar Add | i32 | Adds a scalar to every element in a matrix |
| Single core GEMM | bfloat16 | A single-core matrix-matrix multiply |
| Multi core GEMM | bfloat16 | A matrix-matrix multiply using 16 AIEs with operand broadcast. Uses a simple "accumulate in place" strategy |
| GEMV | bfloat16 | A vector-matrix multiply returning a vector |
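To make the semantics in the table concrete, the scalar C++ sketches below show what the CPU reference for two of these designs could look like in a host testbench; the function names are hypothetical and `float` stands in for `bfloat16` for readability.

```cpp
// Plain C++ reference sketches for two of the basic designs above.
// Names and types are illustrative; the real testbenches in the repository may differ.
#include <algorithm>
#include <cstddef>
#include <cstdint>
#include <vector>

// Vector Scalar Add: adds 1 to every element of the input vector.
std::vector<int32_t> ref_vector_scalar_add(const std::vector<int32_t> &in) {
  std::vector<int32_t> out(in.size());
  for (std::size_t i = 0; i < in.size(); ++i)
    out[i] = in[i] + 1;
  return out;
}

// Vector Reduce Max: returns the maximum element (float stands in for bfloat16).
float ref_vector_reduce_max(const std::vector<float> &in) {
  float m = in.empty() ? 0.0f : in[0];
  for (float x : in)
    m = std::max(m, x);
  return m;
}
```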
The machine learning kernel examples below implement common neural-network operations:

| Design name | Data type | Description |
|---|---|---|
| Eltwise Add | bfloat16 | An element-by-element addition of two vectors |
| Eltwise Mul | i32 | An element-by-element multiplication of two vectors |
| ReLU | bfloat16 | Rectified linear unit (ReLU) activation function on a vector |
| Softmax | bfloat16 | Softmax operation on a matrix |
| Conv2D | i8 | A single-core 2D convolution for CNNs |
| Conv2D+ReLU | i8 | A Conv2D with a ReLU fused at the vector register level |
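Similarly, the following scalar sketches describe the reference math for ReLU and for one row of Softmax; again the names are hypothetical and `float` stands in for `bfloat16`.

```cpp
// Scalar C++ reference sketches for two of the ML kernels above
// (illustrative names; float is used in place of bfloat16).
#include <cmath>
#include <cstddef>
#include <vector>

// ReLU: clamp negative values to zero, element by element.
std::vector<float> ref_relu(const std::vector<float> &in) {
  std::vector<float> out(in.size());
  for (std::size_t i = 0; i < in.size(); ++i)
    out[i] = in[i] > 0.0f ? in[i] : 0.0f;
  return out;
}

// Softmax over one row: exponentiate (shifted by the row maximum for numerical
// stability), then normalize by the sum of the exponentials.
std::vector<float> ref_softmax_row(const std::vector<float> &row) {
  if (row.empty())
    return {};
  float m = row[0];
  for (float x : row)
    m = x > m ? x : m;
  std::vector<float> out(row.size());
  float sum = 0.0f;
  for (std::size_t i = 0; i < row.size(); ++i) {
    out[i] = std::exp(row[i] - m);
    sum += out[i];
  }
  for (float &x : out)
    x /= sum;
  return out;
}
```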
- Can you modify the passthrough design to copy more (or less) data?
- Take a look at the testbench in our Vector Exp example, `test.cpp`. Take note of the data type and the size of the test vector. What do you notice?
- What is the communication-to-computation ratio in ReLU?
- **HARD** Which basic example is a component in Softmax?