Merge branch 'main' into tiler-helper
hunhoffe authored Oct 30, 2024
2 parents 00e632e + d3da586 commit 108fa81
Showing 23 changed files with 188 additions and 77 deletions.
6 changes: 6 additions & 0 deletions .github/workflows/buildAndTestRyzenAI.yml
@@ -139,6 +139,12 @@ jobs:
source utils/quick_setup.sh
# quick_setup changes directory to programming_examples, so we need to return to mlir-aie
cd ..
# I have no clue why, but the system clock on GHA containers is about 12 hours ahead.
# That means the wheels contain files with timestamps in the future, which makes ninja loop
# forever when configuring. Set the time to some arbitrary stamp in the past just to be safe.
find my_install/mlir -exec touch -a -m -t 201108231405.14 {} \;
./utils/build-mlir-aie-from-wheels.sh ./my_install/mlir build install ./my_install/llvm-aie
# build is created by the build-mlir-aie-from-wheels.sh script
24 changes: 12 additions & 12 deletions docs/conferenceDescriptions/micro24TutorialDescription.md
@@ -23,20 +23,20 @@ Prerequisite: Please bring your laptop so that you can SSH into our Ryzen™ AI-
|------|-------|-----------|----------------|
| 08:00am | Intro to spatial compute and explicit data movement | Kristof | [Programming Guide](../../programming_guide/) |
| 08:15am | "Hello World" from Ryzen™ AI | Joe | [AI Engine Basic Building Blocks](../../programming_guide/section-1/) |
| 08:30am | Data movement on Ryzen™ AI with objectFIFOs | Joe | [Data Movement](../../programming_guide/section-2/) |
| 09:00am | Your First Program | Kristof | [My First Program](../../programming_guide/section-3) |
| 09:20am | Exercise 1: Build and run your first program | All | [Passthrough](../../programming_examples/basic/passthrough_kernel/) |
| 09:30am | Break | | |
| 10:00am | Exercise 2: Vector-Scalar Mul | All | [Vector Scalar Mul](../../programming_examples/basic/vector_scalar_mul/) |
| 10:10am | Tracing and performance analysis | Kristof | [Timers](../../programming_guide/section-4/section-4a/) and [Tracing](../../programming_guide/section-4/section-4b/) |
| 10:40am | Exercise 3: Tracing vector-scalar | All | [Vector Scalar Mul](../../programming_examples/basic/vector_scalar_mul/) |
| 10:50am | Vectorizing on AIE | Kristof | [Kernel Vectorization](../../programming_guide/section-4/section-4c/) |
| 11:10am | Exercise 4: Vectorized vector-scalar | All | [Vector Scalar Mul](../../programming_examples/basic/vector_scalar_mul/) |
| 11:20pm | Dataflow and larger designs | Joe | [Example Vector Designs](../../programming_guide/section-5/) and [Large Example Designs](../../programming_guide/section-6/) |
| 11:30pm | Exercises | All | [Programming Examples](../../programming_examples/) |
| 08:35am | Exercise 1: Build and run your first program | All | [Passthrough](../../programming_examples/basic/passthrough_kernel/) |
| 08:50am | Data movement on Ryzen™ AI with objectFIFOs | Joe | [Data Movement](../../programming_guide/section-2/) |
| 09:10am | Exercise 2: Explore AIE DMA capabilities | All | [DMA Transpose](../../programming_examples/basic/dma_transpose/) |
| 09:20am | Your First Program | Kristof | [My First Program](../../programming_guide/section-3) |
| 09:50am | Exercise 3: Vector-scalar mul | All | [Vector Scalar Mul](../../programming_examples/basic/vector_scalar_mul/) |
| 10:00am | Coffee Break | | |
| 10:30am | Tracing and performance analysis | Kristof | [Timers](../../programming_guide/section-4/section-4a/) and [Tracing](../../programming_guide/section-4/section-4b/) |
| 10:50am | Exercise 4: Tracing vector-scalar mul | All | [Vector Scalar Mul](../../programming_examples/basic/vector_scalar_mul/) |
| 11:00am | Vectorizing on AIE | Kristof | [Kernel Vectorization](../../programming_guide/section-4/section-4c/) |
| 11:20am | Exercise 5: Tracing vectorized vector-scalar | All | [Vector Scalar Mul](../../programming_examples/basic/vector_scalar_mul/) |
| 11:30am | Dataflow and larger designs | Joe | [Example Vector Designs](../../programming_guide/section-5/) and [Large Example Designs](../../programming_guide/section-6/) |
| 11:40am | Exercise 6: More examples | All | [Programming Examples](../../programming_examples/) |
| 11:50am | Close Tutorial | All | |


## Organizers

*Joseph Melber* is a Senior Member of Technical Staff in AMD’s Research and Advanced Development group. At AMD, he is working on hardware architectures and compiler technologies for current and future AMD devices. He received a BS in electrical engineering from the University at Buffalo, as well as MS and PhD degrees from the electrical and computer engineering department at Carnegie Mellon University. His research interests include runtime systems, compiler abstractions for data movement, and hardware prototypes for future adaptive heterogeneous computing architectures.
4 changes: 2 additions & 2 deletions programming_examples/basic/dma_transpose/aie2.py
@@ -21,8 +21,8 @@ def my_passthrough(M, K, N, generate_acccess_map=False):
data_transform = TensorTile(
tensor_height=M,
tensor_width=K,
sizes=[1, K, M, 1],
strides=[1, 1, K, 1],
sizes=[1, 1, K, M],
strides=[1, 1, 1, K],
offset=0,
)
if generate_acccess_map:
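As an aside on what the new `sizes`/`strides` values mean: they describe a column-by-column walk over the row-major `M`×`K` input, which is what yields the transpose. The following is a small illustrative sketch (a hypothetical helper, not part of this change), assuming the leftmost entry is the outermost, slowest-varying dimension:

```python
# Hypothetical sketch: enumerate the linear offsets visited by a
# 4-dimensional sizes/strides access pattern (outermost dimension first).
def access_order(sizes, strides, offset=0):
    order = []
    for i0 in range(sizes[0]):
        for i1 in range(sizes[1]):
            for i2 in range(sizes[2]):
                for i3 in range(sizes[3]):
                    order.append(offset + i0 * strides[0] + i1 * strides[1]
                                 + i2 * strides[2] + i3 * strides[3])
    return order

M, K = 2, 3  # tiny row-major example tensor: offsets [[0, 1, 2], [3, 4, 5]]
# New pattern from this change: sizes=[1, 1, K, M], strides=[1, 1, 1, K]
# walks down each column in turn, i.e. reads the input in transposed order.
print(access_order([1, 1, K, M], [1, 1, 1, K]))  # [0, 3, 1, 4, 2, 5]
```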
10 changes: 5 additions & 5 deletions programming_examples/basic/dma_transpose/test.cpp
@@ -186,11 +186,11 @@ int main(int argc, const char *argv[]) {
std::vector<uint32_t> refVecA(N);

// Doing a transpose on the source vector to produce a ref vector
for (uint32_t i = 0; i < M; i++) {
for (uint32_t j = 0; j < K; j++) {
uint32_t src_index = i * K + j;
uint32_t dst_index = j * M + i;
refVecA[dst_index] = srcVecA[src_index];
uint32_t dst_index = 0;
for (uint32_t i = 0; i < K; i++) {
for (uint32_t j = 0; j < M; j++) {
uint32_t src_index = j * K + i;
refVecA[dst_index++] = srcVecA[src_index];
}
}

27 changes: 21 additions & 6 deletions programming_examples/basic/vector_scalar_mul/Makefile
@@ -17,35 +17,50 @@ VPATH := ${srcdir}/../../../aie_kernels/aie2
targetname = vectorScalar
data_size = 4096
trace_size = 8192
CHESS ?= true

all: build/final_${data_size}.xclbin build/insts_${data_size}.txt

kristof: build/insts_${data_size}.txt

build/%.o: %.cc
mkdir -p ${@D}
cd ${@D} && ${PEANO_INSTALL_DIR}/bin/clang++ ${PEANOWRAP2_FLAGS} -c $< -o ${@F}
ifeq ($(CHESS), true)
cd ${@D} && xchesscc_wrapper ${CHESSCCWRAP2_FLAGS} -c $< -o ${@F};
else
cd ${@D} && ${PEANO_INSTALL_DIR}/bin/clang++ ${PEANOWRAP2_FLAGS} -c $< -o ${@F};
endif

build/aie_${data_size}.mlir: ${srcdir}/aie2.py
mkdir -p ${@D}
python3 $< ${data_size} 0 > $@

build/aie_trace_${data_size}.mlir: aie2.py
build/aie_trace_${data_size}.mlir: ${srcdir}/aie2.py
mkdir -p ${@D}
python3 $< ${data_size} ${trace_size} > $@

#build/insts_${data_size}.txt: build/final_${data_size}.xclbin
build/final_${data_size}.xclbin: build/aie_${data_size}.mlir build/scale.o
mkdir -p ${@D}
ifeq ($(CHESS), true)
cd ${@D} && aiecc.py --aie-generate-cdo --no-compile-host --xclbin-name=${@F} \
--no-xchesscc --no-xbridge \
--aie-generate-npu --npu-insts-name=insts_${data_size}.txt $(<:%=../%)
else
cd ${@D} && aiecc.py --aie-generate-cdo --no-compile-host --xclbin-name=${@F} \
--no-xchesscc --no-xbridge \
--aie-generate-npu --npu-insts-name=insts_${data_size}.txt $(<:%=../%)
endif

build/final_trace_${data_size}.xclbin: build/aie_trace_${data_size}.mlir build/scale.o
mkdir -p ${@D}
ifeq ($(CHESS), true)
cd ${@D} && aiecc.py --aie-generate-cdo --no-compile-host --xclbin-name=${@F} \
--no-xchesscc --no-xbridge \
--aie-generate-npu --npu-insts-name=insts_${data_size}.txt $(<:%=../%)
else
cd ${@D} && aiecc.py --aie-generate-cdo --no-compile-host --xclbin-name=${@F} \
--no-xchesscc --no-xbridge \
--aie-generate-npu --npu-insts-name=insts_${data_size}.txt $(<:%=../%)
endif

${targetname}_${data_size}.exe: ${srcdir}/test.cpp
rm -rf _build
@@ -66,11 +81,11 @@ run_py: build/final_${data_size}.xclbin build/insts_${data_size}.txt

trace: ${targetname}_${data_size}.exe build/final_trace_${data_size}.xclbin build/insts_${data_size}.txt
${powershell} ./$< -x build/final_trace_${data_size}.xclbin -i build/insts_${data_size}.txt -k MLIR_AIE -t ${trace_size}
../../utils/parse_trace.py --filename trace.txt --mlir build/aie_trace_${data_size}.mlir --colshift 1 > trace_vs.json
${srcdir}/../../utils/parse_trace.py --filename trace.txt --mlir build/aie_trace_${data_size}.mlir --colshift 1 > trace_vs.json

trace_py: build/final_trace_${data_size}.xclbin build/insts_${data_size}.txt
${powershell} python3 ${srcdir}/test.py -x build/final_trace_${data_size}.xclbin -i build/insts_${data_size}.txt -k MLIR_AIE -t ${trace_size} -s ${data_size}
../../utils/parse_trace.py --filename trace.txt --mlir build/aie_trace_${data_size}.mlir --colshift 1 > trace_vs.json
${srcdir}/../../utils/parse_trace.py --filename trace.txt --mlir build/aie_trace_${data_size}.mlir --colshift 1 > trace_vs.json


clean_trace:
@@ -3,8 +3,13 @@
//
// REQUIRES: ryzen_ai, peano
//
// RUN: mkdir -p test_peano
// RUN: cd test_peano
// RUN: make -f %S/Makefile clean
// RUN: make -f %S/Makefile
// RUN: env CHESS=false make -f %S/Makefile
// RUN: %run_on_npu make -f %S/Makefile run | FileCheck %s
// RUN: %run_on_npu make -f %S/Makefile run_py | FileCheck %s
// RUN: make -f %S/Makefile clean
// RUN: env CHESS=false %run_on_npu make -f %S/Makefile trace | FileCheck %s
// RUN: env CHESS=false %run_on_npu make -f %S/Makefile trace_py | FileCheck %s
// CHECK: PASS!
@@ -0,0 +1,15 @@
// (c) Copyright 2024 Advanced Micro Devices, Inc.
// SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
//
// REQUIRES: ryzen_ai, chess
//
// RUN: mkdir -p test_chess
// RUN: cd test_chess
// RUN: make -f %S/Makefile clean
// RUN: env CHESS=true make -f %S/Makefile
// RUN: %run_on_npu make -f %S/Makefile run | FileCheck %s
// RUN: %run_on_npu make -f %S/Makefile run_py | FileCheck %s
// RUN: make -f %S/Makefile clean
// RUN: env CHESS=true %run_on_npu make -f %S/Makefile trace | FileCheck %s
// RUN: env CHESS=true %run_on_npu make -f %S/Makefile trace_py | FileCheck %s
// CHECK: PASS!
2 changes: 1 addition & 1 deletion programming_examples/ml/bottleneck/Makefile
@@ -46,5 +46,5 @@ clean:
chess* *.o insts.txt \
*.log aie_partition.json *.bin BOOT.BIN _x test.exe

run_py:
run_py: build/final.xclbin
${powershell} python3 ${srcdir}/test.py -x build/final.xclbin -i build/insts.txt -k MLIR_AIE
2 changes: 1 addition & 1 deletion programming_examples/ml/conv2d/Makefile
@@ -29,7 +29,7 @@ build/final.xclbin: build/${mlirFileName}.mlir build/conv2dk1_i8.o
--no-xchesscc --no-xbridge \
--xclbin-name=${@F} --npu-insts-name=insts.txt $(<:%=../%)

run_py: build/final.xclbin build/insts.txt
run_py: build/final.xclbin
${powershell} python3 ${srcdir}/test.py -x build/final.xclbin -i build/insts.txt -k MLIR_AIE

clean:
2 changes: 1 addition & 1 deletion programming_examples/ml/conv2d_fused_relu/Makefile
@@ -38,5 +38,5 @@ clean:
chess* *.o insts.txt \
*.log aie_partition.json *.bin BOOT.BIN _x test.exe

run_py:
run_py: build/final.xclbin
${powershell} python3 ${srcdir}/test.py -x build/final.xclbin -i build/insts.txt -k MLIR_AIE
2 changes: 1 addition & 1 deletion programming_examples/vision/color_detect/Makefile
@@ -57,7 +57,7 @@ else
cp _build/${targetname} $@
endif

run: ${targetname}.exe build/final_${COLORDETECT_WIDTH}.xclbin build/insts.txt
run: ${targetname}.exe build/final_${COLORDETECT_WIDTH}.xclbin
${powershell} ./$< -x build/final_${COLORDETECT_WIDTH}.xclbin -i build/insts.txt -k MLIR_AIE

clean:
2 changes: 1 addition & 1 deletion programming_examples/vision/color_threshold/Makefile
@@ -49,7 +49,7 @@ else
cp _build/${targetname} $@
endif

run: ${targetname}.exe build/final_${COLORTHRESHOLD_WIDTH}.xclbin build/insts.txt
run: ${targetname}.exe build/final_${COLORTHRESHOLD_WIDTH}.xclbin
${powershell} ./$< -x build/final_${COLORTHRESHOLD_WIDTH}.xclbin -i build/insts.txt -k MLIR_AIE

clean:
2 changes: 1 addition & 1 deletion programming_examples/vision/edge_detect/Makefile
@@ -56,7 +56,7 @@ else
cp _build/${targetname} $@
endif

run: ${targetname}.exe build/final_${EDGEDETECT_WIDTH}.xclbin build/insts.txt
run: ${targetname}.exe build/final_${EDGEDETECT_WIDTH}.xclbin
${powershell} ./$< -x build/final_${EDGEDETECT_WIDTH}.xclbin -i build/insts.txt -k MLIR_AIE

clean:
2 changes: 1 addition & 1 deletion programming_examples/vision/vision_passthrough/Makefile
@@ -49,7 +49,7 @@ else
cp _build/${targetname} $@
endif

run: ${targetname}.exe build/final_${PASSTHROUGH_WIDTH}.xclbin build/insts.txt
run: ${targetname}.exe build/final_${PASSTHROUGH_WIDTH}.xclbin
${powershell} ./$< -x build/final_${PASSTHROUGH_WIDTH}.xclbin -i build/insts.txt -k MLIR_AIE

clean:
6 changes: 3 additions & 3 deletions programming_guide/section-3/README.md
@@ -46,8 +46,8 @@ We also need to declare that the compute core will run an external function: a k

```python
# Type declarations
tensor_ty = np.ndarray[(4096,), np.dtype[np.int16]]
tile_ty = np.ndarray[(1024,), np.dtype[np.int16]]
tensor_ty = np.ndarray[(4096,), np.dtype[np.int32]]
tile_ty = np.ndarray[(1024,), np.dtype[np.int32]]
scalar_ty = np.ndarray[(1,), np.dtype[np.int32]]

# AIE Core Function declarations
@@ -105,7 +105,7 @@ This access and execute pattern runs on the AIE compute core `ComputeTile2` and

## Kernel Code

We can program the AIE compute core using C++ code and compile it with `xchesscc` into a kernel object file. For our local version of vector scalar multiply, we will use a generic implementation of the `scale.cc` source (called [vector_scalar_mul.cc](./vector_scalar_mul.cc)) that can run on the scalar processor part of the AIE. The `vector_scalar_mul_aie_scalar` function processes one data element at a time, taking advantage of AIE scalar datapath to load, multiply and store data elements.
We can program the AIE compute core using C++ code and compile it with the selected single-core AIE compiler into a kernel object file. For our local version of vector scalar multiply, we will use a generic implementation of the `scale.cc` source (called [vector_scalar_mul.cc](./vector_scalar_mul.cc)) that can run on the scalar processor part of the AIE. The `vector_scalar_mul_aie_scalar` function processes one data element at a time, taking advantage of the AIE scalar datapath to load, multiply and store data elements.

```c
void vector_scalar_mul_aie_scalar(int32_t *a_in, int32_t *c_out,
19 changes: 14 additions & 5 deletions programming_guide/section-4/section-4b/Makefile
@@ -10,11 +10,12 @@ srcdir := $(shell dirname $(realpath $(firstword $(MAKEFILE_LIST))))

include ${srcdir}/../../../programming_examples/makefile-common


all: build/final.xclbin

targetname = myFirstProgram

trace_size = 8192
CHESS ?= true

build/aie.mlir: ${srcdir}/aie2.py
mkdir -p ${@D}
@@ -26,18 +27,26 @@ build/aie_trace.mlir: ${srcdir}/aie2.py

build/scale.o: ${srcdir}/vector_scalar_mul.cc
mkdir -p ${@D}
cd ${@D} && ${PEANO_INSTALL_DIR}/bin/clang++ ${PEANOWRAP2_FLAGS} -c $< -o ${@F}
ifeq ($(CHESS), true)
cd ${@D} && xchesscc_wrapper ${CHESSCCWRAP2_FLAGS} -c $< -o ${@F};
else
cd ${@D} && ${PEANO_INSTALL_DIR}/bin/clang++ ${PEANOWRAP2_FLAGS} -c $< -o ${@F};
endif

build/final.xclbin: build/aie.mlir build/scale.o
mkdir -p ${@D}
cd ${@D} && aiecc.py --aie-generate-cdo --no-compile-host --xclbin-name=${@F} \
--no-xchesscc --no-xbridge \
$(if $(shell [ $(CHESS) != true ] && echo true), \
--no-xchesscc --no-xbridge \
) \
--aie-generate-npu --npu-insts-name=insts.txt $(<:%=../%)

build/trace.xclbin: build/aie_trace.mlir build/scale.o
mkdir -p ${@D}
cd ${@D} && aiecc.py --aie-generate-cdo --no-compile-host --xclbin-name=${@F} \
--no-xchesscc --no-xbridge \
cd ${@D} && aiecc.py -v --aie-generate-cdo --no-compile-host --xclbin-name=${@F} \
$(if $(shell [ $(CHESS) != true ] && echo true), \
--no-xchesscc --no-xbridge \
) \
--aie-generate-npu --npu-insts-name=insts.txt $(<:%=../%)

${targetname}.exe: ${srcdir}/test.cpp
30 changes: 21 additions & 9 deletions programming_guide/section-4/section-4b/README.md
@@ -34,22 +34,34 @@ Enabling trace support can be done with the following steps:
Enabling tracing means (1a) configuring the trace units for a given tile and then (1b) routing the generated event packets through the stream switches to the shim DMA, where we can write them to a buffer in DDR for post-runtime processing.

### <u>(1a) Configure trace units for an AIE tile</u>
The first necessary component for trace configuration is setting the right values for the trace control registers for each tile that we want to enable tracing for. In addition, the generated trace packets will need to be routed to shimDMA and then written to one of the 3 inout buffers. We have abstracted these two steps with the python wrapper function `configure_simple_tracing_aie2` which is in [python/utils/test.py](../../../python/utils/test.py) and is described in more detail the [README](../../../python/utils) under `python/utils`. An example of how this function is used is shown below for quick reference
The first necessary component of trace configuration is setting the right values in the trace control registers of each tile that we want to enable tracing for. In addition, the generated trace packets need to be routed to the shimDMA and then written to one of the 3 inout buffers. We have abstracted these two steps with the Python wrapper function `configure_packet_tracing_aie2`, which is in [python/utils/test.py](../../../python/utils/test.py) and is described in more detail in the [README](../../../python/utils) under `python/utils`. An example of how this function is used is shown below for quick reference:
```python
trace_utils.configure_simple_tracing_aie2(
ComputeTile2,
ShimTile,
ddr_id=2,
size=traceSizeInBytes,
offset=tensorSize,
)
trace_utils.configure_packet_tracing_aie2(tiles_to_trace, ShimTile, opts.trace_size, 4096*4)
```
The arguments for this example are:
* *tiles_to_trace* - array of compute tiles we want to trace
* *ShimTile* - shim tile that the trace is going out to
* *opts.trace_size* - the trace buffer size in bytes
* `4096*4` - the output buffer offset in bytes where the trace data begins (see the short calculation below)
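
As a quick sanity check, the offset here is simply the size of the output data that precedes the trace data in the inout buffer. A minimal sketch, assuming the 4096-element `int32` output used in the surrounding examples:

```python
import numpy as np

data_size = 4096                       # number of output elements (assumed from the examples above)
elem_bytes = np.dtype(np.int32).itemsize
trace_offset = data_size * elem_bytes  # 4096*4 = 16384 bytes
# Trace data is written into the same inout buffer starting at this offset,
# i.e. immediately after the last output element.
print(trace_offset)                    # 16384
```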

This block is defined within the sequence definition for `@runtime_sequence`, where we define the shimDMA data movement to the 3 inout buffers.
> **Note** This simplification works very well for the trace buffer from a single tile to the shimDMA. However, if we want to do something more advanced like allocating the trace buffer from multiple tiles into a single larger buffer, this function will not be able to express that. For that, please consult the [README](../../../python/utils) under `python/utils` for more guidance on how to customize the trace configuration.
> **Note** This simplified wrapper is an enhanced version of the simpler `configure_simple_tracing_aie2` used previously, which routed the trace from a single compute tile using circuit-switched routing. This enhanced version relies on packet-switched routing and supports tracing from multiple tiles by synchronizing the start event of each tile's trace unit to a user-generated event. More guidance on how to customize the trace configuration can be found in the [README](../../../python/utils) under `python/utils`.
### <u>(1b) Define trace event routes from tile to shimDMA</u>
Once the trace units and shimDMA are configured, we need to define how the trace packets are routed from the compute tile to the shim tile. This is done via circuit-switched or packet-switched flows, as described below. Note that trace units in the MemTile and ShimTile can also be configured and routed.

We can simplify defining the packet-switched flows for the tiles we're tracing with the function `configure_packet_tracing_flow`, defined in [python/utils/test.py](../../../python/utils/test.py) and described in more detail in the [README](../../../python/utils) under `python/utils`. An example of how this function is used is shown below for quick reference:
```python
trace_utils.configure_packet_tracing_flow(tiles_to_trace, ShimTile)
```
The arguments for this example are:
* *tiles_to_trace* - array of compute tiles we want to trace
* *ShimTile* - shim tile that the trace is going out to

> **Note** The synchronization of this function with the previous `configure_packet_tracing_aie2` is important because we track the route IDs and BD numbers of each configured trace. Do not mix and match these with circuit-switched routing, as they are intended to work together as a packet-tracing pair.

More details about the mechanics of circuit- and packet-switched flows are described below. Otherwise, you can skip ahead to section 2, *Configure host code to read trace data and write it to a text file*.

#### <u>Circuit switched flows</u>
An example of a simple circuit-switched routing flow to route trace event packets from a compute tile to a shimDMA is sketched below.
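
A minimal sketch of what such a flow declaration might look like, assuming the `flow(...)` helper and `WireBundle` enum from the mlir-aie Python bindings; the tile names (`ComputeTile2`, `ShimTile`) and channel numbers are placeholders carried over from the earlier examples:

```python
# Sketch: carry trace packets from the compute tile's trace port to a
# shim tile DMA input channel over a dedicated circuit-switched route.
# (Channel numbers are illustrative; this goes inside the device definition.)
flow(ComputeTile2, WireBundle.Trace, 0, ShimTile, WireBundle.DMA, 1)
```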
