ISSUE: ERROR when using shared buffer and lock function #1743

ngdymx · 2024-09-04T05:27:56Z

Hi team,

I am working on using aie.use_lock to control the write and read sequence for sharing a buffer between two ComputeTiles. However, I’ve encountered an issue that I’m unsure how to resolve.

Could you please assist me with this? I would greatly appreciate your help.

I have configured the ComputeTiles (1, 4) and (0, 4) as follows: the Add one function is implemented on tile (1, 4), while the passthrough function is placed on tile (0, 4). The attached figure illustrates the dataflow diagram for this setup.

I mimic the aie.mlir under the folder /mlir-aie/mlir_tutorials/tutorial-3 to use the aie.use_lock.

%core_1_4 = aie.core(%tile_1_4) {      
      // while (1)                                                                                                                                      
      scf.for %arg0 = %c0 to %c9223372036854775807 step %c1  {         
      // for i in range(4):                                                                                                                                                                 
             scf.for %arg1 = %c0_0 to %c4 step %c1_1{                                                                                                                                 
                  %0 = aie.objectfifo.acquire @in(Consume, 1) : !aie.objectfifosubview<memref<128xi32>>                                                                                  
                  %1 = aie.objectfifo.subview.access %0[0] : !aie.objectfifosubview<memref<128xi32>> -> memref<128xi32>     
                  //                                                             
                  aie.use_lock(%lock04, "Acquire", 0)                                                                                                                                                                                                                                                           
                  func.call @addOne(%1, %shareBuffer, %c128_i32) : (memref<128xi32>, memref<128xi32>, i32) -> ()                                                                         
                  aie.use_lock(%lock04, "Release", 1)     
                  //                                                                                                                               
                  aie.objectfifo.release @in(Consume, 1)                                                                                                                                 
           }                                                                                                                                                                         
    } 
                                                                                                                                         
%core_0_4 = aie.core(%tile_0_4) {   
     // while (1)                                                                                                                                     
     scf.for %arg0 = %c0 to %c9223372036854775807 step %c1 {    
     // for i in range(4):                                                                
           scf.for %arg1 = %c0_0 to %c4 step %c1_1{                                                                                                                                 
                 %0 = aie.objectfifo.acquire @out(Produce, 1) : !aie.objectfifosubview<memref<128xi32>>                                                                                 
                 %1 = aie.objectfifo.subview.access %0[0] : !aie.objectfifosubview<memref<128xi32>> -> memref<128xi32>   
                 //
                 aie.use_lock(%lock04, "Acquire", 1)                                                                                                                                    
                 memref.copy %shareBuffer, %1 : memref<128xi32> to memref<128xi32>                                                                                                                  
                 aie.use_lock(%lock04, "Release", 0)  
                 //                                                                                                                                  
                 aie.objectfifo.release @out(Produce, 1)                                                                                                                                                                                                                                                                                                                
           }                                                                                                                                                                        
   }

The input data is:

The output I am expecting to see is as follows:

The output I got:

After obtaining this result, I modified the passthrough function on tile (0, 4) to an add_i function, where i is the index parameter of the for loop, the range of i is 1 to 4 step 1. This change was made to test the running time of tile (0, 4).

The result I got:

I have confirmed that the running time for tile (0, 4) is 4, which matches our expectations.

Then, I modified theadd one function on tile (1, 4) to an add_i function to test the running time of tile (1, 4).

The result I got:

In this case, the running time for tile (1, 4) appears to occur only once. However, since tile (0, 4) operates successfully, the aie.use_lock acquire and release operations on tile (1, 4) should also function correctly. This suggests that the lock acquire/release on tile (1, 4) are actually ran for four times, but the function (kernel) is only being called once.

Would you be able to help me look into this matter? I would greatly appreciate it.

The following the main code:

Function code:

#define NOCPP

#include <stdint.h>
#include <stdio.h>
#include <stdlib.h>
#include <type_traits>

#include <aie_api/aie.hpp>

template <typename T, int N>
__attribute__((noinline)) void add_one_aie(T *restrict in, T *restrict out, const int32_t size) {
    event0();
  aie::vector<T, N> In;
  aie::vector<T, N> Out;
  aie::vector<T, N> ones = aie::broadcast<T, N>(1);

  for (int j = 0; j < size; j += N) 
    chess_prepare_for_pipelining chess_loop_range(6, ) { 
        In = aie::load_v<N>(in);
        in += N;
        Out = aie::add(In, ones);
        aie::store_v(out, Out);
        out += N;        
    }
  event1();
}

template <typename T, int N>
__attribute__((noinline)) void addi_aie(T *restrict in, T *restrict out, const int32_t size) {
    static int i = 1;
event0();
  aie::vector<T, N> In0;
  aie::vector<T, N> Out;
  aie::vector<T, N> iones = aie::broadcast<T, N>(i);
  aie::vector<T, N> ones = aie::broadcast<T, N>(1);
  aie::vector<T, N> zeros = aie::broadcast<T, N>(0);

  for (int j = 0; j < size; j += N) 
    chess_prepare_for_pipelining chess_loop_range(6, ) { 
        In0 = aie::load_v<N>(in);
        in += N;
        Out = aie::add(In0, iones);
        aie::store_v(out, Out);
        out += N;        
    }
    i = i + 1;
event1();
}

extern "C" {

void addi(int32_t *in, int32_t *out, int32_t tilesize) {
  addi_aie<int32_t, 16>(in, out, tilesize);
}

void addOne(int32_t *in, int32_t *out, int32_t tilesize) {
  add_one_aie<int32_t, 16>(in, out, tilesize);
}

}

aie.mlir

module {
  aie.device(npu1_2col) {
    func.func private @addi(memref<128xi32>, memref<128xi32>, i32)
    func.func private @addOne(memref<128xi32>, memref<128xi32>, i32)
    %tile_0_0 = aie.tile(0, 0)
    %tile_1_0 = aie.tile(1, 0)
    %tile_0_4 = aie.tile(0, 4)
    %tile_1_4 = aie.tile(1, 4)
    aie.objectfifo @in(%tile_1_0, {%tile_1_4}, 2 : i32) : !aie.objectfifo<memref<128xi32>>
    aie.objectfifo @out(%tile_0_4, {%tile_0_0}, 2 : i32) : !aie.objectfifo<memref<128xi32>>
    %shareBuffer = aie.buffer(%tile_0_4) {sym_name = "shareBuffer"} : memref<128xi32> 
    %lock04 = aie.lock(%tile_0_4, 1) { sym_name = "lock_a04" }

    %core_1_4 = aie.core(%tile_1_4) {
      %c0 = arith.constant 0 : index
      %c9223372036854775807 = arith.constant 9223372036854775807 : index
      %c1 = arith.constant 1 : index
      scf.for %arg0 = %c0 to %c9223372036854775807 step %c1 {
        %c0_0 = arith.constant 0 : index
        %c4 = arith.constant 4 : index
        %c1_1 = arith.constant 1 : index
        scf.for %arg1 = %c0_0 to %c4 step %c1_1{
          %0 = aie.objectfifo.acquire @in(Consume, 1) : !aie.objectfifosubview<memref<128xi32>>
          %1 = aie.objectfifo.subview.access %0[0] : !aie.objectfifosubview<memref<128xi32>> -> memref<128xi32>
          aie.use_lock(%lock04, "Acquire", 0)
          %c128_i32 = arith.constant 128 : i32
          func.call @addi(%1, %shareBuffer, %c128_i32) : (memref<128xi32>, memref<128xi32>, i32) -> ()
          aie.use_lock(%lock04, "Release", 1)
          aie.objectfifo.release @in(Consume, 1)
       }
      }
      aie.end
    } {link_with = "passThrough.o"}
    %core_0_4 = aie.core(%tile_0_4) {
      %c0 = arith.constant 0 : index
      %c9223372036854775807 = arith.constant 9223372036854775807 : index
      %c1 = arith.constant 1 : index
      scf.for %arg0 = %c0 to %c9223372036854775807 step %c1 {
        %c0_0 = arith.constant 0 : index
        %c4 = arith.constant 4 : index
        %c1_1 = arith.constant 1 : index
        scf.for %arg1 = %c0_0 to %c4 step %c1_1{
          %0 = aie.objectfifo.acquire @out(Produce, 1) : !aie.objectfifosubview<memref<128xi32>>
          %1 = aie.objectfifo.subview.access %0[0] : !aie.objectfifosubview<memref<128xi32>> -> memref<128xi32>
          %c128_i32 = arith.constant 128 : i32
          aie.use_lock(%lock04, "Acquire", 1)
          memref.copy %shareBuffer, %1 : memref<128xi32> to memref<128xi32>
          //func.call @addi(%shareBuffer, %1, %c128_i32) : (memref<128xi32>, memref<128xi32>, i32) -> ()
          aie.use_lock(%lock04, "Release", 0)
          aie.objectfifo.release @out(Produce, 1)
          
        }
      }
      aie.end
    } {link_with = "passThrough.o"}
    aiex.runtime_sequence(%arg0: memref<512xi32>, %arg1: memref<512xi32>) {
      aiex.npu.dma_memcpy_nd(0, 0, %arg1[0, 0, 0, 0][1, 1, 1, 512][0, 0, 0, 1]) {id = 0 : i64, metadata = @out} : memref<512xi32>
      aiex.npu.dma_memcpy_nd(0, 0, %arg0[0, 0, 0, 0][1, 1, 1, 512][0, 0, 0, 1]) {id = 1 : i64, metadata = @in} : memref<512xi32>
      aiex.npu.sync {channel = 0 : i32, column = 0 : i32, column_num = 1 : i32, direction = 0 : i32, row = 0 : i32, row_num = 1 : i32}
    }
  }
}

Host:

#include <cstdint>
#include <fstream>
#include <iostream>
#include <sstream>

#include "test_utils.h"
#include "xrt/xrt_bo.h"

#ifndef DATATYPES_USING_DEFINED
#define DATATYPES_USING_DEFINED
// ------------------------------------------------------
// Configure this to match your buffer data type
// ------------------------------------------------------
// using DATATYPE = std::uint8_t;
// using DATATYPE = std::uint32_t;
using DATATYPE = std::int32_t;
#endif

namespace po = boost::program_options;

int main(int argc, const char *argv[]) {

  // Program arguments parsing
  po::options_description desc("Allowed options");
  po::variables_map vm;
  test_utils::add_default_options(desc);

  test_utils::parse_options(argc, argv, desc, vm);
  int verbosity = vm["verbosity"].as<int>();
  int trace_size = vm["trace_sz"].as<int>();

  constexpr bool VERIFY = true;
  constexpr int IN_VOLUME = 512;
  constexpr int OUT_VOLUME = IN_VOLUME;

  int IN_SIZE = IN_VOLUME * sizeof(DATATYPE);
  int OUT_SIZE = OUT_VOLUME * sizeof(DATATYPE);

  // Load instruction sequence
  std::vector<uint32_t> instr_v =
      test_utils::load_instr_sequence(vm["instr"].as<std::string>());

  if (verbosity >= 1)
    std::cout << "Sequence instr count: " << instr_v.size() << "\n";

  // Start the XRT context and load the kernel
  xrt::device device;
  xrt::kernel kernel;

  test_utils::init_xrt_load_kernel(device, kernel, verbosity,
                                   vm["xclbin"].as<std::string>(),
                                   vm["kernel"].as<std::string>());

  // set up the buffer objects
  auto bo_instr = xrt::bo(device, instr_v.size() * sizeof(int),
                          XCL_BO_FLAGS_CACHEABLE, kernel.group_id(1));
  auto bo_inA =
      xrt::bo(device, IN_SIZE, XRT_BO_FLAGS_HOST_ONLY, kernel.group_id(3));
  auto bo_outC =
      xrt::bo(device, OUT_SIZE, XRT_BO_FLAGS_HOST_ONLY, kernel.group_id(4));

  if (verbosity >= 1)
    std::cout << "Writing data into buffer objects.\n";

  // Copy instruction stream to xrt buffer object
  void *bufInstr = bo_instr.map<void *>();
  memcpy(bufInstr, instr_v.data(), instr_v.size() * sizeof(int));

  // Initialize buffer bo_inA
  DATATYPE *bufInA = bo_inA.map<DATATYPE *>();
  for (int i = 0; i < IN_VOLUME; i++)
    bufInA[i] = i + 1;

// print input
  for (uint32_t i = 0; i < IN_VOLUME; i++) {
    int32_t test = bufInA[i];
    printf("%d ", test);
    if (i % 32 == 31){
        printf("\n");
    }

  }
  // Zero out buffer bo_outC
  DATATYPE *bufOut = bo_outC.map<DATATYPE *>();
  memset(bufOut, 0, OUT_SIZE);

  // sync host to device memories
  bo_instr.sync(XCL_BO_SYNC_BO_TO_DEVICE);
  bo_inA.sync(XCL_BO_SYNC_BO_TO_DEVICE);
  bo_outC.sync(XCL_BO_SYNC_BO_TO_DEVICE);

  float npu_time_total = 0;
  // Execute the kernel and wait to finish
  if (verbosity >= 1)
    std::cout << "Running Kernel.\n";
  auto start = std::chrono::high_resolution_clock::now();
  unsigned int opcode = 3;
  auto run =
      kernel(opcode, bo_instr, instr_v.size(), bo_inA, bo_outC);
  run.wait();
   auto stop = std::chrono::high_resolution_clock::now();

  // Sync device to host memories
  bo_outC.sync(XCL_BO_SYNC_BO_FROM_DEVICE);
  float npu_time =
        std::chrono::duration_cast<std::chrono::microseconds>(stop - start)
                    .count();

  npu_time_total += npu_time;
  std::cout << std::endl
           << "Avg NPU time: " << npu_time_total << "us."
           << std::endl;

  // Compare out to golden
  int errors = 0;
  if (verbosity >= 1) {
    std::cout << "Verifying results ..." << std::endl;
  }
  for (uint32_t i = 0; i < IN_VOLUME; i++) {
    int32_t ref = bufInA[i] + 1;
    int32_t test = bufOut[i];
    if (test != ref) {
      if (verbosity >= 1)
        std::cout << "Error in output " << test << " != " << ref << std::endl;
      errors++;
    } else {
      if (verbosity >= 1)
        std::cout << "Correct output " << test << " == " << ref << std::endl;
    }
  }
//Print out
  for (uint32_t i = 0; i < IN_VOLUME; i++) {
    int32_t test = bufOut[i];
    printf("%d ", test);
    if (i % 32 == 31){
        printf("\n");
    }

  }


  // Print Pass/Fail result of our test
  if (!errors) {
    std::cout << std::endl << "PASS!" << std::endl << std::endl;
    return 0;
  } else {
    std::cout << std::endl
              << errors << " mismatches." << std::endl
              << std::endl;
    std::cout << std::endl << "fail." << std::endl << std::endl;
    return 1;
  }
}

#
# This file is licensed under the Apache License v2.0 with LLVM Exceptions.
# See https://llvm.org/LICENSE.txt for license information.
# SPDX-License-Identifier: Apache-2.0 WITH LLVM-exception
#
# Copyright (C) 2024, Advanced Micro Devices, Inc.

SRCDIR := $(shell dirname $(realpath $(firstword $(MAKEFILE_LIST))))

VITIS_ROOT ?= $(shell realpath $(dir $(shell which vitis))/../)
VITIS_AIETOOLS_DIR ?= ${VITIS_ROOT}/aietools
VITIS_AIE_INCLUDE_DIR ?= ${VITIS_ROOT}/aietools/data/versal_prod/lib
VITIS_AIE2_INCLUDE_DIR ?= ${VITIS_ROOT}/aietools/data/aie_ml/lib

CHESSCC1_FLAGS = -f -p me -P ${VITIS_AIE_INCLUDE_DIR} -I ${VITIS_AIETOOLS_DIR}/include
CHESSCC2_FLAGS = -f -p me -P ${VITIS_AIE2_INCLUDE_DIR} -I ${VITIS_AIETOOLS_DIR}/include -D__AIENGINE__=2 -D__AIEARCH__=20
CHESS_FLAGS = -P ${VITIS_AIE_INCLUDE_DIR}

CHESSCCWRAP1_FLAGS = aie -I ${VITIS_AIETOOLS_DIR}/include 
CHESSCCWRAP2_FLAGS = aie2 -I ${VITIS_AIETOOLS_DIR}/include 

trace_size = 8192

HOST_O_DIR := build/host
HOST_C_TARGET := host.exe

KERNEL_O_DIR := build/bitstream
KERNEL_SRCS := $(wildcard $(SRCDIR)/kernel/*.cc)
KERNEL_OBJS := $(patsubst $(SRCDIR)/kernel/%.cc, ${KERNEL_O_DIR}/%.o, $(KERNEL_SRCS))
KERNEL_HEADERS := $(wildcard $(SRCDIR)/kernel/*.h)

MLIR_O_DIR := kernel
MLIR_TARGET := ${MLIR_O_DIR}/aie.mlir

BITSTREAM_O_DIR := build/bitstream
XCLBIN_TARGET := ${BITSTREAM_O_DIR}/final.xclbin
INSTS_TARGET := ${BITSTREAM_O_DIR}/insts.txt

.PHONY: all kernel link bitstream host clean
all: ${XCLBIN_TARGET} ${INSTS_TARGET} ${HOST_C_TARGET}

clean:
	-@rm -rf build 
	-@rm -rf log

kernel: ${KERNEL_OBJS}


link: ${MLIR_TARGET}


bitstream: ${XCLBIN_TARGET}


host: ${HOST_C_TARGET}


# Build host
${HOST_C_TARGET}: ${SRCDIR}/host/host.cpp 
	rm -rf ${HOST_O_DIR}
	mkdir -p ${HOST_O_DIR}
	cd ${HOST_O_DIR} && cmake -E env CXXFLAGS="-std=c++23 -ggdb" cmake ../.. -D CMAKE_C_COMPILER=gcc-13 -D CMAKE_CXX_COMPILER=g++-13 -DTARGET_NAME=${HOST_C_TARGET} -Dsubdir=${subdir}
	cd ${HOST_O_DIR} && cmake --build . --config Release
	cp ${HOST_O_DIR}/${HOST_C_TARGET} ./

# Build kernels
${KERNEL_O_DIR}/%.o: ${SRCDIR}/kernel/%.cc ${KERNEL_HEADERS}
	mkdir -p ${@D}
	cd ${@D} && xchesscc_wrapper ${CHESSCCWRAP2_FLAGS} -DINT8_ACT -c $< -o ${@F}

# Build bitstream
${XCLBIN_TARGET}: ${MLIR_TARGET} 
	mkdir -p ${@D}
	cd ${BITSTREAM_O_DIR} && aiecc.py --aie-generate-cdo --no-compile-host --xclbin-name=${@F} \
		--aie-generate-npu --npu-insts-name=${INSTS_TARGET:${BITSTREAM_O_DIR}/%=%} $(<:${MLIR_O_DIR}/%=../../kernel/%) 
	#cd ${@D} && aiecc.py --aie-generate-cdo --no-compile-host --xclbin-name=${@F} \
	#			--aie-generate-npu --npu-insts-name=insts.txt $(<:%=../%)

.PHONY: run
run: ${HOST_C_TARGET} ${XCLBIN_TARGET} ${INSTS_TARGET} #sign
	export XRT_HACK_UNSECURE_LOADING_XCLBIN=1 && \
	./$< -x ${XCLBIN_TARGET} -i ${INSTS_TARGET} -k MLIR_AIE

trace: ${HOST_C_TARGET} ${XCLBIN_TARGET} ${INSTS_TARGET} # sign
	export XRT_HACK_UNSECURE_LOADING_XCLBIN=1 && \
	./$< -x ${XCLBIN_TARGET} -i ${INSTS_TARGET} -k MLIR_AIE -t ${trace_size}
	./parse_trace.py --filename trace.txt --mlir ${MLIR_TARGET} --colshift 1 > trace_mm.json

run_py: ${XCLBIN_TARGET} ${INSTS_TARGET} ${SRCDIR}/host/test.py
	python3 ${SRCDIR}/host/test.py -x ${<} -i ${INSTS_TARGET} -k MLIR_AIE

The text was updated successfully, but these errors were encountered:

AndraBisca self-assigned this Sep 20, 2024

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

ISSUE: ERROR when using shared buffer and lock function #1743

ISSUE: ERROR when using shared buffer and lock function #1743

ngdymx commented Sep 4, 2024

ISSUE: ERROR when using shared buffer and lock function #1743

ISSUE: ERROR when using shared buffer and lock function #1743

Comments

ngdymx commented Sep 4, 2024