Riallto enables users to develop kernels that execute on the NPU compute tiles. However, there are still some challenges with this; users need to specify a callgraph, and there is little debug support to determine what is happening within the kernel and measure performance. This blog outlines a lightweight way to rapidly develop kernels that includes some preliminary, experimental printf support. The printf implementation is very lightweight, using variadic templates to reduce instruction memory overheads. It also keeps runtime overheads low by pushing the format string rendering to the host after the kernel has finished executing.
Let's walk through the example below:
[1] This cell imports the Python package for evaluating the kernel along with NumPy.
[2] Cell 2 contains the kernel's C++ code. Here, the kernel performs a memcpy of the input data to the output. We can also see an example of how the printf feature is used to report the tile coordinates and the cycle count value.
[3] Input data is specified as NumPy arrays. We have an input buffer populated with random uint8 values and use NumPy to identify the type and shape of the output buffer.
[4] Here, the kernel is executed on the NPU device. The printf messages are printed, and the results of the kernel execution are saved.
[5] Python and NumPy enable rapid verification of the output. In a more extensive example, this could also include PyTorch or OpenCV libraries that are easily accessible in Jupyter.
How does our NPU printf work?
The printf functionality displayed above provides a convenient way to print debug messages and performance counter values when quickly developing kernels. However, printf-style functionality often presents challenges in such an embedded context:
printf-style messaging consumes significant instruction memory.
The performance overheads of such a logging mechanism are quite high.
Addressing the Instruction Memory Challenge
Compute tiles on NPU devices have only 16KB of instruction memory and, as such, are highly constrained. The standard printf function in C is complex and supports various formatting options.
To support all these features, the code for printf is typically quite large (approx. 90KB). It includes code to parse the format string, handle each possible type of input, format the output, and handle errors. Moreover, printf often pulls in a large part of the standard library, such as ftoa to convert floats to strings. There are lightweight versions of printf, such as xil_printf, which uses 1KB of instruction memory (our approach uses 80 bytes per unique printf signature). However, they are often missing key functionality, such as the ability to print floating-point values.
Our solution to this is to elaborate calls to our printf logging functions at compile time using C++ variadic templates instead of handling everything at runtime. A variadic template can take an arbitrary number of template arguments at compile time, allowing functions and classes to operate on any number of potentially different types of arguments. Operating at compile time means that the generality provided by a printf-style logging implementation no longer needs a large instruction memory footprint to handle all cases at runtime. See below for an example snippet of a variadic function call:
template<typename T>
void printf(T value) {
    // Code to print a single value
}

template<typename T, typename... Args>
void printf(T value, Args... args) {
    printf(value);   // print the first argument
    printf(args...); // recurse on the remaining arguments
}
Every printf call in our kernel code with a unique number of parameters and parameter types will have a custom specialised printf function elaborated for it from something like the recursive variadic template seen above (check the gist for the full example). Each printf call will append the format string address, along with the parameters for the call, to a buffer that the host will decode later. Doing things in this fashion pushes the parsing overheads and complexity, which typically further increase the instruction memory footprint, to the host, where the work can be performed after kernel execution has completed.
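The idea can be sketched as follows. This is a minimal, self-contained illustration rather than Riallto's actual implementation; the names log_buf, log_append, log_args, and log_printf are hypothetical:

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical on-tile log buffer and write pointer.
static uint8_t log_buf[256];
static uint32_t log_ptr = 0;

// Append one 4-byte value to the log buffer (no formatting on the tile).
template<typename T>
void log_append(T value) {
    static_assert(sizeof(T) == 4, "each logged item occupies 4 bytes");
    std::memcpy(log_buf + log_ptr, &value, sizeof(T));
    log_ptr += sizeof(T);
}

// Base case: no arguments left.
inline void log_args() {}

// Recursive case: append the first argument, then recurse on the rest.
// Each unique argument-type combination elaborates its own small function.
template<typename T, typename... Args>
void log_args(T value, Args... args) {
    log_append(value);
    log_args(args...);
}

// Entry point: record the format string's address, then the raw arguments.
template<typename... Args>
void log_printf(const char* fmt, Args... args) {
    log_append(static_cast<uint32_t>(reinterpret_cast<uintptr_t>(fmt)));
    log_args(args...);
}
```

Because the recursion is resolved at compile time, only the specialisations actually used by the kernel end up in instruction memory.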
Addressing the Performance Overhead Challenge
Generally, performing a printf on an embedded system is very expensive, as at runtime you have to:
Parse the format string to determine the types and locations of the arguments.
Convert the various arguments to their string representations.
Perform memory management on the formatted string.
In our approach, we redistribute the responsibilities in the printf logging process, shifting the expensive, time-consuming, high-overhead operations from the embedded compute tile to the host. To achieve this, our variadic template printf functions are designed to append only the essential data to the output buffer, minimizing the processing overhead. All that's appended is the address of the constant format string in the instruction memory and the value of each parameter, making the process more streamlined and efficient.
The above figure shows an example of this in action, where we write a log message. Each printf call appends a struct of data, whose layout depends on the format of the variadic call, to the end of a buffer. In this example, it appends three items to the buffer: the address of the format string "iterations=%u, cycles=%u\n" (avoiding sending the string itself); the value of the integer parameter iter at the time of the call; followed by the value of the integer parameter cycles at the time of the call. In total, sending this log message to the host requires 12 bytes.
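As a sanity check on the 12-byte figure, the record is just three 4-byte fields. Sketched as a struct (the field names are illustrative; the real buffer is written field by field, not as a C struct):

```cpp
#include <cstdint>

// Sketch of the record appended for the "iterations=%u, cycles=%u\n" call.
struct LogRecord {
    uint32_t fmt_addr; // address of the format string in instruction memory
    uint32_t iter;     // value of iter at the time of the call
    uint32_t cycles;   // value of cycles at the time of the call
};

static_assert(sizeof(LogRecord) == 12, "three 4-byte fields = 12 bytes");
```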
After the buffer containing the log messages is sent back to the host, it must then be decoded to reassemble the messages. The figure below shows a rough overview of this process:
Firstly, a mapping (addr2str) from the format string address (in the Compute Tile instruction memory) to the format string is created by parsing the compiled elf file.
Packets are then parsed off the log buffer. The current location of the read pointer (rd_ptr) is used to get the format string address. The previously created addr2str mapping looks up the corresponding format string.
The looked-up format string is then parsed to determine the number of parameters that the format string requires. The corresponding number of parameters is then peeled off the front of the log buffer. Finally, the format string is rendered, marking the successful completion of the decoding process.
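The decode loop described above can be sketched in host-side C++. This is a simplified illustration: addr2str is filled in by hand here, whereas the real flow builds it by parsing the ELF file, and only %u specifiers are handled:

```cpp
#include <cstdint>
#include <cstdio>
#include <map>
#include <string>
#include <vector>

// Decode a log buffer of 4-byte words into rendered messages.
std::string decode_log(const std::vector<uint32_t>& buf,
                       const std::map<uint32_t, std::string>& addr2str) {
    std::string out;
    size_t rd_ptr = 0;
    while (rd_ptr < buf.size()) {
        // 1. The word at rd_ptr is the format string's address; look it up.
        const std::string& fmt = addr2str.at(buf[rd_ptr++]);
        // 2. Count the %u specifiers to know how many parameters to peel off.
        size_t nargs = 0;
        for (size_t i = 0; i + 1 < fmt.size(); ++i)
            if (fmt[i] == '%' && fmt[i + 1] == 'u') ++nargs;
        // 3. Render the format string with the peeled-off parameters.
        std::string rendered = fmt;
        for (size_t i = 0; i < nargs; ++i) {
            char num[16];
            std::snprintf(num, sizeof(num), "%u", buf[rd_ptr++]);
            rendered.replace(rendered.find("%u"), 2, num);
        }
        out += rendered;
    }
    return out;
}
```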
Conclusion
With our logging technique, we address both the instruction memory and performance overhead challenges of kernel development. By using C++ variadic templates, we maintain the flexibility of printf logging while pushing its generality to compile time rather than runtime. By transferring only the format string address and parameters, we shift the expensive parsing operations to the host, where they occur separately from the computation being logged. This separation of concerns keeps the logging process efficient, requiring minimal data transfer from the Compute Tiles and eliminating both string transfers and on-target parsing.
If you'd like to check out a simple example of using this feature with Riallto, please check out this gist. [Note: this has only been tested on the Linux version of Riallto.]