Memory spill with bfloat16 multiplication #69

giacomo-brunetta · 2024-10-13T18:21:56Z

Hi everyone!
I'm having trouble with a memory spill caused by multiplying two v16bfloat16 or two v32bfloat16 vectors.
My goal is a function that takes v16bfloat16 x and v16bfloat16 y and returns v16bfloat16 z = x*y. My current implementation is to_v16bfloat16(aie::mul(x, y)).
I use this function to evaluate a polynomial approximation as follows:

%%kernel

#include <aie_api/aie.hpp>

const int POLY_GRADE = 2;
const int SIZE = 16;

const bfloat16 coeff[3] = {
    -1.650456069306855,
     1.9964060767155798,
    -0.33722549505883914
};

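template<int N=16>
inline aie::vector<bfloat16, N> bfloat16_mul(aie::vector<bfloat16, N> x, aie::vector<bfloat16, N> y){
    if constexpr (N == 16){
        // aie::mul yields a float accumulator; convert it back with the intrinsic
        return to_v16bfloat16(aie::mul(x, y));
    }
    else{ // N == 32
        aie::vector<float,32> z_acc = aie::mul(x, y);
        // View the 32 floats as 64 int16 values and keep the odd-indexed ones
        // (the upper half of each float), i.e. truncate to bfloat16
        aie::vector<int16, 64> temp = v64int16(z_acc);
        aie::vector<int16, 32> temp_trunc = aie::filter_odd(temp);
        return aie::vector_cast<bfloat16>(temp_trunc);
    }
}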

template<int N=16>
inline aie::vector<bfloat16, N> polynomial(aie::vector<bfloat16, N> x){
    aie::vector<bfloat16,N> y = aie::broadcast<bfloat16,N>(coeff[POLY_GRADE]);
    for(int i = POLY_GRADE - 1; i >= 0; i--){
        y = bfloat16_mul<N>(x, y);
        y = aie::add(coeff[i],y);
    }
    return y;
}

void prod(bfloat16* in_buffer, bfloat16* out_buffer){
    aie::vector<bfloat16, SIZE> x;
    aie::vector<bfloat16, SIZE> y = aie::broadcast<bfloat16, SIZE>(1);
    for(int i=0; i< 1024/SIZE; i++) {
        x = aie::load_v<SIZE>(in_buffer);
        x = polynomial<SIZE>(x);
        aie::store_v(out_buffer, x);
        in_buffer  += SIZE;
        out_buffer += SIZE;
    }
}
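
As a side note on the N == 32 branch: it appears to reinterpret each 32-bit float product as two int16 halves and keep the odd-indexed one, which on a little-endian layout is the upper 16 bits, so the float is truncated (not rounded) to bfloat16. The plain C++ sketch below emulates one lane of that on the host so results can be compared. The names `float_to_bf16_trunc`, `bf16_to_float` and `bf16_mul_ref` are hypothetical helpers, not AIE API functions, and the thread does not establish whether `to_v16bfloat16` in the N == 16 branch rounds the same way.

```cpp
#include <cstdint>
#include <cstring>

// Hypothetical host-side helper (not an AIE API call): keep only the upper
// 16 bits of an IEEE-754 binary32 value, i.e. truncate it to bfloat16.
inline std::uint16_t float_to_bf16_trunc(float f) {
    std::uint32_t bits;
    std::memcpy(&bits, &f, sizeof bits);   // well-defined type punning
    return static_cast<std::uint16_t>(bits >> 16);
}

// Hypothetical inverse: widen a bfloat16 bit pattern back to float by
// placing it in the upper half and zero-filling the lower 16 bits.
inline float bf16_to_float(std::uint16_t h) {
    const std::uint32_t bits = static_cast<std::uint32_t>(h) << 16;
    float f;
    std::memcpy(&f, &bits, sizeof f);
    return f;
}

// Scalar model of one lane: multiply in float, then truncate to bfloat16.
inline float bf16_mul_ref(float x, float y) {
    return bf16_to_float(float_to_bf16_trunc(x * y));
}
```

For example, `bf16_mul_ref(1.0f, 1.005859375f)` gives `1.0f` because the low 16 bits are dropped, whereas round-to-nearest would give `1.0078125f`.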

The kernel above builds and runs correctly. However, if I modify prod as follows:

void prod(bfloat16* in_buffer, bfloat16* out_buffer){
    aie::vector<bfloat16, SIZE> x;
    aie::vector<bfloat16, SIZE> y = aie::broadcast<bfloat16, SIZE>(1);
    for(int i=0; i< 1024/SIZE; i++) {
        x = aie::load_v<SIZE>(in_buffer);
        x = polynomial<SIZE>(x);
        x = bfloat16_mul<SIZE>(x,x); // <--
        aie::store_v(out_buffer, x);
        in_buffer  += SIZE;
        out_buffer += SIZE;
    }
}

I get this error:

Error: cannot bind variable traversing call #25 to memory (signal of type v8w64 cannot be spilled) :
   variable 8 : __tmp typ=v8w64 bnd=m

Does anyone know how to solve this issue?

giacomo-brunetta · 2024-10-13T18:30:57Z

Author

In this case, adding 0 solved the issue, but I do not understand why, and I cannot reproduce the fix elsewhere.

template<int N=16>
inline aie::vector<bfloat16, N> ln(aie::vector<bfloat16, N> x){
    // ln(x) = log2(x)/log2(e)
    aie::vector<bfloat16,N> log2x = log2<N>(x);
    const bfloat16 base = 0.691406;
    aie::vector<bfloat16,N> base_change = aie::broadcast<bfloat16,N>(base);
    aie::vector<bfloat16,N> lnx = bfloat16_mul<N>(log2x, base_change);
    bfloat16 zero = 0.0;
    return aie::add(lnx,zero); // return lnx generates memory spill
}
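
A note on the constant: 1/log2(e) is ln(2) ≈ 0.693147, and 0.691406 is what that value becomes once it is cut down to the 8-bit significand that bfloat16 offers. The host-side sketch below checks this arithmetic in plain C++; `truncate_to_8bit_significand` and `ln_from_log2` are hypothetical names used only for illustration and are unrelated to the AIE API.

```cpp
#include <cmath>

// Truncate a positive finite value to an 8-bit significand, which is the
// precision of bfloat16 (1 implicit bit + 7 stored bits).
inline double truncate_to_8bit_significand(double v) {
    int exp = 0;
    const double m = std::frexp(v, &exp);             // v = m * 2^exp, 0.5 <= m < 1
    const double m8 = std::floor(m * 256.0) / 256.0;  // keep the top 8 bits of m
    return std::ldexp(m8, exp);
}

// Change of base: ln(x) = log2(x) * ln(2) = log2(x) / log2(e).
inline double ln_from_log2(double log2x) {
    return log2x * std::log(2.0);
}
```

With these helpers, `truncate_to_8bit_significand(std::log(2.0))` evaluates to 0.69140625, which matches the `base` constant used above to six decimal places.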

1 reply

giacomo-brunetta Oct 14, 2024
Author

After further experiments, I encountered the same problem without using bfloat16_mul, so it might be related to something else.

mariodruiz · 2024-10-14T09:49:07Z

Maintainer

Hi @giacomo-brunetta,

I do not understand why you are doing this cast: to_v16bfloat16(aie::mul(x, y)). If x and y are 16-lane vectors of bfloat16, the result should not need casting. Also, you're mixing C intrinsics with the AIE API. Can you try with only AIE APIs?

1 reply

giacomo-brunetta Oct 14, 2024
Author

At first I tried y = aie::mul(x,y); and got the following error:

 aie::vector<bfloat16,N> y = aie::broadcast<bfloat16,N>(coeff[POLY_GRADE]);
 prod.cc:40:11: error: no viable overloaded '='
        y = aie::mul(x,y);
        ~ ^ ~~~~~~~~~~~~~

So I assumed that the result of the multiplication is of type aie::accum<accfloat,N> rather than aie::vector<bfloat16,N>.

I tried to convert from accfloat to bfloat16 with the to_vector method, but I could not get it to compile. That is why I used the intrinsic.

    if constexpr (N == 16){
        aie::accum<accfloat, N> tmp = aie::mul(x, y);
        aie::vector<bfloat16, N> res = tmp.to_vector<bfloat16>(0);
        return res;   
    }

prod.cc:26:44: error: use 'template' keyword to treat 'to_vector' as a dependent template name
        aie::vector<bfloat16, N> res = tmp.to_vector<bfloat16>(0);
                                           ^
                                           template

The mixed implementation works, but sometimes I get the memory spill error. Is there a more appropriate way to do it?
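
On the second error: it is a standard C++ rule rather than anything AIE specific. Inside a template, `tmp` has a type that depends on N, so the compiler cannot know that `to_vector` names a member template unless it is told so with the `template` keyword, as the diagnostic suggests. The sketch below uses mock `Vec` and `Accum` types (hypothetical stand-ins, not the AIE API classes) so that it compiles with any host compiler. By analogy, the spelling the diagnostic asks for in the kernel would be `tmp.template to_vector<bfloat16>(0)`; that is an assumption about the kernel build, and it does not by itself address the spill error.

```cpp
#include <array>
#include <cstddef>

// Mock stand-ins for illustration only; these are NOT the AIE API types.
template <typename T, std::size_t N>
struct Vec {
    std::array<T, N> data{};
};

template <typename AccT, std::size_t N>
struct Accum {
    std::array<AccT, N> data{};

    // Member template, similar in shape to a to_vector<T>(shift) conversion.
    template <typename T>
    Vec<T, N> to_vector(int /*shift*/ = 0) const {
        Vec<T, N> out;
        for (std::size_t i = 0; i < N; ++i)
            out.data[i] = static_cast<T>(data[i]);
        return out;
    }
};

// Lane-wise product into a wider accumulator type.
template <std::size_t N>
Accum<double, N> mul(const Vec<float, N>& x, const Vec<float, N>& y) {
    Accum<double, N> acc;
    for (std::size_t i = 0; i < N; ++i)
        acc.data[i] = static_cast<double>(x.data[i]) * y.data[i];
    return acc;
}

template <std::size_t N>
Vec<float, N> mul_to_vec(const Vec<float, N>& x, const Vec<float, N>& y) {
    Accum<double, N> tmp = mul(x, y);
    // tmp's type depends on N, so `template` is required before the member
    // template name; without it the `<` is parsed as a less-than operator.
    return tmp.template to_vector<float>(0);
}
```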

mariodruiz · 2024-10-16T08:50:19Z

Maintainer

Hi @giacomo-brunetta,

Can you please answer in two separate threads with the full code for both cases (when it builds and when it does not)?

giacomo-brunetta · 2024-10-18T15:25:54Z

Author

This is the code that builds and runs correctly:

%%kernel

#include <aie_api/aie.hpp>

const int POLY_GRADE = 2;
const int SIZE = 16;

const bfloat16 coeff[3] = {
    -1.650456069306855,
     1.9964060767155798,
    -0.33722549505883914
};

template<int N=16>
inline aie::vector<bfloat16, N> bfloat16_mul(aie::vector<bfloat16, N> x, aie::vector<bfloat16, N> y){
    if constexpr (N == 16){
        //aie::accum<accfloat, N> tmp = aie::mul(x, y);
        //aie::vector<bfloat16, N> res = tmp.to_vector<bfloat16>(0);
        //return res;   
        return to_v16bfloat16(aie::mul(x, y));
    }
    else{ // N == 32
        aie::vector<float,32> z_acc = aie::mul(x, y);
        aie::vector<int16, 64> temp = v64int16(z_acc);
        aie::vector<int16, 32> temp_trunc = aie::filter_odd(temp);
        return aie::vector_cast<bfloat16>(temp_trunc);
    }
}

template<int N=16>
inline aie::vector<bfloat16, N> polynomial(aie::vector<bfloat16, N> x){
    aie::vector<bfloat16,N> y = aie::broadcast<bfloat16,N>(coeff[POLY_GRADE]);
    for(int i = POLY_GRADE - 1; i >= 0; i--){
        y = bfloat16_mul<N>(x, y);
        y = aie::add(coeff[i],y);
    }
    return bfloat16_mul<N>(y, y); // y is an N-lane vector, so use N rather than SIZE
}

void prod(bfloat16* in_buffer, bfloat16* out_buffer){
    aie::vector<bfloat16, SIZE> x;
    aie::vector<bfloat16, SIZE> y = aie::broadcast<bfloat16, SIZE>(1);
    for(int i=0; i< 1024/SIZE; i++) {
        x = aie::load_v<SIZE>(in_buffer);
        x = polynomial<SIZE>(x);
        aie::store_v(out_buffer, x);
        in_buffer  += SIZE;
        out_buffer += SIZE;
    }
}
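
To check the kernel output on the host, a scalar reference for this version can help. The sketch below mirrors the loop in polynomial (Horner's scheme, starting from the highest coefficient) followed by the final squaring, in double precision. `poly_ref`, `kCoeff` and `kPolyGrade` are hypothetical names, and the result will differ from the device output by whatever bfloat16 rounding is applied after each step.

```cpp
// Same coefficients as the kernel, lowest degree first.
constexpr double kCoeff[3] = {
    -1.650456069306855,
     1.9964060767155798,
    -0.33722549505883914
};
constexpr int kPolyGrade = 2;

// Hypothetical host-side reference: evaluates
// (coeff[2]*x + coeff[1])*x + coeff[0] with Horner's scheme and squares it,
// following the structure of polynomial() without any bfloat16 rounding.
inline double poly_ref(double x) {
    double y = kCoeff[kPolyGrade];
    for (int i = kPolyGrade - 1; i >= 0; --i) {
        y = x * y + kCoeff[i];
    }
    return y * y;
}
```

Comparing `poly_ref(x)` against each output lane with a tolerance suited to bfloat16 (roughly two to three significant decimal digits) gives a quick sanity check of either variant of the kernel, since both compute the same expression.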

giacomo-brunetta · 2024-10-18T15:31:51Z

Author

This code fails to build with the following error:

Error: cannot bind variable traversing call #25 to memory (signal of type v8w64 cannot be spilled) :
   variable 8 : __tmp typ=v8w64 bnd=m
Error in : (block #25)

%%kernel

#include <aie_api/aie.hpp>

const int POLY_GRADE = 2;
const int SIZE = 16;

const bfloat16 coeff[3] = {
    -1.650456069306855,
     1.9964060767155798,
    -0.33722549505883914
};

template<int N=16>
inline aie::vector<bfloat16, N> bfloat16_mul(aie::vector<bfloat16, N> x, aie::vector<bfloat16, N> y){
    if constexpr (N == 16){
        //aie::accum<accfloat, N> tmp = aie::mul(x, y);
        //aie::vector<bfloat16, N> res = tmp.to_vector<bfloat16>(0);
        //return res;   
        return to_v16bfloat16(aie::mul(x, y));
    }
    else{ // N == 32
        aie::vector<float,32> z_acc = aie::mul(x, y);
        aie::vector<int16, 64> temp = v64int16(z_acc);
        aie::vector<int16, 32> temp_trunc = aie::filter_odd(temp);
        return aie::vector_cast<bfloat16>(temp_trunc);
    }
}

template<int N=16>
inline aie::vector<bfloat16, N> polynomial(aie::vector<bfloat16, N> x){
    aie::vector<bfloat16,N> y = aie::broadcast<bfloat16,N>(coeff[POLY_GRADE]);
    for(int i = POLY_GRADE - 1; i >= 0; i--){
        y = bfloat16_mul<N>(x, y);
        y = aie::add(coeff[i],y);
    }
    return y; // removed a multiplication here
}

void prod(bfloat16* in_buffer, bfloat16* out_buffer){
    aie::vector<bfloat16, SIZE> x;
    aie::vector<bfloat16, SIZE> y = aie::broadcast<bfloat16, SIZE>(1);
    for(int i=0; i< 1024/SIZE; i++) {
        x = aie::load_v<SIZE>(in_buffer);
        x = polynomial<SIZE>(x);
        x = bfloat16_mul<SIZE>(x, x); //added a multiplication here
        aie::store_v(out_buffer, x);
        in_buffer  += SIZE;
        out_buffer += SIZE;
    }
}

1 reply

giacomo-brunetta Oct 18, 2024
Author

Note: Initially, I suspected that aie::mul was the source of the issue. However, I encountered a similar bug with other operations involving bfloat16 vectors, which indicates that the problem likely originates elsewhere.

From my observations, operations that execute correctly within a single function body tend to cause issues when split across multiple inline functions. One potential cause could be the way vectors are returned, although this is still just a hypothesis.
