triton.language.reshape is broken #509

khaotik · 2022-05-05T08:43:29Z

khaotik
May 5, 2022

I encountered this bug while working with my fork & Apache MXNet. Using triton.language.reshape would throw compilation error. I can't tell where I broke the package, or perhaps it's bugged on master branch.

import numpy as np
import mxnet as mx
import triton
import triton.language as tl

# (x,y,z) -> (x,z)
@triton.jit
def kernRowReduceAxis1(
        d0, s0, x_stride, y_stride, BLOCK_SIZE: tl.constexpr):
    idx_x = tl.program_id(0)
    idx_yz = tl.arange(0, BLOCK_SIZE)
    idx_z = idx_yz % y_stride
    sz_y,sz_z = x_stride//y_stride, x_stride%y_stride
    rmask = (idx_yz < x_stride)
    x = tl.load(s0 + idx_x*x_stride, mask=rmask)
    x = tl.reshape(x, [sz_y, sz_z])
    xsum = tl.sum(x, axis=0)
    wmask = (idx_y % y_stride == 0)
    tl.store(d0 + idx_x*x_stride + idx_z, xsum, mask=wmask)

def AS_BLOCK_SIZE(i:int):
    return 1<<((i-1).bit_length())

GPU0 = mx.gpu()
N,W,C = 256, 384, 96
v_x = mx.nd.random_uniform(0., 1., shape=(N,W,C), dtype=np.float32, ctx=GPU0)
v_y = mx.nd.empty(shape=(N,C,), dtype=np.float32, ctx=GPU0)

# make sure async init op finishes
hash(str(v_x))
hash(str(v_y))
# call jit
kernRowReduceAxis1[(N,)](v_y, v_x, W*C, C, AS_BLOCK_SIZE(W*C))

Running the above script gives:

Traceback (most recent call last):
  File "/media/LNXDATA/WKSP/dev/triton/python/triton/code_gen.py", line 1397, in _compile
    generator.visit(self.parse())
  File "/media/LNXDATA/WKSP/dev/triton/python/triton/code_gen.py", line 923, in visit
    return super().visit(node)
  File "/home/khaotik/anaconda3/envs/mxnet/lib/python3.10/ast.py", line 410, in visit
    return visitor(node)
  File "/media/LNXDATA/WKSP/dev/triton/python/triton/code_gen.py", line 430, in visit_Module
    ast.NodeVisitor.generic_visit(self, node)
  File "/home/khaotik/anaconda3/envs/mxnet/lib/python3.10/ast.py", line 418, in generic_visit
    self.visit(item)
  File "/media/LNXDATA/WKSP/dev/triton/python/triton/code_gen.py", line 923, in visit
    return super().visit(node)
  File "/home/khaotik/anaconda3/envs/mxnet/lib/python3.10/ast.py", line 410, in visit
    return visitor(node)
  File "/media/LNXDATA/WKSP/dev/triton/python/triton/code_gen.py", line 491, in visit_FunctionDef
    has_ret = self.visit_compound_statement(node.body)
  File "/media/LNXDATA/WKSP/dev/triton/python/triton/code_gen.py", line 424, in visit_compound_statement
    self.last_ret = self.visit(stmt)
  File "/media/LNXDATA/WKSP/dev/triton/python/triton/code_gen.py", line 923, in visit
    return super().visit(node)
  File "/home/khaotik/anaconda3/envs/mxnet/lib/python3.10/ast.py", line 410, in visit
    return visitor(node)
  File "/media/LNXDATA/WKSP/dev/triton/python/triton/code_gen.py", line 535, in visit_Assign
    values = self.visit(node.value)
  File "/media/LNXDATA/WKSP/dev/triton/python/triton/code_gen.py", line 923, in visit
    return super().visit(node)
  File "/home/khaotik/anaconda3/envs/mxnet/lib/python3.10/ast.py", line 410, in visit
    return visitor(node)
  File "/media/LNXDATA/WKSP/dev/triton/python/triton/code_gen.py", line 862, in visit_Call
    ret = fn(*args, _builder=self.builder, **kws)
  File "/media/LNXDATA/WKSP/dev/triton/python/triton/language/core.py", line 44, in wrapper
    return fn(*args, **kwargs)
  File "/media/LNXDATA/WKSP/dev/triton/python/triton/language/core.py", line 702, in reshape
    shape = [x.value for x in shape]
  File "/media/LNXDATA/WKSP/dev/triton/python/triton/language/core.py", line 702, in <listcomp>
    shape = [x.value for x in shape]
AttributeError: 'tensor' object has no attribute 'value'

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/media/LNXDATA/WKSP/tritontest/bug.py", line 35, in <module>
    kernRowReduceAxis1[(N,)](v_y, v_x, W*C, C, AS_BLOCK_SIZE(W*C))
  File "/media/LNXDATA/WKSP/dev/triton/python/triton/code_gen.py", line 1102, in __call__
    return self.kernel(*wargs, **kwargs, grid=self.grid)
  File "/media/LNXDATA/WKSP/dev/triton/python/triton/code_gen.py", line 1091, in __call__
    return _triton.runtime.launch(wargs, self.fn.do_not_specialize, cache_key, self.fn.arg_names,
  File "/media/LNXDATA/WKSP/dev/triton/python/triton/code_gen.py", line 1070, in add_to_cache
    return self.fn._warmup(key, arg_types=arg_types, device=device_idx, attributes=attributes, constants=constants, num_warps=num_warps, num_stages=num_stages, is_manual_warmup=False)
  File "/media/LNXDATA/WKSP/dev/triton/python/triton/code_gen.py", line 1373, in _warmup
    binary = self._compile(**compile)
  File "/media/LNXDATA/WKSP/dev/triton/python/triton/code_gen.py", line 1402, in _compile
    raise CompilationError(self.src, node) from e
triton.code_gen.CompilationError: at 9:29:
def kernRowReduceAxis1(
        d0, s0, x_stride, y_stride, BLOCK_SIZE: tl.constexpr):
    idx_x = tl.program_id(0)
    idx_yz = tl.arange(0, BLOCK_SIZE)
    idx_z = idx_yz % y_stride
    sz_y,sz_z = x_stride//y_stride, x_stride%y_stride
    rmask = (idx_yz < x_stride)
    x = tl.load(s0 + idx_x*x_stride, mask=rmask)
    x = tl.reshape(x, [sz_y, sz_z])

Switching tl.reshape(x, [sz_y, sz_z]) into tl.reshape(x, (sz_y, sz_z)) gives a different error:

......
  File "/media/LNXDATA/WKSP/dev/triton/python/triton/language/core.py", line 702, in <listcomp>
    shape = [x.value for x in shape]
  File "/media/LNXDATA/WKSP/dev/triton/python/triton/language/core.py", line 42, in wrapper
    raise ValueError("Did you forget to add @triton.jit ? (`_builder` argument must be provided outside of JIT functions.)")
ValueError: Did you forget to add @triton.jit ? (`_builder` argument must be provided outside of JIT functions.)

The above exception was the direct cause of the following exception:

Traceback (most recent call last):
  File "/media/LNXDATA/WKSP/tritontest/bug.py", line 35, in <module>
    kernRowReduceAxis1[(N,)](v_y, v_x, W*C, C, AS_BLOCK_SIZE(W*C))
  File "/media/LNXDATA/WKSP/dev/triton/python/triton/code_gen.py", line 1102, in __call__
    return self.kernel(*wargs, **kwargs, grid=self.grid)
  File "/media/LNXDATA/WKSP/dev/triton/python/triton/code_gen.py", line 1091, in __call__
    return _triton.runtime.launch(wargs, self.fn.do_not_specialize, cache_key, self.fn.arg_names,
  File "/media/LNXDATA/WKSP/dev/triton/python/triton/code_gen.py", line 1070, in add_to_cache
    return self.fn._warmup(key, arg_types=arg_types, device=device_idx, attributes=attributes, constants=constants, num_warps=num_warps, num_stages=num_stages, is_manual_warmup=False)
  File "/media/LNXDATA/WKSP/dev/triton/python/triton/code_gen.py", line 1373, in _warmup
    binary = self._compile(**compile)
  File "/media/LNXDATA/WKSP/dev/triton/python/triton/code_gen.py", line 1402, in _compile
    raise CompilationError(self.src, node) from e
triton.code_gen.CompilationError: at 9:29:
def kernRowReduceAxis1(
        d0, s0, x_stride, y_stride, BLOCK_SIZE: tl.constexpr):
    idx_x = tl.program_id(0)
    idx_yz = tl.arange(0, BLOCK_SIZE)
    idx_z = idx_yz % y_stride
    sz_y,sz_z = x_stride//y_stride, x_stride%y_stride
    rmask = (idx_yz < x_stride)
    x = tl.load(s0 + idx_x*x_stride, mask=rmask)
    x = tl.reshape(x, (sz_y, sz_z))

triton.language.reshape has no unit test. Also I can't find usage of triton.language.reshape in a github global search, so I can't tell whether this part is currently broken on master. Any help on how to fix this bug would be greatly appreciated.

ptillet · 2022-05-05T08:48:34Z

ptillet
May 5, 2022
Maintainer

The error message should be better :) Not on my workstation right now but can you try:

    sz_y: tl.constexpr = x_stride//y_stride
    sz_z: tl.constexpr = x_stride%y_stride

Normal assignment removes the constexpr qualifier, which is necessary for triton tensor shapes

1 reply

khaotik May 5, 2022
Author

Thanks for the hint. I've added more constexprs and fixed a bug. Now my kernel looks like:

@triton.jit
def kernRowReduceAxis1(
        d0, s0, x_stride:tl.constexpr, y_stride:tl.constexpr, BLOCK_SIZE: tl.constexpr):
    idx_x = tl.program_id(0)
    idx_yz = tl.arange(0, BLOCK_SIZE)
    idx_z = idx_yz % y_stride
    sz_y: tl.constexpr = x_stride // y_stride
    sz_z: tl.constexpr = y_stride
    rmask = (idx_yz < x_stride)
    x = tl.load(s0 + idx_x*x_stride, mask=rmask)
    x = tl.reshape(x, (sz_y, sz_z))
    xsum = tl.sum(x, axis=0)
    wmask = (idx_y % y_stride == 0)
    tl.store(d0 + idx_x*x_stride + idx_z, xsum, mask=wmask)

Still got a bug:

......
  File "/media/LNXDATA/WKSP/dev/triton/python/triton/language/core.py", line 41, in wrapper
    return fn(*args, **kwargs)
  File "/media/LNXDATA/WKSP/dev/triton/python/triton/language/core.py", line 698, in reshape
    return semantic.reshape(input, shape, _builder)
  File "/media/LNXDATA/WKSP/dev/triton/python/triton/language/semantic.py", line 461, in reshape
    if input.type.numel != numel:
AttributeError: 'dtype' object has no attribute 'numel'. Did you mean: 'name'?
......

Some debugger outputs:

> /media/LNXDATA/WKSP/dev/triton/python/triton/language/semantic.py(461)reshape()
-> if input.type.numel != numel:
(Pdb) args
input = <triton.language.core.tensor object at 0x7fad133a31f0>
dst_shape = [384, 96]
builder = <triton._C.libtriton.triton.ir.builder object at 0x7fad133bf4c0>
(Pdb) input.type
triton.language.fp32

Maybe I'm missing something obvious? Still trying to grasp the code atm.

khaotik · 2022-05-06T03:28:59Z

khaotik
May 6, 2022
Author

Okay I've fixed a few more issues regarding my kernel.

#!/usr/bin/env python

import numpy as np
import mxnet as mx
import triton
import triton.language as tl

# (x,y,z) -> (x,z)
@triton.jit
def kernRowReduceAxis1(
        d0, s0,
        x_stride:tl.constexpr,
        y_stride:tl.constexpr,
        BLKSZ_Y: tl.constexpr,
        BLKSZ_Z: tl.constexpr,
        ):
    idx_x = tl.program_id(0)
    idx_yz = tl.arange(0, BLKSZ_Y * BLKSZ_Z)
    idx_z = idx_yz % BLKSZ_Z
    idx_y = idx_yz // BLKSZ_Z
    rmask_z = (idx_z < y_stride)
    rmask_y = (idx_y < (x_stride // y_stride))
    rmask = rmask_z * rmask_y
    x = tl.load(s0 + idx_x*x_stride + idx_y*y_stride + idx_z, mask=rmask)
    x = tl.reshape(x, (BLKSZ_Y, BLKSZ_Z))
    xsum = tl.sum(x, axis=0)
    wmask = (idx_y == 0) * rmask_z
    tl.store(d0 + idx_x*y_stride + idx_z, xsum, mask=wmask)

def AS_BLOCK_SIZE(i:int):
    return 1<<((i-1).bit_length())

GPU0 = mx.gpu()
N,W,C = 256, 384, 256
v_x = mx.nd.random_uniform(0., 1., shape=(N,W,C), dtype=np.float32, ctx=GPU0)
v_y = mx.nd.empty(shape=(N,C,), dtype=np.float32, ctx=GPU0)

# make sure async init op finishes
hash(str(v_x))
hash(str(v_y))
print('[D] begin kernel launch')
kernRowReduceAxis1[(N,)](
    v_y, v_x,
    W*C, C,
    AS_BLOCK_SIZE(W),
    AS_BLOCK_SIZE(C))
print('[D] end kernel launch')

Running the above script gives segfault.

[D] begin kernel launch

Fatal Error: Segmentation fault
Stack trace:
Segmentation fault (core dumped)

The vecadd kernel works on MXNet so far. To separate debugging scope, I'll work on master branch after I've get my torch env working.

0 replies

ptillet · 2022-05-06T04:58:15Z

ptillet
May 6, 2022
Maintainer

Hello!

Sorry for the delay. I think you are seeing a segfault because the shapes in the store are incompatible. xsum has shape BLKSZ_Z and the pointer (because of idx_z) and mask have shape BLKSZ_Y * BLKSZ_Z. If I replace the lsat line with

tl.store(d0 + idx_x*y_stride + tl.arange(0, BLKSZ_Z), xsum)

the program runs. Definitely a bug that you're not getting a better error message. All the frontend-level compiler error messages inside of Triton should be double-checked, because clearly it's not user-friendly enough at the moment.

As for the practical uses of reshape, there are used implicitly in the broadcast (e.g., [:, None]). We have some fancier use cases at OpenAI too. I wouldn't consider reshape to be production-ready at the moment, but this will change as soon as the MLIR backend rewrite is complete. There are a lot of instabilities we plan to fix after the backend abstractions are rewritten.

PS: sorry for all the trouble, Triton should have better error messages.

1 reply

khaotik May 6, 2022
Author

Thanks for the clarification and now I see how triton programming model differs from raw CUDA. I'm still running into verification errors similar to #480. I'll take my time waiting for the update.

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

triton.language.reshape is broken #509

{{title}}

Replies: 3 comments 2 replies

{{title}}

{{title}}

{{title}}

{{title}}

{{title}}

Select a reply

triton.language.reshape is broken #509

khaotik May 5, 2022

Replies: 3 comments · 2 replies

ptillet May 5, 2022 Maintainer

khaotik May 5, 2022 Author

khaotik May 6, 2022 Author

ptillet May 6, 2022 Maintainer

khaotik May 6, 2022 Author

khaotik
May 5, 2022

Replies: 3 comments 2 replies

ptillet
May 5, 2022
Maintainer

khaotik May 5, 2022
Author

khaotik
May 6, 2022
Author

ptillet
May 6, 2022
Maintainer

khaotik May 6, 2022
Author