Call cost floor + FP16 speed boost / old GPUs #245
blefaudeux started this conversation in General
Replies: 2 comments 2 replies
-
Hmm, how are you measuring this? I tried a vector addition with tiny vectors
and saw that torch and triton had the same overhead:
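Since the thread doesn't show the benchmark itself, here's a minimal CPU-side sketch of the usual warm-up-then-average pattern for measuring per-call overhead (the `bench` helper is hypothetical, not Triton's API). On a real GPU you would also need to synchronize after the timed region, e.g. with `torch.cuda.synchronize()`, since kernel launches are asynchronous; `triton.testing.do_bench` handles that for you.

```python
import time

def bench(fn, warmup=10, iters=1000):
    # Warm up first so one-time costs (JIT compilation, autotuning,
    # cache misses) are excluded from the steady-state measurement.
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    # Average seconds per call; with a tiny workload this approximates
    # the fixed launch/call overhead rather than the compute time.
    return (time.perf_counter() - start) / iters

# Tiny workload: per-call overhead dominates the measurement.
overhead = bench(lambda: sum([1.0, 2.0, 3.0]))
```

Whether the 70us floor shows up can depend heavily on this methodology, which is why the measurement approach matters for both observations.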
-
Changing the benchmark method (moving outside of torch.utils, and confirmed by real-life measurements) solves both observations: fp16 gets a significant boost even on old HW, and the call cost disappears.
-
Poking around the fused softmax tutorial, and extending it a little to make it comparable to torch.softmax() (mostly by adding autograd support), I'm trying to understand how the perf compares to the highly optimized PyTorch CUDA kernels across a couple of axes. I'm curious to get some context around two observations, which stem from micro-benchmarks on a P100 (arguably a little old):
- there seems to be a time floor on any call to a Triton kernel: I'm seeing around 70us, even if the kernel was called just before and there's no autotune. Is that expected? The call time for PyTorch kernels can go much lower, towards a couple of us
- using the same softmax kernel on fp16 data requires very little change (see tl.exp() and torch.float16 crash unceremoniously #241), but it does not bring any speed benefit. This is somewhat expected compute-wise on a P100, but I would expect some bandwidth benefit, and testing against PyTorch this does show up (i.e. torch.softmax() on fp16 data gets to be a little faster than on fp32 on a P100). Is that expected, or could it be that some memory allocations are hardcoded to fp32 within the IR, for instance?
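For context on the bandwidth expectation, here's a back-of-the-envelope sketch of why fp16 should help even without fp16 compute units: a fused softmax is bandwidth-bound, so halving the bytes per element halves the memory traffic. The ~2x is an upper bound, assuming the kernel is purely bandwidth-bound and ignoring launch overhead; the numbers below are illustrative, not measured.

```python
def softmax_bytes(n_elems, dtype_bytes):
    # A fused softmax reads the input once and writes the output once,
    # so total memory traffic is ~2 * n * sizeof(dtype).
    return 2 * n_elems * dtype_bytes

n = 4096 * 4096
fp32_traffic = softmax_bytes(n, 4)  # 4 bytes per float32
fp16_traffic = softmax_bytes(n, 2)  # 2 bytes per float16

# At fixed memory bandwidth, the bandwidth-bound speedup from fp16
# is the ratio of bytes moved.
speedup = fp32_traffic / fp16_traffic  # -> 2.0
```

This is why torch.softmax() showing some fp16 gain on a P100 is the expected behavior, and a Triton kernel showing none suggests something in the pipeline (e.g. intermediate buffers) is still moving fp32-sized data.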
Thanks! cc @ptillet