Call cost floor + FP16 speed boost / old GPUs #245
blefaudeux started this conversation in General
Replies: 2 comments 2 replies
-
Hmm, how are you measuring this? I tried a vector addition with tiny vectors
and saw that torch and triton had the same overhead:
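Since the thread doesn't show the benchmark itself, here's a minimal CPU-side sketch of the usual warm-up-then-average pattern for measuring per-call overhead (the `bench` helper is hypothetical, not Triton's API). On a real GPU you would also need to synchronize after the timed region, e.g. with `torch.cuda.synchronize()`, since kernel launches are asynchronous; `triton.testing.do_bench` handles that for you.

```python
import time

def bench(fn, warmup=10, iters=1000):
    # Warm up first so one-time costs (JIT compilation, autotuning,
    # cache misses) are excluded from the steady-state measurement.
    for _ in range(warmup):
        fn()
    start = time.perf_counter()
    for _ in range(iters):
        fn()
    # Average seconds per call; with a tiny workload this approximates
    # the fixed launch/call overhead rather than the compute time.
    return (time.perf_counter() - start) / iters

# Tiny workload: per-call overhead dominates the measurement.
overhead = bench(lambda: sum([1.0, 2.0, 3.0]))
```

Whether the 70us floor shows up can depend heavily on this methodology, which is why the measurement approach matters for both observations.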
-
Changing the benchmark method (moving outside of torch.utils, and confirmed by real-life measurements) solves both observations: fp16 gets a significant boost even on old HW, and the call cost disappears.
-
Poking around the fused softmax tutorial, and extending it a little to make it comparable to torch.softmax() (mostly by adding autograd support), I'm trying to understand how the perf compares to the highly optimized PyTorch CUDA kernels across a couple of axes. I'm curious to get some context around two observations, which stem from micro-benchmarks on a P100 (arguably a little old):
- there seems to be a time floor on any call to a Triton kernel: I'm seeing around 70us, even if the kernel was called just before and there's no autotune. Is that expected? The call time for PyTorch kernels can go much lower, towards a couple of us
- using the same softmax kernel on fp16 data requires very little change (see tl.exp() and torch.float16 crash unceremoniously #241), but it does not bring any speed benefit. This is somewhat expected compute-wise on a P100, but I would expect some bandwidth benefit, and testing against PyTorch this does show up (i.e. torch.softmax() on fp16 data gets to be a little faster than on fp32 on a P100). Is that expected, or could it be that some memory allocations are hardcoded to fp32 within the IR, for instance?
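For context on the bandwidth expectation, here's a back-of-the-envelope sketch of why fp16 should help even without fp16 compute units: a fused softmax is bandwidth-bound, so halving the bytes per element halves the memory traffic. The ~2x is an upper bound, assuming the kernel is purely bandwidth-bound and ignoring launch overhead; the numbers below are illustrative, not measured.

```python
def softmax_bytes(n_elems, dtype_bytes):
    # A fused softmax reads the input once and writes the output once,
    # so total memory traffic is ~2 * n * sizeof(dtype).
    return 2 * n_elems * dtype_bytes

n = 4096 * 4096
fp32_traffic = softmax_bytes(n, 4)  # 4 bytes per float32
fp16_traffic = softmax_bytes(n, 2)  # 2 bytes per float16

# At fixed memory bandwidth, the bandwidth-bound speedup from fp16
# is the ratio of bytes moved.
speedup = fp32_traffic / fp16_traffic  # -> 2.0
```

This is why torch.softmax() showing some fp16 gain on a P100 is the expected behavior, and a Triton kernel showing none suggests something in the pipeline (e.g. intermediate buffers) is still moving fp32-sized data.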
Thanks! cc @ptillet