
Improve performance of tt.load and tt.store for FP8 when converting block ptr to regular ptrs #2374

Closed
etiotto opened this issue Sep 27, 2024 · 2 comments · Fixed by #2502, #2514 or #2534

etiotto commented Sep 27, 2024

We would like to remove the RewriteTensorPointer pass, which rewrites block pointers into regular pointers (except when it determines that load/store operations on block pointers can be converted to 2D block reads/writes). The idea is to avoid losing semantic information too early and instead handle block pointers that cannot be used to generate 2D block reads/writes when the load/store operation is lowered.

For this scheme to work, we first need to improve the lowering code for tt.load and tt.store operations that use a block pointer with an element type that is not (currently) supported by the 2D block read instructions available on the target GPU (e.g. FP8).
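
For reference, the sketch below (kernel name, launch shapes, and the `xpu` device string are illustrative assumptions, not taken from any tutorial) shows the kind of tt.load/tt.store through a block pointer with an FP8 element type that the improved lowering has to handle when 2D block reads/writes are not available:

```python
import torch
import triton
import triton.language as tl


@triton.jit
def copy_tile_kernel(in_ptr, out_ptr, M, N, stride_m, stride_n,
                     BLOCK_M: tl.constexpr, BLOCK_N: tl.constexpr):
    pid_m = tl.program_id(0)
    pid_n = tl.program_id(1)
    # Block pointers: these become tt.make_tensor_ptr plus tt.load/tt.store
    # on block pointers in the Triton IR.
    src = tl.make_block_ptr(base=in_ptr, shape=(M, N),
                            strides=(stride_m, stride_n),
                            offsets=(pid_m * BLOCK_M, pid_n * BLOCK_N),
                            block_shape=(BLOCK_M, BLOCK_N), order=(1, 0))
    dst = tl.make_block_ptr(base=out_ptr, shape=(M, N),
                            strides=(stride_m, stride_n),
                            offsets=(pid_m * BLOCK_M, pid_n * BLOCK_N),
                            block_shape=(BLOCK_M, BLOCK_N), order=(1, 0))
    tile = tl.load(src, boundary_check=(0, 1))   # tt.load on a block ptr
    tl.store(dst, tile, boundary_check=(0, 1))   # tt.store on a block ptr


# With an FP8 element type (requires a PyTorch build with float8 support) the
# 2D block read path may not apply, so the lowering falls back to regular
# loads/stores; making that fallback efficient is the point of this issue.
x = torch.randn(1024, 1024, device="xpu").to(torch.float8_e5m2)
y = torch.empty_like(x)
grid = (1024 // 64, 1024 // 64)
copy_tile_kernel[grid](x, y, 1024, 1024, x.stride(0), x.stride(1),
                       BLOCK_M=64, BLOCK_N=64)
```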

See #2359 (comment) for more context.

etiotto commented Oct 9, 2024

The first step is to improve the axis analysis and add support for block pointers to it (#2451).
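
As a rough illustration only, the snippet below is a deliberately simplified, hypothetical model (plain Python, not part of Triton) of the per-dimension facts the axis analysis would need to derive from a block pointer's operands in order to drive coalescing; the real analysis is more involved:

```python
# Hypothetical helper, not part of Triton: a toy model of the per-dimension
# facts (contiguity, divisibility) an axis analysis could derive from the
# operands of a block pointer.
def block_ptr_axis_info(block_shape, strides, base_alignment_bytes=16, elem_bytes=1):
    facts = []
    for dim, (size, stride) in enumerate(zip(block_shape, strides)):
        unit_stride = stride == 1
        # A unit-stride dimension is contiguous across the whole block tile.
        contiguity = size if unit_stride else 1
        # Divisibility in elements: from the base alignment on the unit-stride
        # dimension, from the stride itself otherwise (a simplification).
        divisibility = base_alignment_bytes // elem_bytes if unit_stride else stride
        facts.append({"dim": dim, "contiguity": contiguity,
                      "divisibility": divisibility})
    return facts


# A 64x64 FP8 tile with row-major strides (1024, 1): the last dimension is
# contiguous, which is what allows the loads to be coalesced.
print(block_ptr_axis_info(block_shape=(64, 64), strides=(1024, 1)))
```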

etiotto commented Oct 21, 2024

A reduced test derived from tutorial 06 now performs better when we coalesce block pointers directly than when we rewrite them to regular pointers and then coalesce:

Reduced attn test:

(screenshot of the reduced attn test omitted)

Rewrite block pointers to regular pointers and then coalesce:

create kernel:_attn_fwd
fused-attention-batch4-head32-d64-fwd-causal=True:
    N_CTX  Triton [FP8]
0  1024.0     126.31971

Avoid the rewrite and coalesce block pointers directly:

create kernel:_attn_fwd
fused-attention-batch4-head32-d64-fwd-causal=True:
    N_CTX  Triton [FP8]
0  1024.0    135.596459

The performance of tutorial 06 (unmodified) is still not up to par: the axis info analysis cannot yet detect contiguity for all block pointers in the kernel, so some of them are not coalesced.
