Failure to compile gemm_postop_addmatrix_benchmark.py
with
#2378
Comments
Related to #1716
Shared memory is required in order to convert layouts (by the
And let's focus on this snippet:
Here the
I think a possible solution is to back propagate the #mma layout from the
What Triton does instead is (eventually) to convert the #mma layout produced by the GEMM loop into a blocked layout, and it then keeps that blocked layout all the way to the final store. This works fine if the available shared memory size is sufficiently large. On our current GPU (PVC) the size of shared memory is about half what is needed to use a block size of 256x256.
If we instead maximized the #mma layout as I described above, no layout conversion would be necessary and therefore no shared memory would be required in order to run the kernel. At that point we should actually be able to generate 2D read/store operations for the tt.load and tt.store mentioned above.
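The shared memory constraint described in this comment can be checked with rough arithmetic. The sketch below is only an estimate: it assumes the layout conversion stages the whole 256x256 result tile as 4-byte (fp32) elements, and it assumes a 128 KiB shared memory capacity. Neither number is stated in this thread, and the helper names are hypothetical; the scratch buffer the compiler really allocates may be sized differently.

```python
def tile_bytes(rows: int, cols: int, elem_bytes: int) -> int:
    """Bytes needed to hold one rows x cols tile."""
    return rows * cols * elem_bytes


def fits_in_shared_memory(rows: int, cols: int, elem_bytes: int,
                          capacity_bytes: int) -> bool:
    """True if a full tile fits within the given shared memory capacity."""
    return tile_bytes(rows, cols, elem_bytes) <= capacity_bytes


# Assumption: 128 KiB of shared memory; not a figure taken from this thread.
ASSUMED_CAPACITY_BYTES = 128 * 1024

# Assumption: the converted tile holds fp32 values (4 bytes per element).
needed = tile_bytes(256, 256, 4)

print(f"tile needs {needed} bytes, assumed capacity {ASSUMED_CAPACITY_BYTES} bytes")
print("fits:", fits_in_shared_memory(256, 256, 4, ASSUMED_CAPACITY_BYTES))
```

Under these assumptions the tile needs twice the assumed capacity, which is consistent with the "about half" estimate above. Avoiding the layout conversion removes the need for this staging buffer altogether.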
The issue is caused by the
into:
Note that the pass materializes the following load using a tensor of ptrs:
This is the root of the problem. Subsequently that load is transformed to have a blocked layout by the
I think that ultimately we want to preserve the blocked pointers and prevent
A quick experiment reveals that removing
The tentative fix is to avoid rewriting blocked pointers if they are used by
In the longer term (after #2374 is fixed) we should be in a position to remove the
The gemm_postop_addmatrix_benchmark.py benchmark fails to compile with configuration:
because the compiler allocates shared memory buffers that exceed the available capacity. To reproduce:
USE_IPEX=0 python gemm_postop_addmatrix_benchmark.py
Notes
Performance of GEMM (no postOp)
Performance of GEMM + postOp (add matrix to result)