Port "sub-group transpose reduction" to default path #2266
Comments
Working on always generating no-op reshape ops.
Adding lit tests locally. Pass 1.0 is in good shape.
Pass working for
After integrating all of the changes we have up to this point, this is the reported speedup, calculated as
Mean:
… to pipeline

Add the `-tritonintelgpu-optimize-reduction-locality` pass to the pipeline if the `TRITON_INTEL_OPTIMIZE_REDUCTION_LOCALITY` environment variable is set to 1. As shown in intel#2266, this pass gives quite promising results, although there is still room for improvement. Conditionally enabling it will greatly help performance investigation.

Signed-off-by: victor-eds <[email protected]>
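To make the gating described in this commit message concrete, here is a minimal sketch in Python. It is not the actual backend code: `build_pass_list`, the `base_passes` argument, and the choice to append the pass at the end of the list are all assumptions for illustration. Only the pass name and the environment variable name come from the commit message.

```python
import os

# Names taken from the commit message above.
REDUCTION_LOCALITY_PASS = "tritonintelgpu-optimize-reduction-locality"
ENV_FLAG = "TRITON_INTEL_OPTIMIZE_REDUCTION_LOCALITY"


def build_pass_list(base_passes, env=None):
    """Return the pass list, adding the reduction-locality pass when enabled.

    `build_pass_list` and `base_passes` are hypothetical; the real pipeline
    decides where the pass is inserted, which this sketch does not model.
    """
    env = os.environ if env is None else env
    passes = list(base_passes)  # copy so the caller's list is not mutated
    # The pass is opt-in: only the exact value "1" enables it.
    if env.get(ENV_FLAG, "0") == "1":
        passes.append(REDUCTION_LOCALITY_PASS)
    return passes
```

Passing `env` explicitly, rather than reading `os.environ` inside the condition, keeps the sketch easy to test with and without the flag set.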
Looks promising! Thanks @victor-eds.
I have filed a series of follow-up issues I'd like to address before a deeper performance investigation. I'll add these as a comment in this issue. I would, however, call this issue done when the outstanding PRs are merged, as the new pass is better (and generates better code) than the one in the advanced path.
#2109 explores layout conversion in the advanced path to improve reduction performance (see #1637 for investigation). Porting this to the default path would involve a transformation similar to (after heuristics to check profitability):
1. `tt.reshape`
2. `tt.reduce`
3. `triton_gpu.convert_layout`
4. `tt.reduce`
5. `triton_gpu.convert_layout`

Note that step 5 can be dropped if the new layout is beneficial for performance.
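The arithmetic behind this sequence can be sketched in plain Python. This models only the values, under the assumption that the reshape splits the reduced dimension in two and that each `tt.reduce` then reduces one of the resulting dimensions. The layout conversions change data placement, not values, so they appear only as comments. The function name `two_stage_sum` and the `inner` parameter are hypothetical.

```python
def two_stage_sum(row, inner):
    """Sum `row` in two stages, mirroring reshape -> reduce -> reduce.

    `inner` is the size of the innermost dimension after the reshape;
    it must evenly divide len(row).
    """
    if inner <= 0 or len(row) % inner != 0:
        raise ValueError("inner must be positive and divide len(row)")
    # Step 1 (reshape): view the N elements as an (N // inner, inner) grid.
    grid = [row[i:i + inner] for i in range(0, len(row), inner)]
    # Step 2 (first reduce): reduce along the inner dimension.
    partial = [sum(chunk) for chunk in grid]
    # Step 3 (layout conversion) would go here; it does not change values.
    # Step 4 (second reduce): reduce the partial results.
    total = sum(partial)
    # Step 5 (layout conversion back) would go here; also value-preserving.
    return total
```

For integer inputs the result equals a direct `sum(row)`. With floating-point inputs the two-stage order can differ from a single-pass sum by rounding.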