Port "sub-group transpose reduction" to default path #2266

Open
victor-eds opened this issue Sep 17, 2024 · 7 comments · Fixed by #2491, #2511, #2521 or #2531 · May be fixed by #2553

@victor-eds (Contributor)

#2109 explores layout conversion in the advanced path to improve reduction performance (see #1637 for the investigation). Porting this to the default path would involve a transformation similar to the following, applied after heuristics check for profitability (a rough model is sketched below the list):

  1. Reshape the input tensor so that no data movement is needed and elements can be reduced within the work-item (`tt.reshape`)
  2. Perform the reduction within the work-item (`tt.reduce`)
  3. Convert the layout so that a transposition within the sub-group, as explained in the investigation, is performed (`triton_gpu.convert_layout`)
  4. Finalize the reduction, within the work-item and possibly within the work-group (`tt.reduce`)
  5. Convert back to the initial layout (`triton_gpu.convert_layout`)

Note that step 5 can be dropped if the new layout is beneficial for performance.
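As a rough model of the dataflow in steps 1–4, here is a minimal NumPy sketch; the shapes, the sub-group size, and the use of NumPy in place of Triton IR are all assumptions for illustration only:

```python
import numpy as np

SUB_GROUP_SIZE = 16  # assumption: a typical Intel sub-group size
rows, cols = 16, 64  # assumption: the tensor is reduced along its last axis

x = np.random.rand(rows, cols).astype(np.float32)
reference = x.sum(axis=1)

# Step 1 (tt.reshape): split the reduced axis so each lane first reduces
# elements it already owns -- no cross-lane data movement is required.
reshaped = x.reshape(rows, cols // SUB_GROUP_SIZE, SUB_GROUP_SIZE)

# Step 2 (tt.reduce): partial reduction within the work-item.
partial = reshaped.sum(axis=1)  # shape: (rows, SUB_GROUP_SIZE)

# Step 3 (triton_gpu.convert_layout): model the sub-group transposition, after
# which each lane holds all the partials belonging to one output row.
transposed = partial.T

# Step 4 (tt.reduce): finalize the reduction.
result = transposed.sum(axis=0)

assert np.allclose(result, reference, rtol=1e-5)
```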

@victor-eds victor-eds self-assigned this Sep 17, 2024
@vlad-penkin vlad-penkin added this to the 4.0 [Performance] Core milestone Sep 17, 2024
@victor-eds victor-eds changed the title Port #2109 to default path Port "sub-group transpose reduction" to default path Sep 18, 2024
@victor-eds victor-eds removed their assignment Sep 18, 2024
@victor-eds victor-eds self-assigned this Sep 30, 2024
@victor-eds (Contributor, Author)

Working on always generating reshape ops that are NOPs (no data movement).
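Loosely speaking, a reshape is a "NOP" when it does not move data. A NumPy analogue (my illustration, not the pass itself) is a reshape that yields a view over the same buffer:

```python
import numpy as np

x = np.arange(16 * 64, dtype=np.float32).reshape(16, 64)
y = x.reshape(16, 4, 16)  # contiguous reshape: a view, not a copy

# No data was moved; both arrays alias the same underlying buffer.
assert np.shares_memory(x, y)
```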

@victor-eds (Contributor, Author)

Adding lit tests locally. The first version of the pass is in good shape.

@victor-eds (Contributor, Author)

The pass works for the `repCluster[0] = 2` case. Working on `repCluster[0] > 2` (the real-world cases): the pass itself works fine there (finalizing tests), and I will also work on layout-conversion optimizations.

@victor-eds (Contributor, Author) commented Oct 24, 2024

After integrating all of the changes we have up to this point, this is the reported speedup, calculated as xetla_speedup_slm / xetla_speedup_baseline, where xetla_speedup_X = triton_mean_gbps / xetla_mean_gbps for each configuration X (slm vs. baseline). As this pass only matches FlashAttention (FA) for now, we focus on that workload. Note that the pass can be extended to support more reductions, and there is still room for improvement.

     Z   H  N_CTX  D_HEAD  CAUSAL         Speedup
0    1  16  16384     128   False  0.994840066037
1    1  16  16384     128    True  0.998617503047
2    1  32  16384      64   False  0.998388486551
3    1  32  16384      64    True   1.21945316547
4    2  16   8192     128   False  0.996481552534
5    2  16   8192     128    True  0.986066253947
6    2  32   8192      64   False  0.983668089797
7    2  32   8192      64    True   1.19552397853
8    4  16   4096     128   False  0.992287982604
9    4  16   4096     128    True  0.975264209835
10   4  32   4096      64   False  0.960293901543
11   4  32   4096      64    True   1.17439716888
12   8  16   2048     128   False  0.983495166614
13   8  16   2048     128    True   1.00029389437
14   8  32   2048      64   False    1.0007848101
15   8  32   2048      64    True   1.20433393254
16  16  16   1024     128   False   0.96880814982
17  16  16   1024     128    True   1.01326103123
18  16  32   1024      64   False  0.950964800075
19  16  32   1024      64    True   1.08050768492
20  32  16    512     128   False  0.979909264277
21  32  16    512     128    True   1.01801789953
22  32  32    512      64   False  0.954538727308
23  32  32    512      64    True    1.0598195142
24   4  48   1024      64   False  0.993554608016
25   4  48   1024      64    True   1.16116589681

Mean: 1.03248991302
Max: 1.21945316547
Min: 0.950964800075
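
For reference, a minimal sketch of the metric with made-up throughput numbers (the GB/s values below are assumptions; only the formula comes from the comment above). When both runs share the same XeTLA reference, the metric reduces to a direct Triton-vs-Triton speedup:

```python
# Assumed throughputs, for illustration only (GB/s).
triton_slm_gbps = 410.0       # Triton with the new pass ("slm")
triton_baseline_gbps = 400.0  # Triton without it ("baseline")
xetla_gbps = 500.0            # shared XeTLA reference

xetla_speedup_slm = triton_slm_gbps / xetla_gbps
xetla_speedup_baseline = triton_baseline_gbps / xetla_gbps
speedup = xetla_speedup_slm / xetla_speedup_baseline

# With a shared XeTLA reference, the XeTLA term cancels out.
assert abs(speedup - triton_slm_gbps / triton_baseline_gbps) < 1e-12
print(f"Speedup: {speedup:.4f}")  # -> Speedup: 1.0250
```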

@victor-eds victor-eds reopened this Oct 24, 2024
victor-eds added a commit to victor-eds/intel-xpu-backend-for-triton that referenced this issue Oct 24, 2024
… to pipeline

Add the `-tritonintelgpu-optimize-reduction-locality` pass to the pipeline if the
`TRITON_INTEL_OPTIMIZE_REDUCTION_LOCALITY` environment variable is set to 1.

As shown in intel#2266, this pass gives quite promising results, although there is still
room for improvement. Conditionally enabling it will greatly help performance
investigation.

Signed-off-by: victor-eds <[email protected]>
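
A hypothetical usage sketch of the opt-in mechanism described in the commit above (it assumes the backend reads the variable when building the pass pipeline, so it is set before importing Triton):

```python
import os

# Assumption: the Intel XPU backend reads this variable when it builds the
# pass pipeline, so set it before Triton compiles any kernels.
os.environ["TRITON_INTEL_OPTIMIZE_REDUCTION_LOCALITY"] = "1"

import triton  # compile kernels as usual; the pass is now in the pipeline
```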
@etiotto (Contributor) commented Oct 24, 2024

> After integrating all of the changes we have up to this point, this is the reported speedup […] Mean: 1.03248991302 Max: 1.21945316547 Min: 0.950964800075

Looks promising! Thanks @victor-eds.
I am looking forward to enabling this by default after you have a chance to test it on other workloads.

@victor-eds (Contributor, Author)

> I am looking forward to enabling this by default after you have a chance to test it on other workloads.

I have filed a series of follow-up issues I'd like to address before a deeper performance investigation. I'll add them as a comment in this issue.

However, I would call this issue done once the outstanding PRs are merged, as the new pass is better (and generates better code) than the one in the advanced path.
