Port "sub-group transpose reduction" to default path #2266

Open
victor-eds opened this issue Sep 17, 2024 · 7 comments · Fixed by #2491, #2511, #2521 or #2531 · May be fixed by #2553

@victor-eds (Contributor)

#2109 explores layout conversion in the advanced path to improve reduction performance (see #1637 for the investigation). Porting this to the default path would involve a transformation similar to the following, applied after heuristics check for profitability (a rough model is sketched below the list):

  1. Reshape the input tensor so that no data movement is needed and elements can be reduced within the work-item (`tt.reshape`)
  2. Perform the reduction within the work-item (`tt.reduce`)
  3. Convert the layout so that a transposition within the sub-group, as explained in the investigation, is performed (`triton_gpu.convert_layout`)
  4. Finalize the reduction, within the work-item and possibly within the work-group (`tt.reduce`)
  5. Convert back to the initial layout (`triton_gpu.convert_layout`)

Note that step 5 can be dropped if the new layout is beneficial for performance.
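As a rough model of the dataflow in steps 1–4, here is a minimal NumPy sketch; the shapes, the sub-group size, and the use of NumPy in place of Triton IR are all assumptions for illustration only:

```python
import numpy as np

SUB_GROUP_SIZE = 16  # assumption: a typical Intel sub-group size
rows, cols = 16, 64  # assumption: the tensor is reduced along its last axis

x = np.random.rand(rows, cols).astype(np.float32)
reference = x.sum(axis=1)

# Step 1 (tt.reshape): split the reduced axis so each lane first reduces
# elements it already owns -- no cross-lane data movement is required.
reshaped = x.reshape(rows, cols // SUB_GROUP_SIZE, SUB_GROUP_SIZE)

# Step 2 (tt.reduce): partial reduction within the work-item.
partial = reshaped.sum(axis=1)  # shape: (rows, SUB_GROUP_SIZE)

# Step 3 (triton_gpu.convert_layout): model the sub-group transposition, after
# which each lane holds all the partials belonging to one output row.
transposed = partial.T

# Step 4 (tt.reduce): finalize the reduction.
result = transposed.sum(axis=0)

assert np.allclose(result, reference, rtol=1e-5)
```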

@victor-eds victor-eds self-assigned this Sep 17, 2024
@vlad-penkin vlad-penkin added this to the 4.0 [Performance] Core milestone Sep 17, 2024
@victor-eds victor-eds changed the title Port #2109 to default path Port "sub-group transpose reduction" to default path Sep 18, 2024
@victor-eds victor-eds removed their assignment Sep 18, 2024
@victor-eds victor-eds self-assigned this Sep 30, 2024
@victor-eds (Contributor, Author)

Working on always generating reshape ops that are NOPs (no data movement).
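Loosely speaking, a reshape is a "NOP" when it does not move data. A NumPy analogue (my illustration, not the pass itself) is a reshape that yields a view over the same buffer:

```python
import numpy as np

x = np.arange(16 * 64, dtype=np.float32).reshape(16, 64)
y = x.reshape(16, 4, 16)  # contiguous reshape: a view, not a copy

# No data was moved; both arrays alias the same underlying buffer.
assert np.shares_memory(x, y)
```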

@victor-eds (Contributor, Author)

Adding lit tests locally. The first version of the pass is in good shape.

@victor-eds (Contributor, Author)

The pass works for the `repCluster[0] = 2` case. Working on `repCluster[0] > 2` (the real-world cases): the pass itself works fine there (finalizing tests), and I will also work on layout-conversion optimizations.

@victor-eds (Contributor, Author) commented Oct 24, 2024

After integrating all of the changes we have up to this point, this is the reported speedup, calculated as xetla_speedup_slm / xetla_speedup_baseline, where xetla_speedup_X = triton_mean_gbps / xetla_mean_gbps for each configuration X (slm vs. baseline). As this pass only matches FlashAttention (FA) for now, we focus on that workload. Note that the pass can be extended to support more reductions, and there is still room for improvement.

     Z   H  N_CTX  D_HEAD  CAUSAL         Speedup
0    1  16  16384     128   False  0.994840066037
1    1  16  16384     128    True  0.998617503047
2    1  32  16384      64   False  0.998388486551
3    1  32  16384      64    True   1.21945316547
4    2  16   8192     128   False  0.996481552534
5    2  16   8192     128    True  0.986066253947
6    2  32   8192      64   False  0.983668089797
7    2  32   8192      64    True   1.19552397853
8    4  16   4096     128   False  0.992287982604
9    4  16   4096     128    True  0.975264209835
10   4  32   4096      64   False  0.960293901543
11   4  32   4096      64    True   1.17439716888
12   8  16   2048     128   False  0.983495166614
13   8  16   2048     128    True   1.00029389437
14   8  32   2048      64   False    1.0007848101
15   8  32   2048      64    True   1.20433393254
16  16  16   1024     128   False   0.96880814982
17  16  16   1024     128    True   1.01326103123
18  16  32   1024      64   False  0.950964800075
19  16  32   1024      64    True   1.08050768492
20  32  16    512     128   False  0.979909264277
21  32  16    512     128    True   1.01801789953
22  32  32    512      64   False  0.954538727308
23  32  32    512      64    True    1.0598195142
24   4  48   1024      64   False  0.993554608016
25   4  48   1024      64    True   1.16116589681

Mean: 1.03248991302
Max: 1.21945316547
Min: 0.950964800075
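
For reference, a minimal sketch of the metric with made-up throughput numbers (the GB/s values below are assumptions; only the formula comes from the comment above). When both runs share the same XeTLA reference, the metric reduces to a direct Triton-vs-Triton speedup:

```python
# Assumed throughputs, for illustration only (GB/s).
triton_slm_gbps = 410.0       # Triton with the new pass ("slm")
triton_baseline_gbps = 400.0  # Triton without it ("baseline")
xetla_gbps = 500.0            # shared XeTLA reference

xetla_speedup_slm = triton_slm_gbps / xetla_gbps
xetla_speedup_baseline = triton_baseline_gbps / xetla_gbps
speedup = xetla_speedup_slm / xetla_speedup_baseline

# With a shared XeTLA reference, the XeTLA term cancels out.
assert abs(speedup - triton_slm_gbps / triton_baseline_gbps) < 1e-12
print(f"Speedup: {speedup:.4f}")  # -> Speedup: 1.0250
```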

@victor-eds victor-eds reopened this Oct 24, 2024
victor-eds added a commit to victor-eds/intel-xpu-backend-for-triton that referenced this issue Oct 24, 2024
… to pipeline

Add the `-tritonintelgpu-optimize-reduction-locality` pass to the pipeline if the
`TRITON_INTEL_OPTIMIZE_REDUCTION_LOCALITY` environment variable is set to 1.

As shown in intel#2266, this pass gives quite promising results, although there is still
room for improvement. Conditionally enabling it will greatly help performance
investigation.

Signed-off-by: victor-eds <[email protected]>
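
A hypothetical usage sketch of the opt-in mechanism described in the commit above (it assumes the backend reads the variable when building the pass pipeline, so it is set before importing Triton):

```python
import os

# Assumption: the Intel XPU backend reads this variable when it builds the
# pass pipeline, so set it before Triton compiles any kernels.
os.environ["TRITON_INTEL_OPTIMIZE_REDUCTION_LOCALITY"] = "1"

import triton  # compile kernels as usual; the pass is now in the pipeline
```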
@etiotto (Contributor) commented Oct 24, 2024

> After integrating all of the changes we have up to this point, this is the reported speedup […] Mean: 1.03248991302 Max: 1.21945316547 Min: 0.950964800075

Looks promising! Thanks @victor-eds.
I am looking forward to enabling this by default after you have a chance to test it on other workloads.

@victor-eds (Contributor, Author)

> I am looking forward to enabling this by default after you have a chance to test it on other workloads.

I have filed a series of follow-up issues I'd like to address before a deeper performance investigation. I'll add them as a comment in this issue.

However, I would call this issue done once the outstanding PRs are merged, as the new pass is better (and generates better code) than the one in the advanced path.
