[AMD] [ROCm] Pick num_warps based on platform #326
base: main
Conversation
@@ -194,7 +194,7 @@ def cross_entropy_forward(_input, target, ignore_index, label_smoothing, reduction):
         BLOCK_SIZE=BLOCK_SIZE,
         # TODO: 32 seems to give the best performance
         # Performance is quite sensitive to num_warps
-        num_warps=32,
+        num_warps=32 if not is_hip() else 16,
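For reference, a platform check like the is_hip() used in this diff is commonly implemented by inspecting the PyTorch build; a minimal sketch (the repo's actual helper may differ in detail):

```python
# Minimal sketch of a HIP/ROCm platform check. The name is_hip() comes
# from the diff above; this exact implementation is an assumption.
import torch


def is_hip() -> bool:
    # torch.version.hip is a version string on ROCm builds of PyTorch
    # and None on CUDA builds.
    return torch.version.hip is not None
```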
Not a hardware expert, but can we perhaps benchmark num_warps=8 a bit as in vLLM, or autotune [32, 16] for NV and [16, 8] for AMD (in case of register spilling, etc.)?
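To make the suggestion concrete, here is a hedged sketch of what per-platform autotuning over those warp counts could look like. The kernel below is a toy stand-in, not the repo's cross-entropy kernel, and the candidate lists simply mirror the [32, 16] / [16, 8] split proposed above:

```python
import torch
import triton
import triton.language as tl


def is_hip() -> bool:
    # Assumed platform check: ROCm builds of PyTorch expose torch.version.hip.
    return torch.version.hip is not None


# Fewer warps on AMD to lower register pressure; more on NVIDIA.
WARP_CANDIDATES = [16, 8] if is_hip() else [32, 16]


@triton.autotune(
    configs=[triton.Config({}, num_warps=w) for w in WARP_CANDIDATES],
    key=["n_cols"],  # re-tune when the row width changes
)
@triton.jit
def _double_rows_kernel(x_ptr, y_ptr, n_cols, BLOCK_SIZE: tl.constexpr):
    # Toy row-wise kernel standing in for the real cross-entropy kernel.
    row = tl.program_id(0)
    offsets = tl.arange(0, BLOCK_SIZE)
    mask = offsets < n_cols
    x = tl.load(x_ptr + row * n_cols + offsets, mask=mask)
    tl.store(y_ptr + row * n_cols + offsets, x * 2.0, mask=mask)
```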
I have added an extensive parameter search space sweep for layer norm, yet it still cannot outperform huggingface at smaller hidden sizes.
Added Code
import triton


def get_amd_triton_config_list():
    # Sweep ROCm-specific launch hints together with num_stages and num_warps.
    waves_per_eu = [0, 1, 2]
    matrix_instr_nonkdim = [16, 32]
    num_stages = [0, 1, 2]
    num_warps = [4, 8, 16]
    config_list = []
    for wpe in waves_per_eu:
        for kdim in matrix_instr_nonkdim:
            for ns in num_stages:
                for nw in num_warps:
                    config_list.append(
                        triton.Config(
                            {
                                "waves_per_eu": wpe,
                                "matrix_instr_nonkdim": kdim,
                            },
                            num_stages=ns,
                            num_warps=nw,
                        )
                    )
    return config_list


@triton.autotune(
    configs=get_amd_triton_config_list(),
    key=["BLOCK_SIZE"],
)
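One caveat worth flagging about the sweep above (my reading, not something verified in this PR): waves_per_eu is a ROCm-specific occupancy hint, while matrix_instr_nonkdim selects the MFMA instruction shape and should only affect kernels that use tl.dot. Layer norm performs no matmul, so part of this search space is likely inert, which would be consistent with the sweep finding no clear win.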
Benchmark results
OPTIMIZE_EPILOGUE=1 TRITON_PRINT_AUTOTUNING=1 python scripts/benchmark_layer_norm.py
**************************************
BENCHMARKING SPEED for LAYER_NORM
**************************************
Triton autotuning for function _layer_norm_forward_kernel finished after 12.43s; best config selected: waves_per_eu: 1, matrix_instr_nonkdim: 16, num_warps: 4, num_ctas: 1, num_stages: 2, maxnreg: None;
Triton autotuning for function _layer_norm_forward_kernel finished after 12.24s; best config selected: waves_per_eu: 0, matrix_instr_nonkdim: 32, num_warps: 4, num_ctas: 1, num_stages: 2, maxnreg: None;
Triton autotuning for function _layer_norm_forward_kernel finished after 12.22s; best config selected: waves_per_eu: 2, matrix_instr_nonkdim: 32, num_warps: 8, num_ctas: 1, num_stages: 0, maxnreg: None;
Triton autotuning for function _layer_norm_forward_kernel finished after 12.17s; best config selected: waves_per_eu: 0, matrix_instr_nonkdim: 16, num_warps: 8, num_ctas: 1, num_stages: 0, maxnreg: None;
Triton autotuning for function _layer_norm_forward_kernel finished after 12.54s; best config selected: waves_per_eu: 1, matrix_instr_nonkdim: 16, num_warps: 8, num_ctas: 1, num_stages: 1, maxnreg: None;
Triton autotuning for function _layer_norm_backward_kernel finished after 12.19s; best config selected: waves_per_eu: 2, matrix_instr_nonkdim: 32, num_warps: 4, num_ctas: 1, num_stages: 1, maxnreg: None;
Triton autotuning for function _layer_norm_backward_kernel finished after 12.11s; best config selected: waves_per_eu: 2, matrix_instr_nonkdim: 32, num_warps: 8, num_ctas: 1, num_stages: 0, maxnreg: None;
Triton autotuning for function _layer_norm_backward_kernel finished after 12.11s; best config selected: waves_per_eu: 0, matrix_instr_nonkdim: 32, num_warps: 16, num_ctas: 1, num_stages: 0, maxnreg: None;
Triton autotuning for function _layer_norm_backward_kernel finished after 12.51s; best config selected: waves_per_eu: 1, matrix_instr_nonkdim: 32, num_warps: 8, num_ctas: 1, num_stages: 0, maxnreg: None;
Triton autotuning for function _layer_norm_backward_kernel finished after 12.70s; best config selected: waves_per_eu: 1, matrix_instr_nonkdim: 16, num_warps: 8, num_ctas: 1, num_stages: 0, maxnreg: None;
********** Benchmark Data **********
[
{
"kernel_name": "layer_norm",
"kernel_provider": "liger",
"metric_name": "speed",
"metric_unit": "ms",
"gpu_name": "AMD Instinct MI300X",
"x_name": "N",
"x_label": "hidden size",
"x_values": [
1024,
2048,
4096,
8192,
16384
],
"y_values_50": [
0.05548600107431412,
0.0498029999434948,
0.060717999935150146,
0.09918499737977982,
0.17343400418758392
],
"y_values_20": [
0.05348199978470802,
0.047047000378370285,
0.05381200090050697,
0.09738200157880783,
0.16993799805641174
],
"y_values_80": [
0.058132000267505646,
0.05293000116944313,
0.08238700032234192,
0.10034800320863724,
0.17636018991470337
],
"timestamp": "2024-10-28 04:43:07",
"kernel_operation_mode": "forward",
"extra_benchmark_config_str": "{\"M\": 4096, \"dtype\": \"torch.float32\", \"eps\": 1e-06}",
"liger_version": "0.3.1"
},
{
"kernel_name": "layer_norm",
"kernel_provider": "huggingface",
"metric_name": "speed",
"metric_unit": "ms",
"gpu_name": "AMD Instinct MI300X",
"x_name": "N",
"x_label": "hidden size",
"x_values": [
1024,
2048,
4096,
8192,
16384
],
"y_values_50": [
0.02472599968314171,
0.03308499976992607,
0.05716999992728233,
0.11405900120735168,
0.22450999915599823
],
"y_values_20": [
0.023934999480843544,
0.0322830006480217,
0.05523499846458435,
0.11289700120687485,
0.22289039194583893
],
"y_values_80": [
0.026359200477600098,
0.06141600012779236,
0.05879399925470352,
0.11538200080394745,
0.22627399861812592
],
"timestamp": "2024-10-28 04:43:10",
"kernel_operation_mode": "forward",
"extra_benchmark_config_str": "{\"M\": 4096, \"dtype\": \"torch.float32\", \"eps\": 1e-06}",
"liger_version": "0.3.1"
},
{
"kernel_name": "layer_norm",
"kernel_provider": "liger",
"metric_name": "speed",
"metric_unit": "ms",
"gpu_name": "AMD Instinct MI300X",
"x_name": "N",
"x_label": "hidden size",
"x_values": [
1024,
2048,
4096,
8192,
16384
],
"y_values_50": [
0.3861970007419586,
0.9399949908256531,
0.9476320147514343,
1.0064010620117188,
1.017171025276184
],
"y_values_20": [
0.3667530119419098,
0.8674409985542297,
0.6628599762916565,
0.855912983417511,
0.8749900460243225
],
"y_values_80": [
0.422760009765625,
0.9504649639129639,
0.9593693614006042,
1.0176535844802856,
1.035987138748169
],
"timestamp": "2024-10-28 04:44:14",
"kernel_operation_mode": "full",
"extra_benchmark_config_str": "{\"M\": 4096, \"dtype\": \"torch.float32\", \"eps\": 1e-06}",
"liger_version": "0.3.1"
},
{
"kernel_name": "layer_norm",
"kernel_provider": "huggingface",
"metric_name": "speed",
"metric_unit": "ms",
"gpu_name": "AMD Instinct MI300X",
"x_name": "N",
"x_label": "hidden size",
"x_values": [
1024,
2048,
4096,
8192,
16384
],
"y_values_50": [
0.3276045024394989,
0.3255690038204193,
0.34742000699043274,
0.4774940013885498,
0.9882450103759766
],
"y_values_20": [
0.32131001353263855,
0.32023200392723083,
0.34053999185562134,
0.4757609963417053,
0.9819509983062744
],
"y_values_80": [
0.33827799558639526,
0.3349609971046448,
0.3595069944858551,
0.4794589877128601,
0.9944999814033508
],
"timestamp": "2024-10-28 04:44:17",
"kernel_operation_mode": "full",
"extra_benchmark_config_str": "{\"M\": 4096, \"dtype\": \"torch.float32\", \"eps\": 1e-06}",
"liger_version": "0.3.1"
}
]
**************************************
BENCHMARKING MEMORY for LAYER_NORM
**************************************
********** Benchmark Data **********
[
{
"kernel_name": "layer_norm",
"kernel_provider": "liger",
"metric_name": "memory",
"metric_unit": "MB",
"gpu_name": "AMD Instinct MI300X",
"x_name": "N",
"x_label": "hidden size",
"x_values": [
1024,
2048,
4096,
8192,
16384
],
"y_values_50": [
82.4375,
164.84375,
329.65625,
659.28125,
1320.53125
],
"y_values_20": [
82.4375,
164.84375,
329.65625,
659.28125,
1320.53125
],
"y_values_80": [
82.4375,
164.84375,
329.65625,
659.28125,
1320.53125
],
"timestamp": "2024-10-28 04:44:17",
"kernel_operation_mode": "full",
"extra_benchmark_config_str": "{\"M\": 4096, \"dtype\": \"torch.float32\", \"eps\": 1e-06}",
"liger_version": "0.3.1"
},
{
"kernel_name": "layer_norm",
"kernel_provider": "huggingface",
"metric_name": "memory",
"metric_unit": "MB",
"gpu_name": "AMD Instinct MI300X",
"x_name": "N",
"x_label": "hidden size",
"x_values": [
1024,
2048,
4096,
8192,
16384
],
"y_values_50": [
80.5625,
161.09375,
322.15625,
644.28125,
1288.53125
],
"y_values_20": [
80.5625,
161.09375,
322.15625,
644.28125,
1288.53125
],
"y_values_80": [
80.5625,
161.09375,
322.15625,
644.28125,
1288.53125
],
"timestamp": "2024-10-28 04:44:17",
"kernel_operation_mode": "full",
"extra_benchmark_config_str": "{\"M\": 4096, \"dtype\": \"torch.float32\", \"eps\": 1e-06}",
"liger_version": "0.3.1"
}
]
There is no noticeable performance gain from autotuning, so I would suggest keeping things simple: setting num_warps to 16 is sufficient for now.
Summary
This PR enables the kernel to run on AMD GPUs through an initial change to num_warps. The change was proposed by @Edenzzzz and @DocShotgun in issue #266.
Testing Done
make test to ensure correctness: test/transformers/test_cross_entropy.py::test_correctness_with_ignore_index[10.0-dtype5-1e-08-1e-06-sum-2-4096-32000--100] passes when run on its own via pytest test/transformers/test_cross_entropy.py::test_correctness_with_ignore_index[10.0-dtype5-1e-08-1e-06-sum-2-4096-32000--100]; however, it fails if other tests run before it.
make checkstyle to ensure code style
make test-convergence to ensure convergence
Failure Test Logs
```bash
============================================================= FAILURES =============================================================
________________________ test_correctness_with_ignore_index[10.0-dtype5-1e-08-1e-06-sum-2-4096-32000--100] _________________________