MiGraphx CPU/GPU Status Tracking #325

Open · zjgarvey opened this issue Aug 14, 2024 · 3 comments

zjgarvey (Contributor) commented Aug 14, 2024

This issue tracks compilation failures for migraphx models on CPU and GPU. Each model's compile failure should link to an issue with a smaller reproducer in the Notes column.

Notes:

1. migraphx_ORT__bert_base_cased_1 fails on CPU but passes on GPU, while other adjacent models fail for similar reasons on both. Very odd.
2. Not including migraphx_sdxl__unet__model and migraphx_ORT__bert_large_uncased_1, because they crash the test run (likely OOM).
3. Not including any of the TF models yet.

CPU Status Table

The following report was generated with IREE compiler version iree-org/iree@caacf6c and torch-mlir version llvm/torch-mlir@2665ed3.

Passing Summary

TOTAL TESTS = 30

| Stage | # Passing | % of Total | % of Attempted |
| --- | --- | --- | --- |
| Setup | 30 | 100.0% | 100.0% |
| IREE Compilation | 24 | 80.0% | 80.0% |
| Gold Inference | 22 | 73.3% | 91.7% |
| IREE Inference Invocation | 19 | 63.3% | 86.4% |
| Inference Comparison (PASS) | 15 | 50.0% | 78.9% |

Fail Summary

TOTAL TESTS = 30

| Stage | # Failed at Stage | % of Total |
| --- | --- | --- |
| Setup | 0 | 0.0% |
| IREE Compilation | 6 | 20.0% |
| Gold Inference | 2 | 6.7% |
| IREE Inference Invocation | 3 | 10.0% |
| Inference Comparison | 4 | 13.3% |

Test Run Detail

Test was run with the following arguments:

```
Namespace(device='local-task', backend='llvm-cpu', iree_compile_args=None, mode='cl-onnx-iree', torchtolinalg=True, stages=None, skip_stages=None, benchmark=False, load_inputs=False, groups='all', test_filter='migraphx', testsfile=None, tolerance=None, verbose=True, rundirectory='test-run', no_artifacts=False, cleanup='0', report=True, report_file='mi_10_10.md')
```
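
For context on the stage names below: the import_model stage converts the ONNX protobuf to MLIR before iree-compile runs. A minimal sketch of that step, assuming the iree-import-onnx tool from the IREE Python packages and placeholder file names:

```shell
# Convert an ONNX model into MLIR that iree-compile can consume.
# model.onnx / model.mlir are placeholder paths.
iree-import-onnx model.onnx -o model.mlir
```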

| Test | Exit Status | Mean Benchmark Time (ms) | Notes |
| --- | --- | --- | --- |
| migraphx_agentmodel__AgentModel | compilation | None | iree-18268, iree-18412, torch-mlir-3651 |
| migraphx_bert__bert-large-uncased | preprocessing | None | |
| migraphx_bert__bertsquad-12 | Numerics | None | |
| migraphx_cadene__dpn92i1 | PASS | None | |
| migraphx_cadene__inceptionv4i16 | PASS | None | |
| migraphx_cadene__resnext101_64x4di1 | PASS | None | |
| migraphx_cadene__resnext101_64x4di16 | PASS | None | |
| migraphx_huggingface-transformers__bert_mrpc8 | native_inference | None | |
| migraphx_mlperf__bert_large_mlperf | Numerics | None | |
| migraphx_mlperf__resnet50_v1 | PASS | None | |
| migraphx_models__whisper-tiny-decoder | compiled_inference | None | |
| migraphx_models__whisper-tiny-encoder | native_inference | None | |
| migraphx_onnx-misc__taau_low_res_downsample_d2s_for_infer_time_fp16_opset11 | import_model | None | |
| migraphx_onnx-model-zoo__gpt2-10 | preprocessing | None | |
| migraphx_ORT__bert_base_cased_1 | PASS | None | |
| migraphx_ORT__bert_base_uncased_1 | PASS | None | |
| migraphx_ORT__bert_large_uncased_1 | PASS | None | |
| migraphx_ORT__distilgpt2_1 | compiled_inference | None | |
| migraphx_ORT__onnx_models__bert_base_cased_1_fp16_gpu | Numerics | None | |
| migraphx_ORT__onnx_models__bert_large_uncased_1_fp16_gpu | Numerics | None | |
| migraphx_ORT__onnx_models__distilgpt2_1_fp16_gpu | compiled_inference | None | |
| migraphx_pytorch-examples__wlang_gru | PASS | None | |
| migraphx_pytorch-examples__wlang_lstm | PASS | None | |
| migraphx_sd__unet__model | import_model | None | |
| migraphx_sdxl__unet__model | import_model | None | |
| migraphx_torchvision__densenet121i32 | PASS | None | |
| migraphx_torchvision__inceptioni1 | PASS | None | |
| migraphx_torchvision__inceptioni32 | PASS | None | |
| migraphx_torchvision__resnet50i1 | PASS | None | |
| migraphx_torchvision__resnet50i64 | PASS | None | |

OLD STATUS (will be updated; issues will migrate to the current table)

| Test | Exit Status | Notes |
| --- | --- | --- |
| migraphx_agentmodel__AgentModel | compilation | |
| migraphx_bert__bert-large-uncased | compilation | iree-18269; two IRs reported under this, showing different behavior |
| migraphx_bert__bertsquad-12 | compilation | iree-18267, torch-mlir-3647 |
| migraphx_cadene__dpn92i1 | PASS | |
| migraphx_cadene__inceptionv4i16 | PASS | |
| migraphx_cadene__resnext101_64x4di1 | PASS | |
| migraphx_cadene__resnext101_64x4di16 | PASS | |
| migraphx_huggingface-transformers__bert_mrpc8 | compilation | iree-18413 |
| migraphx_mlperf__bert_large_mlperf | compilation | iree-18297 |
| migraphx_mlperf__resnet50_v1 | PASS | |
| migraphx_models__whisper-tiny-decoder | compilation | torch-mlir-3647 |
| migraphx_models__whisper-tiny-encoder | compilation | torch-mlir-3647 |
| migraphx_onnx-misc__taau_low_res_downsample_d2s_for_infer_time_fp16_opset11 | construct_inputs | ORT issue with resize with f16 inputs? |
| migraphx_onnx-model-zoo__gpt2-10 | compilation | shark-turbine-465, torch-mlir-615, torch-mlir-3293 |
| migraphx_ORT__bert_base_cased_1 | Numerics | Passed when '--iree-input-demote-i64-to-i32' is not present (see the compile sketch after this table); iree-18273 |
| migraphx_ORT__bert_base_uncased_1 | Numerics | Passed when '--iree-input-demote-i64-to-i32' is not present |
| migraphx_ORT__bert_large_uncased_1 | compilation | Crashes; "MatMul" fails to legalize stream.cmd.dispatch. iree-org/iree#18229, llvm/torch-mlir#3647 ?? |
| migraphx_ORT__distilgpt2_1 | Numerics | |
| migraphx_ORT__onnx_models__bert_base_cased_1_fp16_gpu | Numerics | |
| migraphx_ORT__onnx_models__bert_large_uncased_1_fp16_gpu | Numerics | |
| migraphx_ORT__onnx_models__distilgpt2_1_fp16_gpu | Numerics | |
| migraphx_pytorch-examples__wlang_gru | Numerics | iree-18441 |
| migraphx_pytorch-examples__wlang_lstm | Numerics | iree-18441 |
| migraphx_sd__unet__model | import_model | Killed during MLIR import; too big? |
| migraphx_sdxl__unet__model | import_model | Killed during MLIR import; too big? |
| migraphx_torchvision__densenet121i32 | PASS | |
| migraphx_torchvision__inceptioni1 | PASS | |
| migraphx_torchvision__inceptioni32 | PASS | |
| migraphx_torchvision__resnet50i1 | PASS | |
| migraphx_torchvision__resnet50i64 | PASS | |
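
For the bert_base_cased/uncased numerics notes above, a minimal sketch of the two compiles being compared, with and without the i64-to-i32 input demotion; model.mlir and the output names are placeholders:

```shell
# Baseline CPU compile (reported to pass numerics).
iree-compile model.mlir --iree-hal-target-backends=llvm-cpu -o model.vmfb

# Same compile with the demotion flag that reportedly triggers the numerics failure.
iree-compile model.mlir --iree-hal-target-backends=llvm-cpu \
  --iree-input-demote-i64-to-i32 -o model_demoted.vmfb
```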

GPU Status Table

Last generated with pip-installed IREE tools at version:

```
iree-compiler      20240903.1005
iree-runtime       20240903.1005
```
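
To reproduce that toolchain, the nightly wheels can be pinned explicitly; a sketch assuming the nightly package index documented on iree.dev:

```shell
python -m pip install \
  iree-compiler==20240903.1005 iree-runtime==20240903.1005 \
  -f https://iree.dev/pip-release-links.html
```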

Summary

| Stage | Count |
| --- | --- |
| Total | 21 (non-crashing; see table below) |
| PASS | 12 |
| Numerics | 2 |
| results-summary | 0 |
| postprocessing | 0 |
| compiled_inference | up to 5 (crash during this stage; not included in total) |
| compilation | 4 |
| preprocessing | 0 |
| import_model | 1 |
| native_inference | 2 |
| construct_inputs | 0 |
| setup | 0 |

Test Run Detail

Test was run with the following arguments:

```
Namespace(device='hip://1', backend='rocm', iree_compile_args=['iree-hip-target=gfx942'], mode='onnx-iree', torchtolinalg=False, stages=None, skip_stages=None, load_inputs=False, groups='all', test_filter='migraphx', tolerance=None, verbose=True, rundirectory='test-run', no_artifacts=False, report=True, report_file='9_3_migraphx.md')
```
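
Those arguments imply a HIP/ROCm compile targeting gfx942 (MI300-class). A minimal sketch of the corresponding iree-compile invocation, with placeholder file names:

```shell
# Compile for the ROCm/HIP backend on an MI300 (gfx942) target.
iree-compile model.mlir \
  --iree-hal-target-backends=rocm \
  --iree-hip-target=gfx942 \
  -o model_rocm.vmfb
```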

| Test | Exit Status | Notes |
| --- | --- | --- |
| migraphx_agentmodel__AgentModel | compilation | Related: llvm/torch-mlir#3630 |
| migraphx_bert__bert-large-uncased | compilation | Operand return type issue (see CPU table) |
| migraphx_bert__bertsquad-12 | compilation (without shape inference) / compiled_inference | 1. Without the shape-inference torch-mlir passes in the torch-to-iree pipeline, we get an all-dynamic squeeze-dim op. 2. Using torch-lower-to-backend-contract to recover the shape information instead, the test crashes during inference with an out-of-bounds memory access. |
| migraphx_cadene__dpn92i1 | PASS | |
| migraphx_cadene__inceptionv4i16 | PASS | |
| migraphx_cadene__resnext101_64x4di1 | PASS | |
| migraphx_cadene__resnext101_64x4di16 | PASS | |
| migraphx_huggingface-transformers__bert_mrpc8 | native_inference | |
| migraphx_mlperf__bert_large_mlperf | native_inference | |
| migraphx_mlperf__resnet50_v1 | PASS | |
| migraphx_onnx-misc__taau_low_res_downsample_d2s_for_infer_time_fp16_opset11 | import_model | |
| migraphx_onnx-model-zoo__gpt2-10 | compilation | nod-ai/SHARK-ModelDev#465, llvm/torch-mlir#615, llvm/torch-mlir#3293 |
| migraphx_ORT__bert_base_cased_1 | PASS | |
| migraphx_ORT__bert_base_uncased_1 | PASS | |
| migraphx_ORT__distilgpt2_1 | likely compiled_inference | Crashes with "Memory access fault by GPU node-3 (Agent handle: 0x5595fe450840) on address 0x7f1811a56000. Reason: Unknown." |
| migraphx_ORT__onnx_models__bert_base_cased_1_fp16_gpu | compiled_inference | Hard crash from an out-of-bounds memory access (MI300x) |
| migraphx_ORT__onnx_models__bert_large_uncased_1_fp16_gpu | compiled_inference | Same crash as above |
| migraphx_ORT__onnx_models__distilgpt2_1_fp16_gpu | likely compiled_inference | Crashes with "Memory access fault by GPU node-3 (Agent handle: 0x5595fe450840) on address 0x7f1811a56000. Reason: Unknown." |
| migraphx_pytorch-examples__wlang_gru | Numerics | |
| migraphx_pytorch-examples__wlang_lstm | Numerics | |
| migraphx_torchvision__densenet121i32 | PASS | |
| migraphx_torchvision__inceptioni1 | PASS | |
| migraphx_torchvision__inceptioni32 | PASS | |
| migraphx_torchvision__resnet50i1 | PASS | |
| migraphx_torchvision__resnet50i64 | PASS | |

Note: the GPU table is missing the sd model, which runs out of memory and kills the test run. This probably happens during native inference, so it may need some looking into.

Performance data with iree-benchmark-module on GPU

Summary

| Stage | Count |
| --- | --- |
| Total | 30 |
| PASS | 13 |
| Numerics | 3 |
| results-summary | 0 |
| postprocessing | 0 |
| benchmark | 0 |
| compiled_inference | 2 |
| native_inference | 1 |
| construct_inputs | 0 |
| compilation | 8 |
| preprocessing | 0 |
| import_model | 3 |
| setup | 0 |

Test Run Detail

Test was run with the following arguments:

```
Namespace(device='local-task', backend='llvm-cpu', iree_compile_args=None, mode='cl-onnx-iree', torchtolinalg=False, stages=None, skip_stages=None, benchmark=True, load_inputs=False, groups='all', test_filter='migraphx', testsfile=None, tolerance=None, verbose=True, rundirectory='test-run', no_artifacts=False, cleanup='0', report=True, report_file='report.md')
```
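
The mean times below come from iree-benchmark-module. A minimal sketch of a single-model invocation; the entry-point name (main_graph) and the input shape are assumptions that depend on the imported model:

```shell
# Benchmark one compiled module on the local-task (CPU) driver.
iree-benchmark-module \
  --module=model.vmfb \
  --device=local-task \
  --function=main_graph \
  --input=1x384xi64=0
```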

| Test | Exit Status | Mean Benchmark Time (ms) | Notes |
| --- | --- | --- | --- |
| migraphx_agentmodel__AgentModel | compilation | None | |
| migraphx_bert__bert-large-uncased | compilation | None | |
| migraphx_bert__bertsquad-12 | compilation | None | |
| migraphx_cadene__dpn92i1 | PASS | 457.44 | |
| migraphx_cadene__inceptionv4i16 | PASS | 26072.67 | |
| migraphx_cadene__resnext101_64x4di1 | PASS | 995.68 | |
| migraphx_cadene__resnext101_64x4di16 | PASS | 6324.31 | |
| migraphx_huggingface-transformers__bert_mrpc8 | compilation | None | |
| migraphx_mlperf__bert_large_mlperf | PASS | 8195.63 | |
| migraphx_mlperf__resnet50_v1 | PASS | 219.82 | |
| migraphx_models__whisper-tiny-decoder | compiled_inference | None | |
| migraphx_models__whisper-tiny-encoder | native_inference | None | |
| migraphx_onnx-misc__taau_low_res_downsample_d2s_for_infer_time_fp16_opset11 | import_model | None | |
| migraphx_onnx-model-zoo__gpt2-10 | compilation | None | |
| migraphx_ORT__bert_base_cased_1 | PASS | 817.48 | |
| migraphx_ORT__bert_base_uncased_1 | compilation | None | |
| migraphx_ORT__bert_large_uncased_1 | PASS | 2728.98 | |
| migraphx_ORT__distilgpt2_1 | compiled_inference | None | |
| migraphx_ORT__onnx_models__bert_base_cased_1_fp16_gpu | Numerics | 2141.36 | |
| migraphx_ORT__onnx_models__bert_large_uncased_1_fp16_gpu | Numerics | 6767.57 | |
| migraphx_ORT__onnx_models__distilgpt2_1_fp16_gpu | Numerics | 101.96 | |
| migraphx_pytorch-examples__wlang_gru | compilation | None | |
| migraphx_pytorch-examples__wlang_lstm | compilation | None | |
| migraphx_sd__unet__model | import_model | None | |
| migraphx_sdxl__unet__model | import_model | None | |
| migraphx_torchvision__densenet121i32 | PASS | 2639.90 | |
| migraphx_torchvision__inceptioni1 | PASS | 627.42 | |
| migraphx_torchvision__inceptioni32 | PASS | 22124.73 | |
| migraphx_torchvision__resnet50i1 | PASS | 284.15 | |
| migraphx_torchvision__resnet50i64 | PASS | 11100.90 | |
nirvedhmeshram commented

@zjgarvey added llvm/torch-mlir#3647 to some of the models, as we need that along with iree-org/iree#18229.

MaheshRavishankar commented

cc @lialan as well. Can you coordinate with Zach to track CPU codegen issues?

nirvedhmeshram commented

Also adding llvm/torch-mlir#3651, which is needed to support a broad range of models.
