Skip to content

Commit

Permalink
Updated benchmarks for latest version
Browse files Browse the repository at this point in the history
  • Loading branch information
artyom-beilis committed Sep 4, 2024
1 parent ad0185f commit bdda551
Show file tree
Hide file tree
Showing 15 changed files with 277 additions and 144 deletions.
197 changes: 54 additions & 143 deletions benchmark.md
Original file line number Diff line number Diff line change
@@ -1,143 +1,54 @@
# Benchmarks

Below benchmarks done for comparison of rx6600xt and gtx960 - GPUs
of cuda and rocm backends vs `pytorch_ocl`

Depending on the network training performance is around 60 to 90 percent
inference performance is somewhat better.

Notes: time in ms per batch - smaller is better, input is standard imagenet
input Batchx3x224x224


## Training


rx6600xt/8gb batch size rocm/hip opencl Raito %
alexnet 64 57.848 82.381 70.2
resnet18 64 146.917 238.889 61.5
resnet50 32 266.441 357.985 74.4
convnext_small 16 337.252 583.794 57.8
vgg16 16 206.312 348.692 59.2
densenet161 16 296.807 485.035 61.2
mobilenet_v2 32 157.476 197.886 79.6
mobilenet_v3_small 64 92.506 120.406 76.8
mobilenet_v3_large 64 286.795 319.938 89.6
resnext50_32x4d 32 336.464 491.112 68.5
wide_resnet50_2 32 466.841 642.973 72.6
mnasnet1_0 32 159.97 167.306 95.6
efficientnet_b0 32 205.69 305.157 67.4
regnet_y_400mf 64 171.691 244.587 70.2

Average 71.8
gtx960/4gb batch size c cuda opencl Raito %
alexnet 64 128.142 270.006 47.5
resnet18 64 415.589 746.578 55.7
resnet50 16 373.932 599.182 62.4
convnext_small 8 1128.995 1175.585 96.0
vgg16 8 364.176 561.695 64.8
densenet161 8 463.427 728.693 63.6
mobilenet_v2 16 173.13 352.728 49.1
mobilenet_v3_small 32 101.621 206.353 49.2
mobilenet_v3_large 32 263.055 523.575 50.2
resnext50_32x4d 16 539.007 846.71 63.7
wide_resnet50_2 16 677.57 1040.154 65.1
mnasnet1_0 16 167.542 322.004 52.0
efficientnet_b0 16 241.023 540.09 44.6
regnet_y_400mf 32 353.889 391.025 90.5

Average 61.0
## Inference

Note, since my AMD and Nvidia gpus have different memory size differnet
batch sizes were used


rx6600xt/8gb rocm/hip opencl Ratio % Batch=64
convnext_small 476.549 600.921 79.3
alexnet 24.587 26.311 93.4
resnet18 41.375 59.375 69.7
resnet50 165.261 194.512 85.0
vgg16 205.124 309.937 66.2
densenet161 409.38 414.496 98.8
inception_v3 90.635 131.685 68.8
mobilenet_v2 77.691 93.701 82.9
mobilenet_v3_small 22.203 26.151 84.9
mobilenet_v3_large 63.229 70.458 89.7
resnext50_32x4d 244.676 274.791 89.0
wide_resnet50_2 320.313 402.687 79.5
mnasnet1_0 74.141 75.162 98.6
efficientnet_b0 104.396 114.898 90.9
efficientnet_b4 303.468 276.226 109.9
regnet_y_400mf 43.298 57.491 75.3

Average 85.1
gtx960/4gb cuda opencl Ratio % Batch=32
convnext_small 751.713 1206.871 62.3
alexnet 29.446 44.27 66.5
resnet18 66.053 93.352 70.8
resnet50 214.787 316.754 67.8
vgg16 350.278 486.743 72.0
densenet161 511.183 587.856 87.0
inception_v3 167.233 217.664 76.8
mobilenet_v2 86.572 161.797 53.5
mobilenet_v3_small 27.748 49.359 56.2
mobilenet_v3_large 68.79 121.644 56.6
resnext50_32x4d 284.697 440.466 64.6
wide_resnet50_2 376.114 587.801 64.0
mnasnet1_0 82.576 132.463 62.3
efficientnet_b0 111.154 202.593 54.9
efficientnet_b4 299.779 499.841 60.0
regnet_y_400mf 99.336 95.446 104.1

Average 67.5
ppppppppppppppppppp
# Setup

Tested 3 setups, pytorch 2.4

1. AMD rx6600XT, OpenCL drivers vs official ROCM pytorch (6.1)
2. NVidia rx960, OpenCL drivers vs official CUDA 12.2
3. Inter Arc A380, OpenCL NEO driver vs XPU - intel extension for pytorch (2.1 since it what was released)

Input is standard Image net batchx3x224x224, time in milliseconds, lower is better.

# Training



|AMD||||||Nvidia||||||Intel|||||
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
|rx6600xt|batch|OpenCL|ROCM|% Perf||gtx960|batch|OpenCL|CUDA|% Perf||A380|batch|OpenCL|XPU|% Perf|
|alexnet|64|75.239|57.957|77||alexnet|64|257.09|130.561|51||alexnet|64|482.139|133.512|28|
|resnet18|64|238.927|147.099|62||resnet18|64|695.096|419.69|60||resnet18|64|1044.985|397.738|38|
|resnet50|32|358.872|266.155|74||resnet50|16|591.143|375.644|64||resnet50|16|640.916|329.849|51|
|convnext_small|16|608.297|337.736|56||convnext_small|8|1001.294|1120.676|112||convnext_small|8|841.302|259.292|31|
|vgg16|16|343.962|206.243|60||vgg16|8|520.75|363.288|70||vgg16|8|780.692|479.314|61|
|densenet161|16|494.175|297.001|60||densenet161|8|698.842|464.051|66||densenet161|8|834.207|423.883|51|
|mobilenet_v2|32|206.255|157.743|76||mobilenet_v2|16|335.279|173.748|52||mobilenet_v2|16|405.541|153.694|38|
|mobilenet_v3_small|64|130.571|92.83|71||mobilenet_v3_small|32|196.173|102.561|52||mobilenet_v3_small|32|275.302|92.086|33|
|mobilenet_v3_large|64|330.269|287.3|87||mobilenet_v3_large|32|497.168|264.072|53||mobilenet_v3_large|32|642.568|226.292|35|
|resnext50_32x4d|32|490.971|336.183|68||resnext50_32x4d|16|807.178|539.026|67||resnext50_32x4d|16|1068.918|396.39|37|
|wide_resnet50_2|32|643.083|468.04|73||wide_resnet50_2|16|1023.105|677.723|66||wide_resnet50_2|16|1373.346|634.213|46|
|mnasnet1_0|32|167.934|160.254|95||mnasnet1_0|16|302.854|167.911|55||mnasnet1_0|16|383.069|126.56|33|
|efficientnet_b0|32|313.972|205.674|66||efficientnet_b0|16|515.058|241.311|47||efficientnet_b0|16|531.724|203.157|38|
|regnet_y_400mf|64|246.069|171.841|70||regnet_y_400mf|32|361.507|353.584|98||regnet_y_400mf|32|635.279|224.228|35|
|Average||||71||Average||||65||Average||||40|

# Inference


|AMD||||||Nvidia||||||Intel|||||
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
|rx6600xt|batch|OpenCL|ROCM|% Perf||gtx960|batch|OpenCL|CUDA|% Perf||A380|batch|OpenCL|XPU|% Perf|
|alexnet|64|24.543|24.642|100||alexnet|32|45.007|30.271|67||alexnet|32|55.5|25.835|47|
|resnet18|64|59.428|41.569|70||resnet18|32|94.044|66.61|71||resnet18|32|113.002|55.647|49|
|resnet50|64|196.75|165.706|84||resnet50|32|316.899|215.245|68||resnet50|32|271.778|145.842|54|
|convnext_small|64|632.215|478.088|76||convnext_small|32|881.586|751.286|85||convnext_small|32|670.291|294.405|44|
|vgg16|64|310.767|205.745|66||vgg16|32|490.68|351.488|72||vgg16|32|801.684|333.954|42|
|densenet161|64|415.707|410.906|99||densenet161|32|589.712|510.883|87||densenet161|32|685.154|315.407|46|
|mobilenet_v2|64|93.699|77.774|83||mobilenet_v2|32|162.4|87.376|54||mobilenet_v2|32|100.363|51.589|51|
|mobilenet_v3_small|64|25.653|22.253|87||mobilenet_v3_small|32|50.097|28.739|57||mobilenet_v3_small|32|36.92|26.508|72|
|mobilenet_v3_large|64|70.409|63.28|90||mobilenet_v3_large|32|122.416|69.432|57||mobilenet_v3_large|32|84.413|52.328|62|
|resnext50_32x4d|64|274.967|245.411|89||resnext50_32x4d|32|440.411|284.571|65||resnext50_32x4d|32|359.037|169.194|47|
|wide_resnet50_2|64|404.214|321.398|80||wide_resnet50_2|32|589.164|376.938|64||wide_resnet50_2|32|682.184|321.014|47|
|mnasnet1_0|64|75.027|74.211|99||mnasnet1_0|32|133.324|83.407|63||mnasnet1_0|32|91.441|51.785|57|
|efficientnet_b0|64|114.735|104.417|91||efficientnet_b0|32|203.531|111.822|55||efficientnet_b0|32|129.755|88.131|68|
|regnet_y_400mf|64|57.408|43.313|75||regnet_y_400mf|32|96.079|99.022|103||regnet_y_400mf|32|87.756|56.503|64|
|Average||||85||Average||||69||Average||||54|
54 changes: 54 additions & 0 deletions benchmarks/v0.2.0/summary.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,54 @@
# Setup

Tested 3 setups, pytorch 2.4

1. AMD rx6600XT, OpenCL drivers vs official ROCM pytorch (6.1)
2. NVidia rx960, OpenCL drivers vs official CUDA 12.2
3. Inter Arc A380, OpenCL NEO driver vs XPU - intel extension for pytorch (2.1 since it what was released)

Input is standard Image net batchx3x224x224, time in milliseconds, lower is better.

# Training



|AMD||||||Nvidia||||||Intel|||||
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
|rx6600xt|batch|OpenCL|ROCM|% Perf||gtx960|batch|OpenCL|CUDA|% Perf||A380|batch|OpenCL|XPU|% Perf|
|alexnet|64|75.239|57.957|77||alexnet|64|257.09|130.561|51||alexnet|64|482.139|133.512|28|
|resnet18|64|238.927|147.099|62||resnet18|64|695.096|419.69|60||resnet18|64|1044.985|397.738|38|
|resnet50|32|358.872|266.155|74||resnet50|16|591.143|375.644|64||resnet50|16|640.916|329.849|51|
|convnext_small|16|608.297|337.736|56||convnext_small|8|1001.294|1120.676|112||convnext_small|8|841.302|259.292|31|
|vgg16|16|343.962|206.243|60||vgg16|8|520.75|363.288|70||vgg16|8|780.692|479.314|61|
|densenet161|16|494.175|297.001|60||densenet161|8|698.842|464.051|66||densenet161|8|834.207|423.883|51|
|mobilenet_v2|32|206.255|157.743|76||mobilenet_v2|16|335.279|173.748|52||mobilenet_v2|16|405.541|153.694|38|
|mobilenet_v3_small|64|130.571|92.83|71||mobilenet_v3_small|32|196.173|102.561|52||mobilenet_v3_small|32|275.302|92.086|33|
|mobilenet_v3_large|64|330.269|287.3|87||mobilenet_v3_large|32|497.168|264.072|53||mobilenet_v3_large|32|642.568|226.292|35|
|resnext50_32x4d|32|490.971|336.183|68||resnext50_32x4d|16|807.178|539.026|67||resnext50_32x4d|16|1068.918|396.39|37|
|wide_resnet50_2|32|643.083|468.04|73||wide_resnet50_2|16|1023.105|677.723|66||wide_resnet50_2|16|1373.346|634.213|46|
|mnasnet1_0|32|167.934|160.254|95||mnasnet1_0|16|302.854|167.911|55||mnasnet1_0|16|383.069|126.56|33|
|efficientnet_b0|32|313.972|205.674|66||efficientnet_b0|16|515.058|241.311|47||efficientnet_b0|16|531.724|203.157|38|
|regnet_y_400mf|64|246.069|171.841|70||regnet_y_400mf|32|361.507|353.584|98||regnet_y_400mf|32|635.279|224.228|35|
|Average||||71||Average||||65||Average||||40|

# Inference


|AMD||||||Nvidia||||||Intel|||||
|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|-|
|rx6600xt|batch|OpenCL|ROCM|% Perf||gtx960|batch|OpenCL|CUDA|% Perf||A380|batch|OpenCL|XPU|% Perf|
|alexnet|64|24.543|24.642|100||alexnet|32|45.007|30.271|67||alexnet|32|55.5|25.835|47|
|resnet18|64|59.428|41.569|70||resnet18|32|94.044|66.61|71||resnet18|32|113.002|55.647|49|
|resnet50|64|196.75|165.706|84||resnet50|32|316.899|215.245|68||resnet50|32|271.778|145.842|54|
|convnext_small|64|632.215|478.088|76||convnext_small|32|881.586|751.286|85||convnext_small|32|670.291|294.405|44|
|vgg16|64|310.767|205.745|66||vgg16|32|490.68|351.488|72||vgg16|32|801.684|333.954|42|
|densenet161|64|415.707|410.906|99||densenet161|32|589.712|510.883|87||densenet161|32|685.154|315.407|46|
|mobilenet_v2|64|93.699|77.774|83||mobilenet_v2|32|162.4|87.376|54||mobilenet_v2|32|100.363|51.589|51|
|mobilenet_v3_small|64|25.653|22.253|87||mobilenet_v3_small|32|50.097|28.739|57||mobilenet_v3_small|32|36.92|26.508|72|
|mobilenet_v3_large|64|70.409|63.28|90||mobilenet_v3_large|32|122.416|69.432|57||mobilenet_v3_large|32|84.413|52.328|62|
|resnext50_32x4d|64|274.967|245.411|89||resnext50_32x4d|32|440.411|284.571|65||resnext50_32x4d|32|359.037|169.194|47|
|wide_resnet50_2|64|404.214|321.398|80||wide_resnet50_2|32|589.164|376.938|64||wide_resnet50_2|32|682.184|321.014|47|
|mnasnet1_0|64|75.027|74.211|99||mnasnet1_0|32|133.324|83.407|63||mnasnet1_0|32|91.441|51.785|57|
|efficientnet_b0|64|114.735|104.417|91||efficientnet_b0|32|203.531|111.822|55||efficientnet_b0|32|129.755|88.131|68|
|regnet_y_400mf|64|57.408|43.313|75||regnet_y_400mf|32|96.079|99.022|103||regnet_y_400mf|32|87.756|56.503|64|
|Average||||85||Average||||69||Average||||54|
14 changes: 14 additions & 0 deletions benchmarks/v0.2.0/test-A380-opencl-batch32.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
alexnet 55.500
resnet18 113.002
resnet50 271.778
convnext_small 670.291
vgg16 801.684
densenet161 685.154
mobilenet_v2 100.363
mobilenet_v3_small 36.920
mobilenet_v3_large 84.413
resnext50_32x4d 359.037
wide_resnet50_2 682.184
mnasnet1_0 91.441
efficientnet_b0 129.755
regnet_y_400mf 87.756
14 changes: 14 additions & 0 deletions benchmarks/v0.2.0/test-A380-xpu-batch32.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
alexnet 25.835
resnet18 55.647
resnet50 145.842
convnext_small 294.405
vgg16 333.954
densenet161 315.407
mobilenet_v2 51.589
mobilenet_v3_small 26.508
mobilenet_v3_large 52.328
resnext50_32x4d 169.194
wide_resnet50_2 321.014
mnasnet1_0 51.785
efficientnet_b0 88.131
regnet_y_400mf 56.503
14 changes: 14 additions & 0 deletions benchmarks/v0.2.0/test-gtx960-cuda-batch32.txt
Original file line number Diff line number Diff line change
@@ -0,0 +1,14 @@
alexnet 30.271
resnet18 66.610
resnet50 215.245
convnext_small 751.286
vgg16 351.488
densenet161 510.883
mobilenet_v2 87.376
mobilenet_v3_small 28.739
mobilenet_v3_large 69.432
resnext50_32x4d 284.571
wide_resnet50_2 376.938
mnasnet1_0 83.407
efficientnet_b0 111.822
regnet_y_400mf 99.022
Loading

0 comments on commit bdda551

Please sign in to comment.