Thank you for writing such an awesome library, I think your contribution to the world of open source is really great.
Unfortunately, I have noticed that the performance of the CLBlast GEMM really isn't much better than the multiplication on my CPU using standard Eigen: it is perhaps a factor of 2 or 3 faster, and I had expected a much larger speed-up. I ran all-tuners, updated the optimisation results, and recompiled as described on the optimisations page. I am running on an AMD RX Vega 64 GPU, the same device as in the optimisation results I recently uploaded. Do I need to enable some sort of extra flag for the AMD architecture when running the tuners or compiling?
Any help would be appreciated. I would really rather stick with CLBlast and pass this on to the users of my library.
Indeed, you don't need to do anything special to use the tuning results, except for making sure you use the latest version and have recompiled the library of course.
About the speed issue: this can depend on a lot of factors. Here are some steps to follow to get a bit more insight into your issue (I went through them myself with your data):
1. Typically it is good to compare the peak performance of the device against what you get with CLBlast. You won't get 100%, but something above 50% should be attainable. According to Wikipedia, your Vega 64 should reach around 10,000 GFLOPS peak, assuming we are talking about single precision (SGEMM).
2. Now that we know that, let's look at what you got when running the tuners. The easiest way is to look at your logs (or re-run the tuner), because the xgemm tuner reports the number in GFLOPS when run in 32-bit precision. Alternatively, the final database JSON results show the same information, but measured in execution time. From your data I see that 0.43 ms was the best you got for 1024x1024x1024. A GEMM of that size performs 2 * 1024^3 (about 2.15 * 10^9) floating-point operations, which translates to around 5000 GFLOPS if my calculations are correct. That is about a factor of 2 below what we should hope to get in theory, but not that bad. There seems to be something special about the Vega architecture that CLBlast doesn't optimize for. Other reports here have shown that it is not easy to get good performance on it, so we shouldn't expect much beyond that number.
3. If the numbers in point 2 are good but your final benchmark isn't, then something else could be wrong, e.g. your matrix sizes are too small or some other overhead in CLBlast starts to become an issue. Or you are not measuring correctly. So how are you measuring this? With the CLBlast 'client' software that does this measurement for you, or with your own measurement? If it is the latter, can you try compiling the clients and running something like ./clblast_client_xgemm -n 1024 -m 1024 -k 1024?
4. Other than that, the most useful piece of information here would be whether you can run a 1024x1024x1024 SGEMM with other software on your Vega GPU and report what you get, for example with AMD's ROCm BLAS. Then we can tell whether it is a CLBlast issue.
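For anyone who wants to redo the GFLOPS estimate above with their own tuner results, here is a minimal sketch of the arithmetic, assuming the usual 2*m*n*k operation count for GEMM. The function names are made up for illustration and are not part of CLBlast. The `best_time` helper is a generic way to time your own GEMM call (a few warm-up runs, then the best of several repeats); it assumes the callable you pass in blocks until the GPU has actually finished, otherwise you only time the enqueue.

```python
import time


def gemm_gflops(m, n, k, seconds):
    """GFLOPS of an m x n x k GEMM, counting 2*m*n*k floating-point operations."""
    return 2.0 * m * n * k / seconds / 1e9


def fraction_of_peak(gflops, peak_gflops):
    """Achieved performance as a fraction of the device's theoretical peak."""
    return gflops / peak_gflops


def best_time(run, warmup=2, repeats=10):
    """Best wall-clock time of run() in seconds, after some warm-up calls.

    Assumption: run() only returns once the work is done (for GPU code,
    that means it waits for the kernel to complete).
    """
    for _ in range(warmup):
        run()
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        run()
        best = min(best, time.perf_counter() - start)
    return best


# Numbers taken from the discussion above: 0.43 ms for 1024x1024x1024,
# against an approximate single-precision peak of 10,000 GFLOPS.
tuner_gflops = gemm_gflops(1024, 1024, 1024, 0.43e-3)
peak_gflops = 10_000.0
print(f"{tuner_gflops:.0f} GFLOPS, "
      f"{100 * fraction_of_peak(tuner_gflops, peak_gflops):.0f}% of peak")
```

To check your own measurement, wrap your GEMM call in a function, pass it to `best_time`, and feed the result into `gemm_gflops`. If that number is far below what the tuner reported, the difference is likely in the measurement or in overheads around the call rather than in the kernel itself.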