Provide benchmark with throughput units (GFlops/s TFlops/s) #26

mratsim · 2024-04-20T10:12:40Z

Hello fellow gemm optimizer enthusiast,

It would be extremely useful to provide benchmark utilities, ideally reporting GFLOP/s or TFLOP/s, so that results can be compared with other frameworks, with the CPU's theoretical peak throughput, and with Linpack.

The formula for an MxK matrix multiplied by a KxN matrix is:

total required operations: M*K*N*2 (the factor 2 counts 1 mul and 1 add)
divided by time taken
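
As a minimal sketch of the formula above, the Python helpers below turn a measured wall-clock time into GFLOP/s. The names `gemm_flops` and `gemm_gflops` and the example timing are illustrative assumptions, not part of any existing library.

```python
def gemm_flops(m: int, k: int, n: int) -> int:
    """FLOP count for an (M x K) by (K x N) product: 1 mul + 1 add per inner step."""
    return 2 * m * k * n


def gemm_gflops(m: int, k: int, n: int, seconds: float) -> float:
    """Throughput in GFLOP/s, given the measured wall-clock time in seconds."""
    if seconds <= 0.0:
        raise ValueError("seconds must be positive")
    return gemm_flops(m, k, n) / seconds / 1e9


if __name__ == "__main__":
    # Hypothetical measurement: a 1920 x 1920 x 1920 product timed at 10 ms.
    print(f"{gemm_gflops(1920, 1920, 1920, 0.010):.1f} GFLOP/s")
```

Dividing by `1e12` instead of `1e9` gives TFLOP/s. In practice the time should be the best or median of several runs, so that warm-up effects do not dominate.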

Additionally, you might want to compute the required data to derive the arithmetic intensity for the roofline model:

required data: M*K + K*N elements (the two input matrices)
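
A rough sketch of how the arithmetic intensity could be derived from these two quantities, in Python. It counts only the two input matrices, as in the formula above, and assumes 4-byte float32 elements. The function name and the `bytes_per_element` parameter are hypothetical.

```python
def gemm_arithmetic_intensity(m: int, k: int, n: int, bytes_per_element: int = 4) -> float:
    """FLOP per byte of required input data for an (M x K) by (K x N) product."""
    flops = 2 * m * k * n                               # 1 mul + 1 add per inner step
    required_bytes = (m * k + k * n) * bytes_per_element  # inputs only, as in the issue
    return flops / required_bytes


if __name__ == "__main__":
    # Square float32 case; intensity grows linearly with the matrix size.
    for size in (64, 512, 4096):
        ai = gemm_arithmetic_intensity(size, size, size)
        print(f"{size:5d}: {ai:8.1f} FLOP/byte")
```

For square matrices this reduces to N / 4 FLOP per byte with float32, which is why large GEMMs sit in the compute-bound region of the roofline.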

Finally, you might also want to compute your theoretical peak, as in: https://github.com/mratsim/weave/blob/b6255af/benchmarks/matmul_gemm_blas/gemm_bench_config.nim#L5-L18

const
  CpuGhz = 3.5      # i9-9980XE OC All turbo 4.1GHz (AVX2 4.0GHz, AVX512 3.5GHz)
  NumCpuCores = 18
  VectorWidth = 16  # 8 float32 for AVX2, 16 for AVX512
  InstrCycle = 2    # How many instructions per cycle, (2xFMAs or 1xFMA for example)
  FlopInstr = 2     # How many FLOP per instr (FMAs = 1 add + 1 mul)

  TheoSerialPeak* = CpuGhz * VectorWidth * InstrCycle * FlopInstr
  TheoThreadedPeak* = TheoSerialPeak * NumCpuCores
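
For readers who don't use Nim, here is a sketch of the same peak calculation in Python. The constants are copied from the snippet above and are specific to that overclocked i9-9980XE. The `efficiency` helper is a hypothetical addition that expresses a measured result as a fraction of the peak.

```python
CPU_GHZ = 3.5        # AVX512 all-core clock from the snippet above
NUM_CPU_CORES = 18
VECTOR_WIDTH = 16    # float32 lanes: 8 for AVX2, 16 for AVX512
INSTR_PER_CYCLE = 2  # FMA instructions issued per cycle
FLOP_PER_INSTR = 2   # one FMA = 1 add + 1 mul

# GHz * lanes * instr/cycle * FLOP/instr gives GFLOP/s per core.
THEO_SERIAL_PEAK = CPU_GHZ * VECTOR_WIDTH * INSTR_PER_CYCLE * FLOP_PER_INSTR
THEO_THREADED_PEAK = THEO_SERIAL_PEAK * NUM_CPU_CORES


def efficiency(measured_gflops: float, peak_gflops: float) -> float:
    """Fraction of the theoretical peak reached by a measured throughput."""
    return measured_gflops / peak_gflops


if __name__ == "__main__":
    print(f"serial peak:   {THEO_SERIAL_PEAK:.1f} GFLOP/s")
    print(f"threaded peak: {THEO_THREADED_PEAK:.1f} GFLOP/s")
```

With these constants the single-core peak is 224 GFLOP/s and the 18-core peak is 4032 GFLOP/s. The values need to be changed for any other CPU.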

FYI, you might be interested in my own research on cache utilization tuning. Skimming your code, I see that you tuned at the cache-associativity level, while I used some heuristics:

Use optimal kernel parameters (architectures, matrix layouts) bluss/matrixmultiply#34 (comment)

Benchmarks of my own implementation with OpenMP, alongside OpenBLAS/MKL and MKL-DNN (the latest oneDNN was too entangled to extract the relevant GEMM primitives):

https://github.com/mratsim/laser
https://github.com/mratsim/laser/blob/d310294/benchmarks/gemm/gemm_bench_float32.nim#L374
Nim must be installed, along with OpenBLAS or MKL; then run the following (the submodule will download MKL-DNN):
```
git clone https://github.com/mratsim/laser
cd laser
git submodule update --init
nim cpp -r -d:danger -d:openmp --outdir:build benchmarks/gemm/gemm_bench_float32.nim
```

Benchmarks with my own multithreading runtime (instead of OpenMP):

https://github.com/mratsim/weave
https://github.com/mratsim/weave/blob/b6255af/benchmarks/matmul_gemm_blas/all_gemm.nim
Nim must be installed, along with OpenBLAS or MKL; then run the following (the submodule will download MKL-DNN):
```
git clone https://github.com/mratsim/weave
cd weave
nim c -r -d:danger --threads:on --outdir:build benchmarks/matmul_gemm_blas/all_gemm.nim
```
If using Intel MKL, the library path can be customized here: https://github.com/mratsim/weave/blob/b6255af/benchmarks/matmul_gemm_blas/all_gemm.nim

sarah-quinones · 2024-04-21T05:08:08Z

thanks for the suggestion. I'll set up something for that soon
