From a46dbfd721ef64547f51fbb551295039e6407975 Mon Sep 17 00:00:00 2001
From: Neil Shephard
Date: Thu, 14 Sep 2023 16:03:09 +0100
Subject: [PATCH] Add excerpt_separator to GPU Benchmarking post

Closes #705
---
 ...-18-benchmarking-flamegpu2-on-h100-a100-and-v100-gpus.md | 6 ++++--
 1 file changed, 4 insertions(+), 2 deletions(-)

diff --git a/_posts/2023-08-18-benchmarking-flamegpu2-on-h100-a100-and-v100-gpus.md b/_posts/2023-08-18-benchmarking-flamegpu2-on-h100-a100-and-v100-gpus.md
index be6f797b..0c561fee 100644
--- a/_posts/2023-08-18-benchmarking-flamegpu2-on-h100-a100-and-v100-gpus.md
+++ b/_posts/2023-08-18-benchmarking-flamegpu2-on-h100-a100-and-v100-gpus.md
@@ -8,7 +8,7 @@ tags: GPU FLAMEGPU benchmarking
 category:
 link:
 description:
-social_image: 
+social_image:
 type: text
 excerpt_separator:
 ---
@@ -40,6 +40,8 @@ The existing A100 nodes each contain 4 GPUs which are directly connected to one
 The NVLink interconnect offers higher memory bandwidth for GPU to GPU communication, which combined with twice as many GPUs per node may lead to shorter application run-times than offered by the H100 nodes.
 If even more GPUs are required moving to the Tier 2 systems may be required, with Jade 2 offering up to 8 GPUs per Job, and Bede being the only current option for multi-node GPU jobs, with up to 128 GPUs per job.

+
+
 Within Stanage, software may need recompiling to run on the H100 nodes, or new versions of libraries may be required. For more information see the [HPC Documentation][stanage-using-gpus].

 Carl Kennedy and Nicholas Musembi of the Research and Innovation Team in IT Services have [benchmarked these new GPUs using popular machine learning frameworks][h100-rcg-ml-benchmark], however not all HPC workloads exhibit the same performance characteristics as machine learning.
@@ -144,7 +146,7 @@ When using Run-time compilation, performance improves significantly. This is in
 Using the much more work efficient Spatial 3D communication strategy, simulation run-times are significantly quicker than any of the brute-force benchmarks, with the largest simulations taking at most `0.944`s rather than `1457`s.
 On average, each agent is only reading `204.5` messages, rather than all `1000000` messages each agent must read in the bruteforce case.
 This greatly reduces the number of global memory reads performed and subsequently the impact of RTC is diminished although still significant.
-As the initial density of the simulations and communication radius are maintained as the population is scaled, the average number of relevant messages is roughly comparable at each scale, resulting in a more linear relationship between simulation time and population size. 
+As the initial density of the simulations and communication radius are maintained as the population is scaled, the average number of relevant messages is roughly comparable at each scale, resulting in a more linear relationship between simulation time and population size.

 ![Figure 4: Circles Spatial3D - Mean Simulation Time (s) against Population Size](/assets/images/2023-08-18-benchmarking-flamegpu2-on-h100-a100-and-v100-gpus/plot-h100-a100-v100-cuda-118-fixed-density-circles_spatial3D.png)
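
For context on the Spatial 3D messaging the patched post discusses, here is a minimal sketch of the pattern in FLAME GPU 2's C++ API. This is not the benchmark's actual source: the function and variable names (`read_neighbours`, `neighbour_count`) and the hard-coded `RADIUS` are hypothetical, but the `FLAMEGPU_AGENT_FUNCTION` macro and `message_in(x, y, z)` spatial iteration are the library's documented interface. Spatial binning means each agent only iterates messages from nearby bins and then distance-filters them, which is why the post reports roughly `204.5` message reads per agent instead of all `1000000` in the brute-force case.

```cpp
#include "flamegpu/flamegpu.h"

// Hypothetical interaction radius; in the real Circles benchmark this is a
// model parameter rather than a compile-time constant.
constexpr float RADIUS = 2.0f;

// Circles-style message-read sketch using Spatial3D messaging.
// message_in(x, y, z) yields only messages from the agent's own spatial bin
// and its neighbouring bins, so far fewer than N messages are read.
FLAMEGPU_AGENT_FUNCTION(read_neighbours, flamegpu::MessageSpatial3D, flamegpu::MessageNone) {
    const float x = FLAMEGPU->getVariable<float>("x");
    const float y = FLAMEGPU->getVariable<float>("y");
    const float z = FLAMEGPU->getVariable<float>("z");
    int neighbours = 0;
    for (const auto &message : FLAMEGPU->message_in(x, y, z)) {
        const float dx = message.getVariable<float>("x") - x;
        const float dy = message.getVariable<float>("y") - y;
        const float dz = message.getVariable<float>("z") - z;
        // Bin lookup is coarse; a distance check keeps only messages
        // actually within the communication radius.
        if (dx * dx + dy * dy + dz * dz < RADIUS * RADIUS)
            ++neighbours;
    }
    FLAMEGPU->setVariable<int>("neighbour_count", neighbours);
    return flamegpu::ALIVE;
}
```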