Commit

Merge pull request #256 from wilfonba/io
File Per Process IO, performance summary in docs, new example case.
henryleberre authored Dec 14, 2023
2 parents 82af415 + 61688cd commit 371c51a
Showing 27 changed files with 2,306 additions and 220 deletions.
9 changes: 5 additions & 4 deletions CMakeLists.txt
Original file line number Diff line number Diff line change
@@ -446,15 +446,16 @@ if (MFC_SYSCHECK)
endif()

if (MFC_DOCUMENTATION)
# Files in docs/examples are used to generate docs/documentation/examples.md
file(GLOB_RECURSE examples_DOCs CONFIGURE_DEPENDS "${CMAKE_CURRENT_SOURCE_DIR}/docs/examples/*")
# Files in examples/ are used to generate docs/documentation/examples.md
file(GLOB_RECURSE examples_DOCs CONFIGURE_DEPENDS "${CMAKE_CURRENT_SOURCE_DIR}/examples/*")

add_custom_command(
OUTPUT "${CMAKE_CURRENT_SOURCE_DIR}/docs/documentation/examples.md"
DEPENDS "${examples_DOCs}"
DEPENDS "${CMAKE_CURRENT_SOURCE_DIR}/docs/examples.sh;${examples_DOCs}"
COMMAND "bash" "${CMAKE_CURRENT_SOURCE_DIR}/docs/examples.sh"
"${CMAKE_CURRENT_SOURCE_DIR}"
COMMENT "Generating examples.md"
VERBATIM
)

file(GLOB common_DOCs CONFIGURE_DEPENDS "${CMAKE_CURRENT_SOURCE_DIR}/docs/*")
@@ -486,7 +487,7 @@ if (MFC_DOCUMENTATION)
"${CMAKE_CURRENT_BINARY_DIR}/${target}-Doxyfile" @ONLY)

set(opt_example_dependency "")
if (target STREQUAL "documentation")
if (${target} STREQUAL documentation)
set(opt_example_dependency "${CMAKE_CURRENT_SOURCE_DIR}/docs/documentation/examples.md")
endif()

6 changes: 5 additions & 1 deletion docs/documentation/case.md
@@ -344,6 +344,7 @@ Note that `time_stepper` $=$ 3 specifies the total variation diminishing (TVD),
| `format` | Integer | Output format. [1]: Silo-HDF5; [2] Binary |
| `precision` | Integer | [1] Single; [2] Double |
| `parallel_io` | Logical | Parallel I/O |
| `file_per_process` | Logical | Whether or not to write one IO file per process |
| `cons_vars_wrt` | Logical | Write conservative variables |
| `prim_vars_wrt` | Logical | Write primitive variables |
| `alpha_rho_wrt(i)` | Logical | Add the partial density of the fluid $i$ to the database |
@@ -377,7 +378,10 @@ The table lists formatted database output parameters. The parameters define vari
With parallel I/O, MFC inputs and outputs a single file throughout pre-process, simulation, and post-process, regardless of the number of processors used.
Parallel I/O enables the use of a different number of processors in each of these stages (e.g., simulation data generated using 1000 processors can be post-processed using a single processor).

- `cons_vars_wrt` and `prim_vars_wrt} activate output of conservative and primitive state variables into the database, respectively.
- `file_per_process` deactivates shared-file MPI-IO and activates file-per-process MPI-IO. The default behaviour is to use a shared file.
File per process is useful when running on tens of thousands of ranks.

- `cons_vars_wrt` and `prim_vars_wrt` activate output of conservative and primitive state variables into the database, respectively.

- `[variable's name]_wrt` activates output of each specified variable into the database.
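
As a minimal sketch of how the I/O parameters above might appear in a case file, the fragment below sets only the output-related keys. It assumes that case files are Python scripts that print a JSON dictionary and that logical parameters are written as `'T'`/`'F'` strings, as in the existing example cases; it also assumes `file_per_process` is meant to be used together with `parallel_io`. A real case needs many more parameters than shown here.

```python
import json

# Hypothetical fragment of a case file: only I/O-related keys are shown.
# Logical values are assumed to be the strings 'T' (true) and 'F' (false).
io_params = {
    "parallel_io": "T",       # use MPI-IO for input and output
    "file_per_process": "T",  # one file per rank instead of a single shared file
    "prim_vars_wrt": "T",     # write primitive variables to the database
}

# Case files are assumed to emit their parameters as JSON on standard output.
print(json.dumps(io_params))
```

Leaving `file_per_process` unset keeps the default shared-file behaviour described above.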

78 changes: 78 additions & 0 deletions docs/documentation/expectedPerformance.md
@@ -0,0 +1,78 @@
# Performance Results

MFC has been extensively benchmarked on CPUs and GPU devices.
A summary of these results follows.

## Expected time-steps/hour

The following table outlines expected performance in terms of the number of time-steps per hour
(rounded to the nearest hundred) for various problem sizes (grid cells) and hardware for an inviscid, 6-equation (`'model_eqns' : 3`), 3D simulation.
CPU results utilize an entire die.

| Hardware | # Ranks | 1M Cells | 4M Cells | 8M Cells | Compiler | Computer |
| ---: | :----: | :----: | :---: | :---: | :----: | :--- |
| NVIDIA V100 | 1 | 88.5k | 18.7k | N/A | NVHPC 22.11 | PACE Phoenix |
| NVIDIA V100 | 1 | 78.8k | 18.8k | N/A | NVHPC 22.11 | OLCF Summit |
| NVIDIA A100 | 1 | 114.4k | 34.6k | 16.5k | NVHPC 23.5 | Wingtip |
| AMD MI250X | 1 | 77.5k | 22.3k | 11.2k | CCE 16.0.1 | OLCF Frontier |
| Intel Xeon Gold 6226 | 12 | 2.5k | 0.7k | 0.4k | GNU 10.3.0 | PACE Phoenix |
| Apple Silicon M2 | 6 | 2.8k | 0.6k | 0.2k | GNU 13.2.0 | N/A |
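
The throughput figures in the table can be turned into rough wall-time estimates. The sketch below is only an illustration: the helper names are made up for this example, and the estimate assumes the tabulated rate holds for the whole run, which ignores I/O and start-up costs.

```python
def seconds_per_step(steps_per_hour: float) -> float:
    """Average wall time of one time-step, in seconds."""
    return 3600.0 / steps_per_hour


def hours_for_run(n_steps: int, steps_per_hour: float) -> float:
    """Estimated wall time, in hours, to advance n_steps time-steps."""
    return n_steps / steps_per_hour


# Rate taken from the table: NVIDIA V100 on PACE Phoenix, 1M cells.
rate = 88.5e3

print(f"{seconds_per_step(rate) * 1e3:.1f} ms per time-step")
print(f"{hours_for_run(100_000, rate):.2f} h for 100,000 time-steps")
```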

If `'model_eqns' : 3` is replaced by `'model_eqns' : 2`, an inviscid 5-equation model is used.
The following table outlines expected performance in terms of the number of time-steps per hour (rounded to the nearest hundred) for various problem sizes and hardware for an inviscid, 5-equation,
3D simulation.
CPU results utilize an entire die.

| Hardware | # Ranks | 1M Cells | 4M Cells | 8M Cells | Compiler | Computer |
| ---: | :----: | :----: | :---: | :---: | :----: | :--- |
| NVIDIA V100 | 1 | 113.4k | 26.2k | 13.0k | NVHPC 22.11 | PACE Phoenix |
| NVIDIA V100 | 1 | 107.7k | 26.3k | 13.1k | NVHPC 22.11 | OLCF Summit |
| NVIDIA A100 | 1 | 153.5k | 48.0k | 22.5k | NVHPC 23.5 | Wingtip |
| AMD MI250X | 1 | 104.2k | 31.0k | 14.8k | CCE 16.0.1 | OLCF Frontier |
| Intel Xeon Gold 6226 | 12 | 5.4k | 1.6k | 0.8k | GNU 10.3.0 | PACE Phoenix |
| Apple Silicon M2 | 6 | 3.7k | 11.0k | 0.3k | GNU 13.2.0 | N/A |

## Weak scaling

Weak scaling results are obtained by increasing the problem size with the number of processes so that work per process remains constant.
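
Weak scaling efficiency is commonly reported as the wall time of the base case divided by the wall time at a larger process count, since ideal weak scaling keeps the wall time constant. The sketch below assumes that definition; the timings are placeholders for illustration, not measured MFC data.

```python
def weak_scaling_efficiency(t_base: float, t_n: float) -> float:
    """Ratio of base-case wall time to wall time at N processes (1.0 is ideal)."""
    return t_base / t_n


# Hypothetical wall times in seconds, keyed by process count, with the
# work per process held constant. These are NOT measured results.
timings = {8: 100.0, 64: 101.0, 512: 103.0}

base_time = timings[min(timings)]
for n_procs, t in sorted(timings.items()):
    eff = weak_scaling_efficiency(base_time, t)
    print(f"{n_procs:>5} processes: {eff:.1%} efficiency")
```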

### AMD MI250X GPU

MFC weak scales to (at least) 65,536 AMD MI250X GPUs on OLCF Frontier with 96% efficiency.
This corresponds to 87% of the entire machine.

<img src="../res/weakScaling/frontier.svg" style="height: 50%; width:50%; border-radius: 10pt"/>

### NVIDIA V100 GPU

MFC weak scales to (at least) 13,824 NVIDIA V100 GPUs on OLCF Summit with 97% efficiency.
This corresponds to 50% of the entire machine.

<img src="../res/weakScaling/summit.svg" style="height: 50%; width:50%; border-radius: 10pt"/>

### IBM Power9 CPU
MFC weak scales to 13,824 Power9 CPU cores on OLCF Summit to within 1% of ideal scaling.

<img src="../res/weakScaling/cpuScaling.svg" style="height: 50%; width:50%; border-radius: 10pt"/>

## Strong scaling

Strong scaling results are obtained by keeping the problem size constant and increasing the number of processes so that work per process decreases.
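
For strong scaling, the usual measures are speedup relative to the base case and parallel efficiency, the speedup divided by the factor by which the process count grew. The sketch below assumes those definitions; the function name and the timings are illustrative only, not measured MFC data.

```python
def strong_scaling(t_base: float, n_base: int, t_n: float, n: int) -> tuple[float, float]:
    """Return (speedup, efficiency) of a run on n processes relative to n_base."""
    speedup = t_base / t_n
    ideal_speedup = n / n_base
    return speedup, speedup / ideal_speedup


# Hypothetical wall times in seconds for a fixed problem size.
# These are NOT measured results.
speedup, efficiency = strong_scaling(t_base=80.0, n_base=8, t_n=12.5, n=64)
print(f"speedup {speedup:.2f}x, efficiency {efficiency:.0%}")
```

An efficiency of 1.0 means the wall time dropped in exact proportion to the added processes.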

### NVIDIA V100 GPU

For these tests, the base case utilizes 8 GPUs with one MPI process per GPU.
Performance is analyzed at two problem sizes, 16M and 64M grid points, with the base case using 2M and 8M grid points per process, respectively.

#### 16M Grid Points

<img src="../res/strongScaling/strongScaling16.svg" style="width: 50%; border-radius: 10pt"/>

#### 64M Grid Points
<img src="../res/strongScaling/strongScaling64.svg" style="width: 50%; border-radius: 10pt"/>

### IBM Power9 CPU

CPU strong scaling tests are done with problem sizes of 16M, 32M, and 64M grid points, with the base case using 2M, 4M, and 8M cells per process, respectively.

<img src="../res/strongScaling/cpuStrongScaling.svg" style="width: 50%; border-radius: 10pt"/>
1 change: 1 addition & 0 deletions docs/documentation/readme.md
@@ -8,6 +8,7 @@
- [Example Cases](examples.md)
- [Running MFC](running.md)
- [Flow Visualisation](visualisation.md)
- [Performance Results](expectedPerformance.md)
- [MFC's Authors](authors.md)
- [References](references.md)
