Merge pull request #256 from wilfonba/io

File Per Process IO, performance summary in docs, new example case.
MFlowCode · Dec 14, 2023 · 371c51a
2 parents 82af415 + 61688cd
commit 371c51a
Showing 27 changed files with 2,306 additions and 220 deletions.
diff --git a/CMakeLists.txt b/CMakeLists.txt
@@ -446,15 +446,16 @@ if (MFC_SYSCHECK)
 endif()
 
 if (MFC_DOCUMENTATION)
-    # Files in docs/examples are used to generate docs/documentation/examples.md
-    file(GLOB_RECURSE examples_DOCs CONFIGURE_DEPENDS "${CMAKE_CURRENT_SOURCE_DIR}/docs/examples/*")
+    # Files in examples/ are used to generate docs/documentation/examples.md
+    file(GLOB_RECURSE examples_DOCs CONFIGURE_DEPENDS "${CMAKE_CURRENT_SOURCE_DIR}/examples/*")
 
     add_custom_command(
         OUTPUT  "${CMAKE_CURRENT_SOURCE_DIR}/docs/documentation/examples.md"
-        DEPENDS "${examples_DOCs}"
+        DEPENDS "${CMAKE_CURRENT_SOURCE_DIR}/docs/examples.sh;${examples_DOCs}"
         COMMAND "bash" "${CMAKE_CURRENT_SOURCE_DIR}/docs/examples.sh" 
                        "${CMAKE_CURRENT_SOURCE_DIR}"
         COMMENT "Generating examples.md"
+        VERBATIM
     )
 
     file(GLOB common_DOCs CONFIGURE_DEPENDS "${CMAKE_CURRENT_SOURCE_DIR}/docs/*")
@@ -486,7 +487,7 @@ if (MFC_DOCUMENTATION)
             "${CMAKE_CURRENT_BINARY_DIR}/${target}-Doxyfile" @ONLY)
 
         set(opt_example_dependency "")
-        if (target STREQUAL "documentation")
+        if (${target} STREQUAL documentation)
             set(opt_example_dependency "${CMAKE_CURRENT_SOURCE_DIR}/docs/documentation/examples.md")
         endif()
 

diff --git a/docs/documentation/case.md b/docs/documentation/case.md
@@ -344,6 +344,7 @@ Note that `time_stepper` $=$ 3 specifies the total variation diminishing (TVD),
 | `format`             | Integer | Output format. [1]: Silo-HDF5; [2] Binary	|
 | `precision`          | Integer | [1] Single; [2] Double	 |
 | `parallel_io`        | Logical | Parallel I/O	|
+| `file_per_process`   | Logical | Write one I/O file per process instead of a single shared file |
 | `cons_vars_wrt`      | Logical | Write conservative variables |
 | `prim_vars_wrt`      | Logical | Write primitive variables	|
 | `alpha_rho_wrt(i)`   | Logical | Add the partial density of the fluid $i$ to the database \|
@@ -377,7 +378,10 @@ The table lists formatted database output parameters. The parameters define vari
 With parallel I/O, MFC inputs and outputs a single file throughout pre-process, simulation, and post-process, regardless of the number of processors used.
 Parallel I/O enables the use of different number of processors in each of the processes (i.e. simulation data generated using 1000 processors can be post-processed using a single processor).
 
-- `cons_vars_wrt` and `prim_vars_wrt} activate output of conservative and primitive state variables into the database, respectively.
+- `file_per_process` deactivates shared-file MPI-IO and activates file-per-process MPI-IO. The default behaviour is to use a shared file.
+    File per process is useful when running on tens of thousands of ranks.
+
+- `cons_vars_wrt` and `prim_vars_wrt` activate output of conservative and primitive state variables into the database, respectively.
 
 - `[variable's name]_wrt` activates output of the each specified variable into the database.
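
Reviewer sketch (not part of the commit): a minimal illustration of how the new `file_per_process` flag might appear next to `parallel_io` in a Python case file. Only the two parameter names come from the documentation in this diff. The `'T'`/`'F'` encoding of logicals, the variable names `io_settings` and `case_json`, and printing the dictionary as JSON are all assumptions made for illustration.

```python
import json

# Hypothetical fragment of a case dictionary. Only the keys `parallel_io` and
# `file_per_process` come from the documentation in this diff; the 'T'/'F'
# encoding of logical values is an assumption.
io_settings = {
    "parallel_io": "T",       # use MPI-IO for input and output
    "file_per_process": "T",  # one file per rank instead of one shared file
}

# Serialize the settings as JSON (assumed output convention for a case script).
case_json = json.dumps(io_settings, indent=4)
print(case_json)
```

Leaving `file_per_process` unset (or false) would keep the default shared-file behaviour described above.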
 

diff --git a/docs/documentation/expectedPerformance.md b/docs/documentation/expectedPerformance.md
@@ -0,0 +1,78 @@
+# Performance Results
+
+MFC has been extensively benchmarked on CPUs and GPU devices.
+A summary of these results follows.
+
+## Expected time-steps/hour
+
+The following table outlines expected performance in terms of the number of time steps per hour
+(rounded to the nearest hundred) for various problem sizes (grid cells) and hardware for an inviscid, 6-equation (`'model_eqns' : 3`), 3D simulation.
+CPU results utilize an entire die.
+
+| Hardware             | # Ranks | 1M Cells       | 4M Cells       | 8M Cells     | Compiler    | Computer      |
+| ---:                 | :----:  |    :----:      |  :---:         | :---:        | :----:      | :---          |
+| NVIDIA V100          | 1       | 88.5k          | 18.7k          | N/A          | NVHPC 22.11 | PACE Phoenix  |
+| NVIDIA V100          | 1       | 78.8k          | 18.8k          | N/A          | NVHPC 22.11 | OLCF Summit   |
+| NVIDIA A100          | 1       | 114.4k         | 34.6k          | 16.5k        | NVHPC 23.5  | Wingtip       |
+| AMD MI250X           | 1       | 77.5k          | 22.3k          | 11.2k        | CCE 16.0.1  | OLCF Frontier |
+| Intel Xeon Gold 6226 | 12      | 2.5k           | 0.7k           | 0.4k         | GNU 10.3.0  | PACE Phoenix  |
+| Apple Silicon M2     | 6       | 2.8k           | 0.6k           | 0.2k         | GNU 13.2.0  | N/A           |
+
+If `'model_eqns' : 3` is replaced by `'model_eqns' : 2`, an inviscid 5-equation model is used.
+The following table outlines expected performance in terms of the number of time steps per hour (rounded to the nearest hundred) for various problem sizes and hardware for an inviscid, 5-equation,
+3D simulation.
+CPU results utilize an entire die.
+
+| Hardware             | # Ranks | 1M Cells       | 4M Cells       | 8M Cells     | Compiler    | Computer      |
+| ---:                 | :----:  |    :----:      |  :---:         | :---:        | :----:      | :---          |
+| NVIDIA V100          | 1       | 113.4k         | 26.2k          | 13.0k        | NVHPC 22.11 | PACE Phoenix  |
+| NVIDIA V100          | 1       | 107.7k         | 26.3k          | 13.1k        | NVHPC 22.11 | OLCF Summit   |
+| NVIDIA A100          | 1       | 153.5k         | 48.0k          | 22.5k        | NVHPC 23.5  | Wingtip       |
+| AMD MI250X           | 1       | 104.2k         | 31.0k          | 14.8k        | CCE 16.0.1  | OLCF Frontier |
+| Intel Xeon Gold 6226 | 12      | 5.4k           | 1.6k           | 0.8k         | GNU 10.3.0  | PACE Phoenix  |
+| Apple Silicon M2     | 6       | 3.7k           | 11.0k          | 0.3k         | GNU 13.2.0  | N/A           |
+
+## Weak scaling
+
+Weak scaling results are obtained by increasing the problem size with the number of processes so that work per process remains constant.
+
+### AMD MI250X GPU
+
+MFC weak scales to (at least) 65,536 AMD MI250X GPUs on OLCF Frontier with 96% efficiency.
+This corresponds to 87% of the entire machine.
+
+<img src="../res/weakScaling/frontier.svg" style="height: 50%; width:50%; border-radius: 10pt"/>
+
+### NVIDIA V100 GPU
+
+MFC weak scales to (at least) 13,824 NVIDIA V100 GPUs on OLCF Summit with 97% efficiency.
+This corresponds to 50% of the entire machine.
+
+<img src="../res/weakScaling/summit.svg" style="height: 50%; width:50%; border-radius: 10pt"/>
+
+### IBM Power9 CPU
+MFC weak scales to 13,824 Power9 CPU cores on OLCF Summit to within 1% of ideal scaling.
+
+<img src="../res/weakScaling/cpuScaling.svg" style="height: 50%; width:50%; border-radius: 10pt"/>
+
+## Strong scaling
+
+Strong scaling results are obtained by keeping the problem size constant and increasing the number of processes so that work per process decreases.
+
+### NVIDIA V100 GPU
+
+For these tests, the base case utilizes 8 GPUs with one MPI process per GPU.
+The performance is analyzed at two different problem sizes of 16M and 64M grid points, with the base case using 2M and 8M grid points per process.
+
+#### 16M Grid Points
+
+<img src="../res/strongScaling/strongScaling16.svg" style="width: 50%; border-radius: 10pt"/>
+
+#### 64M Grid Points
+<img src="../res/strongScaling/strongScaling64.svg" style="width: 50%; border-radius: 10pt"/>
+
+### IBM Power9 CPU
+
+CPU strong scaling tests are done with problem sizes of 16, 32, and 64M grid points, with the base case using 2, 4, and 8M cells per process.
+
+<img src="../res/strongScaling/cpuStrongScaling.svg" style="width: 50%; border-radius: 10pt"/>
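
Reviewer sketch (not part of the commit): the throughput tables and scaling definitions in `expectedPerformance.md` reduce to a few lines of arithmetic. The helper names below are hypothetical, "1M cells" is assumed to mean 10^6 cells, and the timings passed to the scaling helpers are made-up illustrations rather than measurements. Only the 114.4k steps/hour figure is taken from the table (NVIDIA A100, 1M cells, 6-equation model).

```python
def seconds_per_step(steps_per_hour: float) -> float:
    """Wall-clock seconds for one time step, given a throughput in steps/hour."""
    return 3600.0 / steps_per_hour


def ns_per_cell_step(steps_per_hour: float, cells: float) -> float:
    """Nanoseconds spent per grid cell per time step."""
    return seconds_per_step(steps_per_hour) / cells * 1e9


def weak_scaling_efficiency(t_base: float, t_scaled: float) -> float:
    """Weak scaling: work per process is constant, so the ideal run time is constant."""
    return t_base / t_scaled


def strong_scaling_efficiency(t_base: float, n_base: int,
                              t_scaled: float, n_scaled: int) -> float:
    """Strong scaling: total work is constant, so the ideal run time
    shrinks by the factor n_base / n_scaled."""
    return (t_base * n_base) / (t_scaled * n_scaled)


# Table value: NVIDIA A100, 1M cells (assumed 1e6), 114.4k time steps per hour.
a100_step = seconds_per_step(114.4e3)
a100_cell = ns_per_cell_step(114.4e3, 1e6)
print(f"{a100_step * 1e3:.1f} ms per step, {a100_cell:.1f} ns per cell per step")

# Made-up timings, for illustration only (not measurements from the PR).
weak_eff = weak_scaling_efficiency(t_base=10.0, t_scaled=10.4)
strong_eff = strong_scaling_efficiency(t_base=80.0, n_base=8,
                                       t_scaled=12.5, n_scaled=64)
print(f"weak: {weak_eff:.1%}, strong: {strong_eff:.1%}")
```

Normalizing by cell count makes rows with different problem sizes comparable, and the two efficiency helpers mirror the weak and strong scaling definitions given in the new document.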
diff --git a/docs/documentation/readme.md b/docs/documentation/readme.md
@@ -8,6 +8,7 @@
 - [Example Cases](examples.md)
 - [Running MFC](running.md)
 - [Flow Visualisation](visualisation.md)
+- [Performance Results](expectedPerformance.md)
 - [MFC's Authors](authors.md)
 - [References](references.md)