This project demonstrates the feasibility of migrating legacy PETSc-based applications to modern supercomputers, which are often heterogeneous platforms, with only minor modifications to PETSc's source code.
PETSc (Portable, Extensible Toolkit for Scientific Computation) is an MPI-based parallel linear algebra library. It has been used to build many scientific codes in the HPC (high-performance computing) area for over two decades. While PETSc provides excellent performance on CPU machines, it still lacks satisfactory GPU support. GPUs play an increasingly important role in modern supercomputers, and because of PETSc's lagging GPU support, PETSc-based applications may need to find other ways to move forward on hybrid accelerated systems.
This project demonstrates that it is not difficult for PETSc users to enable GPU capability: minor modifications to PETSc's source code are enough, and directive-based programming models, such as OpenACC, are well suited to this kind of minor coding work.
The speedup may not be impressive with this approach because we avoid redesigning numerical methods and parallel algorithms. The sequential kernels called by each MPI process in PETSc were originally designed for a single CPU core, so naively inserting OpenACC directives into the source code may not hide data-transfer latency efficiently, and some kernels are difficult to parallelize without redesigning their algorithms.
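To make this concrete, the sketch below shows what a naive directive insertion looks like. The kernel is a hypothetical AXPY-like loop, not copied from PETSc; the explicit `copyin`/`copy` clauses move the arrays between host and device on every call, which illustrates the kind of transfer latency that is hard to hide without restructuring the surrounding code.

```c
/* Hypothetical sequential kernel resembling the per-rank loops PETSc runs
 * on one CPU core; NOT taken from PETSc's source code. */
void axpy_seq(int n, double a, const double *x, double *y)
{
    for (int i = 0; i < n; i++) y[i] += a * x[i];
}

/* Naive OpenACC version: a single directive offloads the loop, but the data
 * clauses trigger host<->device transfers on every call, so transfer latency
 * is not overlapped with computation. */
void axpy_acc(int n, double a, const double *x, double *y)
{
    #pragma acc parallel loop copyin(x[0:n]) copy(y[0:n])
    for (int i = 0; i < n; i++) y[i] += a * x[i];
}
```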
Nevertheless, small speedups can still be useful to codes running on some supercomputers, such as Titan and Summit, which only provide hybrid (CPU + GPU) nodes. For PETSc applications running on those machines, trading minor code modifications for GPU capability and a modest speedup may be acceptable. It's all about the balance between coding effort and computational performance.
- Target problem: a 3D Poisson problem, which represents a common bottleneck in many CFD (computational fluid dynamics) codes.
- The KSP linear solver will be CG (conjugate gradient method) + GAMG (algebraic multigrid preconditioner); see the sketch after this list.
- Target platform: Titan
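For reference, a minimal sketch of selecting this solver combination through the standard PETSc KSP/PC interface is shown below. It assumes the matrix `A` and vectors `b`, `x` for the discretized Poisson system have already been assembled; the snippet is illustrative and not taken from the benchmark code in this repo.

```c
#include <petscksp.h>

/* Illustrative sketch: solve A x = b with CG + GAMG.
 * Assumes A, b, and x are already created and assembled. */
PetscErrorCode solve_poisson(Mat A, Vec b, Vec x)
{
  KSP            ksp;
  PC             pc;
  PetscErrorCode ierr;

  ierr = KSPCreate(PETSC_COMM_WORLD, &ksp); CHKERRQ(ierr);
  ierr = KSPSetOperators(ksp, A, A); CHKERRQ(ierr);
  ierr = KSPSetType(ksp, KSPCG); CHKERRQ(ierr);    /* conjugate gradient */
  ierr = KSPGetPC(ksp, &pc); CHKERRQ(ierr);
  ierr = PCSetType(pc, PCGAMG); CHKERRQ(ierr);     /* algebraic multigrid */
  ierr = KSPSetFromOptions(ksp); CHKERRQ(ierr);    /* allow runtime overrides,
                                                      e.g. -ksp_type cg -pc_type gamg */
  ierr = KSPSolve(ksp, b, x); CHKERRQ(ierr);
  ierr = KSPDestroy(&ksp); CHKERRQ(ierr);
  return 0;
}
```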
In order to avoid potential license issues, all code snippets from PETSc are left out of this repository. Instead, patch files are used to create the OpenACC kernels by patching the original PETSc source code. When users run `make build-petsc` to download and build PETSc, the command automatically extracts the necessary PETSc kernel functions into the directory `src/original` and then patches them to create the OpenACC kernels, which are placed in `src/openacc-step[1-4]`.
The folder `runs` contains PBS scripts for running tests/benchmarks on Titan. Users can submit these PBS jobs through `make` or `qsub` directly, but jobs must be submitted from the top-level directory of this repo because relative paths are used. See the usage below.
At the top-level directory:

- `source ./scripts/set_up_environment.sh`: set up the environment on Titan
- `make help`: see help
- `make list`: list all targets
- `make list-executables`: list all targets for building executables
- `make list-runs`: list all targets for submitting PBS jobs
- `make build-petsc`: build the PETSc library, extract the necessary PETSc kernels to `src/original`, and then create the OpenACC kernels in `src/openacc-step[1-4]`
- `make all`: build all executables
- `make <executable>`: build an individual executable
- `make PROJ=<chargeable project> PROJFOLDER=<usable folder under $MEMBERWORK> <run>`: submit a run shown in `make list-runs` using the allocation of `<chargeable project>`; alternatively, use `qsub -A <chargeable project> -v PROJFOLDER=<usable folder under $MEMBERWORK>,EXEC=<executable> runs/<PBS script>.pbs` directly. `PROJFOLDER` will be used as a temporary working directory.
- `make clean-build`: clean executables and object files
- `make clean-petsc`: clean the PETSc build
- `make clean-all`: clean everything
- `make create-plots`: create plots of strong scaling and speedups from the result files under the folder `runs`. Some results must exist (e.g., by running some PBS jobs) prior to calling this target.
Results are from a 300x300x300 Poisson problem. Single-node tests used 1, 2, 4, 8, and 16 CPU cores and 1 K20x GPU on one Titan node. Multi-node tests used 1, 2, 4, 8, 16, 32, and 64 Titan nodes (16 CPU cores + 1 K20x GPU per node).
Use GitHub issues or email: [email protected]