Solving the 3D heat equation with a finite difference method. The computational domain is a cube with the size of $L_x \times L_y \times L_z$. The animation shows the simulation results.
At the end of the simulation, the numerical results are compared with the analytical solution as a sanity check.
The `L2_norm` of the difference between the numerical and analytical solutions
is printed to the standard output.
We apply periodic boundary conditions in each direction:

$$
u(x + L_x, y, z, t) = u(x, y + L_y, z, t) = u(x, y, z + L_z, t) = u(x, y, z, t).
$$

In order to compare with the analytical solution, we apply the following initial condition:

$$
u(x, y, z, 0) = u_{\mathrm{max}} \cos\left( \frac{2 \pi x}{L_x} + \frac{2 \pi y}{L_y} + \frac{2 \pi z}{L_z} \right).
$$

With the boundary and initial conditions given above, the analytical solution of $\partial_t u = \kappa \nabla^2 u$ is

$$
u(x, y, z, t) = u_{\mathrm{max}} \exp\left[ -\kappa \left( \left(\frac{2 \pi}{L_x}\right)^2 + \left(\frac{2 \pi}{L_y}\right)^2 + \left(\frac{2 \pi}{L_z}\right)^2 \right) t \right] \cos\left( \frac{2 \pi x}{L_x} + \frac{2 \pi y}{L_y} + \frac{2 \pi z}{L_z} \right).
$$

The numerical solution is checked against this solution.
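For illustration, a minimal sketch of evaluating this analytical solution at a point $(x, y, z)$ and time $t$ could look like the following (the function name and argument list are hypothetical, not the repository's API):

```cpp
#include <cmath>

// Hypothetical helper (not the repository's code): evaluate the analytical
// solution above at position (x, y, z) and time t.
double analytical_solution(double x, double y, double z, double t,
                           double umax, double kappa,
                           double Lx, double Ly, double Lz) {
  const double kx = 2.0 * M_PI / Lx;
  const double ky = 2.0 * M_PI / Ly;
  const double kz = 2.0 * M_PI / Lz;
  // A single Fourier mode decays as exp(-kappa * |k|^2 * t) under the heat equation.
  return umax * std::exp(-kappa * (kx * kx + ky * ky + kz * kz) * t)
              * std::cos(kx * x + ky * y + kz * z);
}
```

The initial cosine mode keeps its shape and simply decays at the rate $\kappa \left( k_x^2 + k_y^2 + k_z^2 \right)$, which is what makes the scalar `L2_norm` comparison a convenient sanity check.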
The heat3d application is parallelized with stdpar, OpenMP, OpenACC, OpenMP4.5, Kokkos, thrust, CUDA and HIP.
Since our main focus is stdpar, we briefly describe the parallel kernels and data structures.
To represent the 3D variable `u`, we use the 3D data structure `View`, which consists of a `std::vector` and a `stdex::mdspan` (see the implementation for details).
The parallel computations are performed with `std::for_each_n` and `std::transform_reduce`.
We define `u` in the following manner:
u = RealView3D("u", std::array<size_t, 3>{nx_halo, ny_halo, nz_halo}, std::array<int, 3>{-1, -1, -1});
where the second and third arguments are the extents and the starting indices of this view, respectively.
For both CPUs and GPUs, we employ the Fortran-style data layout (`stdex::layout_left`).
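A minimal sketch of such a container is shown below, assuming a recent reference `mdspan` implementation (`stdex::dextents<std::size_t, 3>`, `operator()` indexing). It is not the repository's actual `RealView3D`; in particular, the handling of the negative starting indices is simplified here.

```cpp
#include <array>
#include <cstddef>
#include <string>
#include <vector>
#include <experimental/mdspan>

namespace stdex = std::experimental;
using real_type = double;

// Minimal sketch: a std::vector owns the memory (unified memory when compiled
// with nvc++ -stdpar=gpu) and an mdspan with layout_left (Fortran order)
// provides multi-dimensional access to it.
struct View3D_sketch {
  using extents_type = stdex::dextents<std::size_t, 3>;
  using mdspan_type  = stdex::mdspan<real_type, extents_type, stdex::layout_left>;

  std::string name_;
  std::vector<real_type> data_;  // owning storage, including halo cells
  std::array<int, 3> offsets_;   // starting indices, e.g. {-1, -1, -1}
  mdspan_type view_;             // non-owning view handed to device kernels

  View3D_sketch(const std::string& name,
                std::array<std::size_t, 3> extents,
                std::array<int, 3> offsets = {0, 0, 0})
    : name_(name),
      data_(extents[0] * extents[1] * extents[2]),
      offsets_(offsets),
      view_(data_.data(), extents[0], extents[1], extents[2]) {}

  // Host-side access: logical indices (possibly negative for halos) are
  // shifted by the starting indices before indexing the mdspan.
  real_type& operator()(int ix, int iy, int iz) {
    return view_(ix - offsets_[0], iy - offsets_[1], iz - offsets_[2]);
  }

  // Handle captured by value inside functors/lambdas for the parallel kernels.
  mdspan_type mdspan() const { return view_; }
};
```

Note that in the repository the view returned by `mdspan()` can itself be indexed with the shifted indices (the stencil functor below writes `u_(ix-1, iy, iz)` even for `ix == 0`), which this sketch does not reproduce.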
In the host code, the data `u` can be accessed through its `operator()` as follows:
// Initialize the coordinates and the temperature field u on the host
for(int iz=0; iz<nz; iz++) {
  for(int iy=0; iy<ny; iy++) {
    for(int ix=0; ix<nx; ix++) {
      const real_type xtmp = static_cast<real_type>(ix - nx/2) * dx;
      const real_type ytmp = static_cast<real_type>(iy - ny/2) * dy;
      const real_type ztmp = static_cast<real_type>(iz - nz/2) * dz;
      x(ix) = xtmp;
      y(iy) = ytmp;
      z(iz) = ztmp;

      // Initial condition: a single cosine mode over the periodic domain
      u(ix, iy, iz) = conf.umax
        * cos(xtmp / Lx * 2.0 * M_PI + ytmp / Ly * 2.0 * M_PI + ztmp / Lz * 2.0 * M_PI);
    }
  }
}
In the accelerated region, the data must be accessed through the `mdspan`, which can be obtained with the `mdspan()` method.
Since multi-dimensional parallel operations are not fully supported before `std::views::cartesian_product` in C++23, we have wrapped `std::for_each_n` and `std::transform_reduce` manually.
The 3D heat equation update can then be performed as
Iterate_policy<3> policy3d({0, 0, 0}, {nx, ny, nz});
Impl::for_each(policy3d, heat_functor(conf, u, un));
with the `heat_functor` defined by
struct heat_functor {
  using mdspan3d_type = RealView3D::mdspan_type;
  mdspan3d_type u_, un_;   // views of the current and updated fields
  float64 coef_;           // kappa * dt / dx^2

  heat_functor(Config &conf, RealView3D &u, RealView3D &un) {
    u_    = u.mdspan();
    un_   = un.mdspan();
    coef_ = conf.Kappa * conf.dt / (conf.dx*conf.dx);
  }

  // 7-point stencil update at grid point (ix, iy, iz)
  void operator()(const int ix, const int iy, const int iz) const {
    un_(ix, iy, iz) = u_(ix, iy, iz)
      + coef_ * ( u_(ix+1, iy, iz) + u_(ix-1, iy, iz)
                + u_(ix, iy+1, iz) + u_(ix, iy-1, iz)
                + u_(ix, iy, iz+1) + u_(ix, iy, iz-1)
                - 6. * u_(ix, iy, iz) );
  }
};
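For reference, this functor implements the standard explicit (forward-time, centered-space) update of the heat equation on a uniform grid, with $\Delta x = \Delta y = \Delta z$ implied by the single coefficient `coef_` $= \kappa \, \Delta t / \Delta x^2$:

$$
u^{n+1}_{i,j,k} = u^{n}_{i,j,k} + \frac{\kappa \, \Delta t}{\Delta x^{2}} \left( u^{n}_{i+1,j,k} + u^{n}_{i-1,j,k} + u^{n}_{i,j+1,k} + u^{n}_{i,j-1,k} + u^{n}_{i,j,k+1} + u^{n}_{i,j,k-1} - 6\, u^{n}_{i,j,k} \right).
$$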
As explained, we access the data `u` and `un` through `mdspan`, which are captured by value in the functor passed to `std::for_each_n`.
In the end, the numerical solution is compared against the analytical solution, for which we perform a parallel reduction with `std::transform_reduce`.
float64 l2loc = 0.0;
auto _u  = u.mdspan();
auto _un = un.mdspan();

// Squared pointwise difference between the two fields
auto L2norm_kernel = [=] (const int ix, const int iy, const int iz) {
  auto diff = _un(ix, iy, iz) - _u(ix, iy, iz);
  return diff * diff;
};

// Sum the squared differences over the whole 3D domain
Iterate_policy<3> policy3d({0, 0, 0}, {conf.nx, conf.ny, conf.nz});
Impl::transform_reduce(policy3d, std::plus<float64>(), L2norm_kernel, l2loc);
In both cases, we first define the parallel operation over `nx * ny * nz` elements. The 3D indices are then recovered manually from the flattened 1D index inside `Impl::for_each` and `Impl::transform_reduce`.
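The following is a minimal sketch of this wrapping, assuming a flat `std::views::iota` index range with `ix` as the fastest-varying index (matching `layout_left`); the function names are illustrative and not the repository's `Impl` API.

```cpp
#include <algorithm>
#include <cstddef>
#include <execution>
#include <ranges>

// Hypothetical wrappers (not the repository's Impl namespace): launch a flat
// kernel over nx*ny*nz elements and recover the 3D indices inside the lambda.

template <class Functor>
void for_each_3d(int nx, int ny, int nz, const Functor& f) {
  const std::size_t n = static_cast<std::size_t>(nx) * ny * nz;
  auto indices = std::views::iota(std::size_t{0}, n);
  std::for_each_n(std::execution::par_unseq, indices.begin(), n,
                  [=](std::size_t idx) {
                    const int ix = static_cast<int>(idx % nx);                                  // fastest
                    const int iy = static_cast<int>((idx / nx) % ny);
                    const int iz = static_cast<int>(idx / (static_cast<std::size_t>(nx) * ny)); // slowest
                    f(ix, iy, iz);
                  });
}

template <class T, class BinaryOp, class UnaryOp>
T transform_reduce_3d(int nx, int ny, int nz, T init,
                      BinaryOp binary_op, const UnaryOp& kernel) {
  const std::size_t n = static_cast<std::size_t>(nx) * ny * nz;
  auto indices = std::views::iota(std::size_t{0}, n);
  return std::transform_reduce(std::execution::par_unseq,
                               indices.begin(), indices.end(),
                               init, binary_op,
                               [=](std::size_t idx) {
                                 const int ix = static_cast<int>(idx % nx);
                                 const int iy = static_cast<int>((idx / nx) % ny);
                                 const int iz = static_cast<int>(idx / (static_cast<std::size_t>(nx) * ny));
                                 return kernel(ix, iy, iz);
                               });
}
```

In this sketch, `Impl::for_each(policy3d, heat_functor(conf, u, un))` would correspond to `for_each_3d(nx, ny, nz, heat_functor(conf, u, un))`, and the reduction above to `l2loc = transform_reduce_3d(conf.nx, conf.ny, conf.nz, l2loc, std::plus<float64>(), L2norm_kernel)`. With `nvc++ -stdpar=gpu`, such calls are offloaded to the GPU, provided the lambdas capture everything they touch by value (e.g. the `mdspan` views).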
You may find the source codes for each programming model under the `heat3d` directory.
---/
 |
 └──heat3d/
     |--kokkos/
     |--openacc/
     |--openmp/
     |--stdpar/
     |--thrust/
     |--CMakeLists.txt
     |--Helper.hpp
     |--Parser.hpp
     └──Timer.hpp
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=nvc++ -DCMAKE_BUILD_TYPE=Release -DBUILD_TESTING=OFF -DPROGRAMMING_MODEL=STDPAR -DBACKEND=CUDA -DAPPLICATION=heat3d ..
cmake --build .
./heat3d --nx 512 --ny 512 --nz 512 --nbiter 50000 --freq_diag 1000
You can set the problem size (resolution), the number of iterations, and the frequency of diagnostics.

- `nx`: The number of grid points in the x direction. (default: 128)
- `ny`: The number of grid points in the y direction. (default: 128)
- `nz`: The number of grid points in the z direction. (default: 128)
- `nbiter`: The number of iterations of the simulation. (default: 1000)
- `freq_diag`: Diagnostics are performed every `freq_diag` steps. (default: 10)
For example, we have executed the following command to make the animation.
./heat3d --nx 512 --ny 512 --nz 512 --nbiter 50000 --freq_diag 1000
With the run command
./heat3d --nx 512 --ny 512 --nz 512 --nbiter 1000 --freq_diag 0
you will get the following standard output (performed on a V100 GPU):
(nx, ny, nz) = 512, 512, 512
L2_norm: 0.00355178
Programming model: stdpar
Backend: CUDA
Elapsed time: 6.53835 [s]
Bandwidth: 328.444 [GB/s]
Flops: 184.75 [GFlops]
total 6.53835 [s], 1 calls
MainLoop 6.53832 [s], 1000 calls
Heat 6.53817 [s], 1000 calls
IO 0 [s], 0 calls
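These figures appear to be consistent with counting 16 bytes of traffic (one read of `u` and one write of `un` in double precision) and 9 floating-point operations per grid point per time step:

$$
\mathrm{Bandwidth} \approx \frac{512^3 \times 16\,\mathrm{B} \times 1000}{6.53835\,\mathrm{s}} \approx 328\,\mathrm{GB/s},
\qquad
\mathrm{Flops} \approx \frac{512^3 \times 9 \times 1000}{6.53835\,\mathrm{s}} \approx 185\,\mathrm{GFlops}.
$$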
You may find the source codes for each programming model under the `heat3d_mpi` directory.
---/
 |
 └──heat3d_mpi/
     |--kokkos/
     |--openacc/
     |--openmp/
     |--stdpar/
     |--thrust/
     |--CMakeLists.txt
     |--Helper.hpp
     |--Parser.hpp
     └──Timer.hpp
mkdir build && cd build
cmake -DCMAKE_CXX_COMPILER=nvc++ -DCMAKE_BUILD_TYPE=Release -DPROGRAMMING_MODEL=STDPAR -DBACKEND=CUDA -DAPPLICATION=heat3d_mpi ..
cmake --build .
To enforce device-to-device MPI communication, it is important to set the environment variable
export UCX_RNDV_FRAG_MEM_TYPE=cuda
This environment variable is available with HPC-X (OpenMPI 4.1.4 + UCX 1.13.0) under the NVIDIA HPC SDK v22.5 (or later). The run command in a job script is as follows:
export UCX_RNDV_FRAG_MEM_TYPE=cuda
mpirun -np 2 ./wrapper.sh ../build/heat3d_mpi/stdpar/heat3d_mpi --px 2 --py 2 --pz 2 --nx 256 --ny 256 --nz 256 --nbiter 1000 --freq_diag 10
GPU mapping inside a node should be done before `MPI_Init`, using `wrapper.sh` (for OpenMPI):
#!/bin/sh
# Map each local MPI rank to one of the GPUs in the node (round-robin),
# before MPI_Init is called by the application.
NGPUS=`nvidia-smi -L | wc -l`
export CUDA_VISIBLE_DEVICES=$((OMPI_COMM_WORLD_LOCAL_RANK % NGPUS))
exec $*
You can set the problem size (resolution), the MPI domain decomposition, the number of iterations, and the frequency of diagnostics.

- `nx`: The number of grid points in the x direction. (default: 128)
- `ny`: The number of grid points in the y direction. (default: 128)
- `nz`: The number of grid points in the z direction. (default: 128)
- `px`: The number of MPI processes in the x direction. (default: 2)
- `py`: The number of MPI processes in the y direction. (default: 2)
- `pz`: The number of MPI processes in the z direction. (default: 2)
- `nbiter`: The number of iterations of the simulation. (default: 1000)
- `freq_diag`: Diagnostics are performed every `freq_diag` steps. (default: 10)
In contrast to the `heat3d` case, `nx`, `ny`, and `nz` are the numbers of grid points per MPI process. In other words,
the total number of grid points in each direction is `nx * px`, `ny * py`, and `nz * pz`, respectively.
It should be noted that the total number of MPI processes `nb_procs` must be equal to `px * py * pz`.
If we set `px = 1`, the MPI communication along the x direction is suppressed and replaced by swapping halo regions locally, as sketched below.
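As an illustration only, a hedged sketch (not the repository's halo-exchange code) of this local periodic swap in the x direction could look like the following, assuming a view `u` with one halo cell on each side and the logical index range `[-1, nx]`:

```cpp
// Sketch: when px == 1, fill the x-direction halos by local periodic copies
// instead of sending/receiving them via MPI.
// `View` is any type indexable as u(ix, iy, iz) over ix in [-1, nx].
template <class View>
void swap_halo_x_periodic(View& u, int nx, int ny, int nz) {
  for(int iz = 0; iz < nz; iz++) {
    for(int iy = 0; iy < ny; iy++) {
      u(-1, iy, iz) = u(nx-1, iy, iz);  // left halo  <- rightmost interior plane
      u(nx, iy, iz) = u(0,    iy, iz);  // right halo <- leftmost interior plane
    }
  }
}
```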
With the run command
./heat3d_mpi --px 1 --py 1 --pz 2 --nx 512 --ny 512 --nz 256 --nbiter 1000 --freq_diag 0
you will get the following standard output (performed on V100 GPUs):
Parallelization (px, py, pz) = 1, 1, 2
Local (nx, ny, nz) = 512, 512, 256
Global (nx, ny, nz) = 512, 512, 512
L2_norm: 0.00355178
3D Range policy
STDPAR backend
Elapsed time: 3.62221 [s]
Bandwidth/GPU: 296.433 [GB/s]
Flops/GPU: 166.744 [GFlops]
total 3.62221 [s], 1 calls
MainLoop 3.62217 [s], 1000 calls
Heat 3.20131 [s], 1000 calls
HaloPack 0.101119 [s], 3000 calls
HaloUnpack 0.172946 [s], 3000 calls
HaloComm 0.146246 [s], 3000 calls
IO 0 [s], 0 calls