Questions Raised by pioperf #1893

Open
rjdave opened this issue Dec 6, 2021 · 10 comments

Comments

@rjdave

rjdave commented Dec 6, 2021

I have been testing PIO 2.5.4 in the ROMS ocean model for a while now. Late last week I started testing the cluster I'm working on with the tests/performance/pioperf benchmark provided by PIO. I have only tried generated data, since the Subversion repository mentioned in tests/performance/Pioperformance.md is password protected. This required a switch to building with cmake instead of autotools (#1892), but the results I'm getting seem fairly in line with what I'm seeing in my PIO-enabled ROMS ocean model. My ROMS model uses a PIO 2.5.4 build configured with autotools without timing enabled, but with all compilers, libraries, and other options the same as the cmake build.

I am running on 3 nodes of a research cluster. Each node has dual 16-core Intel Skylake processors, the nodes are connected by InfiniBand HDR (100Gb/s) adapters, and storage is provided by IBM Spectrum Scale (GPFS). Below is my pioperf.nl:

&pioperf
 decompfile = 'BLOCK',
 pio_typenames = 'pnetcdf' 'netcdf' 'netcdf4c' 'netcdf4p'
 rearrangers = 1,2
 nframes = 10
 nvars = 8
 niotasks = 6
 varsize = 100000
/

And the results are:

 (t_initf) Read in prof_inparm namelist from: pioperf.nl
  Testing decomp: BLOCK
 iotype=           1  of            4
 RESULT: write       BOX         1         6         8     1343.6480252327
  RESULT: read       BOX         1         6         8     6796.6537469962
 RESULT: write    SUBSET         1         6         8     2037.0848491243
  RESULT: read    SUBSET         1         6         8     2213.6412788202
 iotype=           2  of            4
 RESULT: write       BOX         2         6         8      878.4271379151
  RESULT: read       BOX         2         6         8     1686.7799257658
 RESULT: write    SUBSET         2         6         8      835.9381757852
  RESULT: read    SUBSET         2         6         8     1702.5454246362
 iotype=           3  of            4
 RESULT: write       BOX         3         6         8     1007.6473058227
  RESULT: read       BOX         3         6         8     2030.3052886453
 RESULT: write    SUBSET         3         6         8      942.8001105710
  RESULT: read    SUBSET         3         6         8     2156.0752216195
 iotype=           4  of            4
 RESULT: write       BOX         4         6         8      223.5714932068
  RESULT: read       BOX         4         6         8     2925.2752645271
 RESULT: write    SUBSET         4         6         8      232.4932193447
  RESULT: read    SUBSET         4         6         8     4293.8078647335

As you can see, the slowest writes are for parallel NetCDF4/HDF5 files. On this system, HDF5 v1.10.6, NetCDF4 v4.7.4, and PNetCDF v1.12.2 are configured and built by me with the Intel compiler and MPI (v19.1.5).

I also have access to a second research cluster with dual 20-core Intel Skylake processors connected by InfiniBand HDR (100Gb/s) adapters and Lustre storage. Not quite apples to apples, but fairly close. On this machine, HDF5 v1.10.6, NetCDF4 v4.7.4, and PNetCDF 1.12.1 are all configured and built with Intel 2020 and Intel MPI by the system administrators. Here are the results on that system with the same pioperf.nl:

(t_initf) Read in prof_inparm namelist from: pioperf.nl
 Testing decomp: BLOCK
iotype=           1  of            4
RESULT: write       BOX         1         6         8     1267.8699411769
 RESULT: read       BOX         1         6         8     1091.6380573150
RESULT: write    SUBSET         1         6         8     1235.7559619398
 RESULT: read    SUBSET         1         6         8     1412.1266585430
iotype=           2  of            4
RESULT: write       BOX         2         6         8      392.0196185289
 RESULT: read       BOX         2         6         8     1763.8134047182
RESULT: write    SUBSET         2         6         8      397.0943986050
 RESULT: read    SUBSET         2         6         8     1830.4833729873
iotype=           3  of            4
RESULT: write       BOX         3         6         8      553.0218402955
 RESULT: read       BOX         3         6         8     3070.8227757982
RESULT: write    SUBSET         3         6         8      537.8703873321
 RESULT: read    SUBSET         3         6         8     3111.6566202294
iotype=           4  of            4
RESULT: write       BOX         4         6         8      300.8776448667
 RESULT: read       BOX         4         6         8     3015.9222552535
RESULT: write    SUBSET         4         6         8      348.4993834234
 RESULT: read    SUBSET         4         6         8     3060.4128763664

All tests were run at least five times on each cluster. I did not average them, but the runs shown are consistent with the other runs on each system. You can see that both clusters perform pretty well with pnetcdf (iotype=1) and pretty poorly with parallel writes using the NetCDF4/HDF5 library (iotype=4). Obviously, there are other intriguing differences here, but I would like to focus on the poor parallel writing speeds for NetCDF4/HDF5. Even compared to serial writes with NetCDF4/HDF5 (iotype=3), the parallel writing is slower.
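For reference, the iotype numbers in these results follow the order of pio_typenames in the namelist above, which, as far as I can tell from pio.h, also lines up with PIO's own iotype constants. My reading (names shortened for illustration, not authoritative):

/* iotype column in the pioperf output, as I read it from pio.h. */
enum {
    IOTYPE_PNETCDF  = 1,  /* 'pnetcdf'  : Parallel-NetCDF, classic format    */
    IOTYPE_NETCDF   = 2,  /* 'netcdf'   : serial netCDF classic              */
    IOTYPE_NETCDF4C = 3,  /* 'netcdf4c' : serial netCDF-4/HDF5, compressed   */
    IOTYPE_NETCDF4P = 4   /* 'netcdf4p' : parallel netCDF-4/HDF5             */
};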

Does anyone have any insights as to what may be happening here?

@jedwards4b
Contributor

I can't say much except that this is consistent with my own experience.

@edwardhartnett
Collaborator

edwardhartnett commented Dec 7, 2021

That is happening because PIO automatically turns on zlib compression for data in netCDF/HDF5 files. That's quite slow.

Using the new netCDF integration feature, you can use PIO with the netCDF APIs, and it does not automatically turn on compression - you must explicitly turn it on for each variable in the netCDF API. In that case, you will see much faster write times for netCDF/HDF5 files.
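For example, with the plain netCDF C API the deflate filter has to be requested per variable. A minimal sketch (file, dimension, and variable names are made up for illustration):

#include <stdio.h>
#include <stdlib.h>
#include <netcdf.h>

/* Abort on any netCDF error so the sketch stays short. */
static void check(int status)
{
    if (status != NC_NOERR) {
        fprintf(stderr, "netCDF error: %s\n", nc_strerror(status));
        exit(1);
    }
}

int main(void)
{
    int ncid, dimid, varid;

    check(nc_create("example.nc", NC_NETCDF4 | NC_CLOBBER, &ncid));
    check(nc_def_dim(ncid, "x", 1000000, &dimid));
    check(nc_def_var(ncid, "data", NC_FLOAT, 1, &dimid, &varid));

    /* Compression is opt-in per variable: shuffle=1, deflate=1, level=1. */
    check(nc_def_var_deflate(ncid, varid, 1, 1, 1));

    check(nc_enddef(ncid));
    check(nc_close(ncid));
    return 0;
}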

I am presenting a paper at the AGU about compression; here's a graph that illustrates how much zlib impacts performance:
[attached image: graph of write rates with and without zlib compression]

Note how large the write rate is for compression = "none".

@jedwards4b
Contributor

@edwardhartnett although this is true for iotype=3, I don't think it's the case for iotype=4.

@rjdave
Author

rjdave commented Dec 7, 2021

It does not appear that any of the modes use compression. When I run ncdump -hs on each of the output files, none of them has a _DeflateLevel attribute.

@edwardhartnett
Collaborator

OK, sorry, you are quite right. So why so slow?

Are the chunksizes set to match the chunks of data being written?
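You can see what the library picked either with ncdump -hs or from code. A rough sketch with the netCDF C API (the file and variable names below are placeholders, not the actual pioperf output names):

#include <stdio.h>
#include <netcdf.h>

int main(void)
{
    int ncid, varid, ndims, storage;
    size_t chunks[NC_MAX_VAR_DIMS];

    /* Placeholder file/variable names -- substitute the real pioperf output. */
    if (nc_open("pioperf.nc", NC_NOWRITE, &ncid)) return 1;
    if (nc_inq_varid(ncid, "vard0001", &varid)) return 1;
    if (nc_inq_varndims(ncid, varid, &ndims)) return 1;

    /* Same information ncdump -hs reports as _Storage / _ChunkSizes. */
    if (nc_inq_var_chunking(ncid, varid, &storage, chunks)) return 1;
    if (storage == NC_CHUNKED)
        for (int i = 0; i < ndims; i++)
            printf("chunk[%d] = %zu\n", i, chunks[i]);
    else
        printf("variable is contiguous\n");

    nc_close(ncid);
    return 0;
}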

@rjdave
Author

rjdave commented Dec 7, 2021

When I run the test built with #define VARINT 1 or #define VARREAL 1, the chunks span 1 record along the unlimited dimension and 960000 elements (roughly 1/10 of a record) along the data dimension:

netcdf pioperf.1-0006-3 {
dimensions:
        dim000001 = 9600000 ;
        time = UNLIMITED ; // (10 currently)
variables:
        int vari0001(time, dim000001) ;
                vari0001:_FillValue = -2147483647 ;
                vari0001:_Storage = "chunked" ;
                vari0001:_ChunkSizes = 1, 960000 ;
                vari0001:_Endianness = "little" ;
                vari0001:_NoFill = "true" ;
...

When I switch to #define VARDOUBLE 1, it's approximately 1/19 of a record:

netcdf pioperf.1-0006-4 {
dimensions:
        dim000001 = 9600000 ;
        time = UNLIMITED ; // (10 currently)
variables:
        double vard0001(time, dim000001) ;
                vard0001:_FillValue = 9.96920996838687e+36 ;
                vard0001:_Storage = "chunked" ;
                vard0001:_ChunkSizes = 1, 505264 ;
                vard0001:_Endianness = "little" ;
                vard0001:_NoFill = "true" ;
...

It also might be worth noting that the iotype 4 write speed goes from the low-to-mid 200s with the integer/real builds to the mid-to-high 500s with doubles. Still by far the slowest iotype, but better.

@edhartnett
Collaborator

edhartnett commented Dec 7, 2021 via email

@jedwards4b
Contributor

Is there anything we can do to make a better guess about chunksize within the ParallelIO library? Perhaps there is information in the decomp that we could use to improve netcdf4 parallel performance?

@edwardhartnett
Collaborator

It is a hard problem. Default chunksizes are chosen by the netcdf-c library, and it's very hard to choose good ones. Essentially, the programmer must match the chunksizes to their IO.

If you have a bunch of processors each writing slices of data of size X, Y, Z - then X, Y, Z is a good chunksize. But how am I going to guess that with just the information in netcdf metadata? There is no clue.
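In code, that fix is just a matter of handing the per-task slice shape to nc_def_var_chunking before nc_enddef. A minimal sketch of the idea (dimension names and sizes are hypothetical, not taken from pioperf):

#include <netcdf.h>

/* Call between nc_def_var() and nc_enddef(). If every task writes a
 * 1 x ny_per_task x nx_per_task slice per record, use exactly that shape
 * as the chunk shape instead of the library default. */
static int set_chunks_to_match_io(int ncid, int varid,
                                  size_t ny_per_task, size_t nx_per_task)
{
    size_t chunks[3] = { 1, ny_per_task, nx_per_task };
    return nc_def_var_chunking(ncid, varid, NC_CHUNKED, chunks);
}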

Using the decomp is a good idea to come up with a different set of chunksizes, but I don't have time to look at that - I've just taken over NOAA's GRIB libraries and there is so much to do there...

@jedwards4b
Contributor

In pio2.5.5 we have added a Fortran interface to PIOc_write_nc_decomp and PIOc_read_nc_decomp, and I have written a program to translate decomps in the old text format to the new netCDF format. I would like to store these files someplace that is publicly accessible instead of in the cgd Subversion server - any suggestions as to where would be the best place?
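For anyone trying the new format from C, a rough sketch of writing out an existing decomposition follows. The parameter list is my recollection and should be checked against pio.h in 2.5.5; iosysid and ioid are assumed to come from earlier PIOc_Init_Intracomm()/PIOc_InitDecomp() calls, and the file name is made up:

#include <pio.h>

/* Sketch only -- verify the PIOc_write_nc_decomp prototype in pio.h. */
static int save_decomp(int iosysid, int ioid)
{
    char title[]   = "BLOCK decomp from pioperf";            /* free-form metadata */
    char history[] = "converted from the old text format";   /* free-form metadata */

    /* cmode 0 = default creation flags; last argument flags Fortran
     * dimension ordering for the decomp (0 here). */
    return PIOc_write_nc_decomp(iosysid, "decomp_block.nc", 0, ioid,
                                title, history, 0);
}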
