Questions Raised by pioperf #1893
Comments
I can't say much except that this is consistent with my own experience.
@edwardhartnett although this is true for iotype=3, I don't think it's the case for iotype=4.
It does not appear that any of the modes use compression. When I …
OK, sorry, you are quite right. So why so slow? Are the chunksizes set to match the chunks of data being written?
When I run the test built with #define VARINT 1 or #define VARREAL 1 the chunks are 1 record in size:

netcdf pioperf.1-0006-3 {
dimensions:
        dim000001 = 9600000 ;
        time = UNLIMITED ; // (10 currently)
variables:
        int vari0001(time, dim000001) ;
                vari0001:_FillValue = -2147483647 ;
                vari0001:_Storage = "chunked" ;
                vari0001:_ChunkSizes = 1, 960000 ;
                vari0001:_Endianness = "little" ;
                vari0001:_NoFill = "true" ;
...

When I switch to `#define VARDOUBLE 1` it's approximately 1/19 of a record:

netcdf pioperf.1-0006-4 {
dimensions:
        dim000001 = 9600000 ;
        time = UNLIMITED ; // (10 currently)
variables:
        double vard0001(time, dim000001) ;
                vard0001:_FillValue = 9.96920996838687e+36 ;
                vard0001:_Storage = "chunked" ;
                vard0001:_ChunkSizes = 1, 505264 ;
                vard0001:_Endianness = "little" ;
                vard0001:_NoFill = "true" ;
...

It also might be worth noting that the write speed for iotype 4 goes from the low to mid 200s to the mid to high 500s. Still by far the slowest iotype, but better.
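For reference, the _Storage, _ChunkSizes, _Endianness, and _NoFill lines above are the special virtual attributes that ncdump reports when given the -s flag; running `ncdump -h -s <file>` on any of the pioperf output files is a quick way to confirm what chunking a given run actually produced.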
Try making the chunksize for the first dimension greater than 1, and the chunksize for the second dimension smaller. Chunks do better when they are more square shaped.

Also, what is the write pattern of each processor? That would be the best chunksize...
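To make that suggestion concrete, here is a minimal sketch using the plain netcdf-c API rather than PIO's internal code path (PIO exposes a comparable wrapper, PIOc_def_var_chunking). The variable name, dimension ids, and per-task slice length are illustrative placeholders, with the chunk shape chosen to match a "one record of one task's slice" write pattern:

```c
#include <stddef.h>
#include <netcdf.h>

/* Sketch with placeholder names (not PIO's actual code): define a 2-D
 * (time, data) variable whose chunk shape matches what each rank writes,
 * i.e. one record of its own slice of the data dimension. */
static int define_chunked_var(int ncid, int time_dimid, int data_dimid,
                              size_t elems_per_task, int *varidp)
{
    int dimids[2] = {time_dimid, data_dimid};
    size_t chunksizes[2] = {1, elems_per_task};   /* {time, data} */
    int ret;

    if ((ret = nc_def_var(ncid, "vard0001", NC_DOUBLE, 2, dimids, varidp)))
        return ret;

    /* Chunking has to be set while the file is still in define mode. */
    return nc_def_var_chunking(ncid, *varidp, NC_CHUNKED, chunksizes);
}
```

Whether a "1 x slice" chunk or a more square chunk wins will depend on the access pattern; the point is only that the chunk shape is something the caller can control instead of accepting the library default.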
Is there anything we can do to make a better guess about chunksize within the parallelIO library? Perhaps there is information in the decomp we can use to improve netcdf4 parallel performance?
It is a hard problem. Default chunksizes are chosen by the netcdf-c library, and it's very hard to choose good ones. Essentially, the programmer must match the chunksizes to their IO. If you have a bunch of processors each writing slices of data of size X, Y, Z - then X, Y, Z is a good chunksize. But how am I going to guess that with just the information in netcdf metadata? There is no clue. Using the decomp is a good idea to come up with a different set of chunksizes, but I don't have time to look at that - I've just taken over NOAA's GRIB libraries and there is so much to do there...
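As a purely illustrative sketch of the decomp-based idea (this is not an existing PIO routine, and the names are made up): take the per-task map length from the decomposition and use it as the chunk length along the data dimension, capped so a single chunk stays in the few-megabyte range.

```c
#include <stddef.h>

/* Hypothetical heuristic, not part of PIO: derive a (time, data) chunk
 * shape from a decomposition's per-task map length, capping chunks at
 * roughly 4 MiB so very large per-task slices don't produce huge chunks. */
static void chunks_from_decomp(size_t maplen_per_task, size_t elem_size,
                               size_t *chunk_time, size_t *chunk_data)
{
    const size_t target_bytes = 4 * 1024 * 1024;   /* ~4 MiB per chunk */
    size_t len = maplen_per_task;

    if (elem_size > 0 && len * elem_size > target_bytes)
        len = target_bytes / elem_size;
    if (len == 0)
        len = 1;

    *chunk_time = 1;       /* one record per chunk */
    *chunk_data = len;     /* one task's slice, or a capped piece of it */
}
```

The resulting sizes would then be passed to nc_def_var_chunking (or a PIO wrapper) as in the earlier sketch.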
In PIO 2.5.5 we have added a Fortran interface to PIOc_write_nc_decomp and PIOc_read_nc_decomp, and I have written a …
I have been testing PIO 2.5.4 in the ROMS ocean model for a while now. Late last week I started testing the cluster I'm working on with the tests/performance/pioperf benchmark provided by PIO. I have only tried generated data, since the Subversion repository mentioned in tests/performance/Pioperformance.md is password protected. This required a switch to building with cmake instead of autotools (#1892), but the results I'm getting seem fairly in line with what I'm seeing in my PIO-enabled ROMS ocean model. My ROMS model uses a PIO 2.5.4 configured with autotools without timing enabled, but with all compilers, libraries, and other options the same as the cmake build.

I am running on 3 nodes of a research cluster. Each node has dual 16-core Intel Skylake processors connected by Infiniband HDR (100Gb/s) adapters, and storage is provided by IBM Spectrum Scale (GPFS). Below is my pioperf.nl, and the results are:
As you can see, the slowest write time is for parallel NetCDF4/HDF5 files. On this system, HDF5 v1.10.6, NetCDF4 v4.7.4, and PNetCDF v1.12.2 are configured and built by me with the Intel compiler and MPI (v19.1.5).
I also have access to a second research cluster with dual 20-core Intel Skylake processors connected by Infiniband HDR (100Gb/s) adapters with lustre storage. Not quite apples to apples but fairly close. On this machine, HDF5 v1.10.6, NetCDF4 v4.7.4, and PNetCDF 1.12.1 are all configured and built with Intel 2020 and Intel MPI by the system administrators. Here are the results on that system with the same pioperf.nl:
All tests were run at least five times on each cluster. I did not average them, but the runs shown are consistent with the other runs on the system. You can see that they both perform pretty well with pnetcdf (iotype=1) and pretty poorly with parallel writes using the NetCDF4/HDF5 library (iotype=4). Obviously, there are other intriguing differences here, but I would like to focus on the poor parallel writing speeds for NetCDF4/HDF5. Even compared to serial writes with NetCDF4/HDF5 (iotype=3), the parallel writing is slower.
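One setting worth ruling out when only the NetCDF4/HDF5 parallel path is slow is whether writes go through independent or collective MPI-IO. PIO normally manages this itself; the underlying netcdf-c call is shown below for reference, with placeholder ids (a diagnostic sketch, not a claim about what pioperf currently does):

```c
#include <netcdf.h>
#include <netcdf_par.h>

/* Diagnostic sketch with placeholder ids: force collective MPI-IO for a
 * variable in a file created/opened with nc_create_par/nc_open_par.
 * Collective access is usually much faster than independent access on
 * GPFS and Lustre for this kind of record-oriented write. */
static int use_collective_writes(int ncid, int varid)
{
    return nc_var_par_access(ncid, varid, NC_COLLECTIVE);
}
```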
Does anyone have any insights as to what may be happening here?