Analysis-ready chunking of diagnostic output files #203

aekiss · 2024-08-15T02:16:55Z

Following from @Thomas-Moore-Creative's talk today, we should think about the NetCDF chunking we use to write to disk, so that the native chunking is OK for typical workflows.

Note that in a compressed, chunked NetCDF file, if you access any data in a chunk, the whole chunk needs to be read and uncompressed. So that can be a pitfall if the chunking doesn't match the access requirements, e.g. chunks are too big in the wrong dimensions. e.g. we had that problem with ERA5 forcing in ACCESS-OM2: COSIMA/access-om2#242

Maybe we should set up a discussion/poll on the forum?

Thomas-Moore-Creative · 2024-08-15T04:19:20Z

@aekiss - after all my bluster about how important the choice of "native chunking" on the raw output is, what do we know about the limitations ( if any) on different models ability to control chunking of output at run-time? Where do modellers have that control in, say, MOM6? Is that dependent on / limited by how the model tiling is setup?

A recent conversation I had with @dougiesquire mused about choosing native chunking that was suited for and facilitated easier rechunking later. One of the problems that comes up is if you, for example, have very large chunks and are forced to load into memory most or all of the dataset to rechunk into another chunking arrangement.

That being said I'm not clear what the current COSIMA native chunking is and if it would need or benefit from change? ( other products I've come across very much do )

aekiss · 2024-08-15T06:53:14Z

Good questions.

In terms of output directly from the model components,

MOM6 chunking in x and y is controlled by IO_LAYOUT, which can differ from the tiling; there's 1 chunk in z and 1 chunk in time per file (I think)
CICE6 now supports a user-specified history_chunksize as of Update IO formats and add new IO namelist controls CICE-Consortium/CICE#928 which controls chunking in x and y. There's no chunking in time, as CICE6 outputs a new file for each timestamp. @anton-seaice knows a lot more about it than me.

Model runs are broken into short segments to fit into queue limits (so segments are shortest at high resolution, e.g. a few months) so post-processing would be required to change the chunking in time.

aekiss · 2024-08-15T06:57:13Z

The other consideration is the impact of chunking on IO performance of the model itself (which can become a bottleneck at high resolution). There's a lot of discussion of this in https://gmd.copernicus.org/articles/13/1885/2020/

It would be nice if there was a compromise that worked well both for runtime performance and analysis, but maybe these are incompatible and raw model outputs would require post-processing to suit analysis.

anton-seaice · 2024-08-15T07:05:12Z

I believe MOM chunksizes are set in the fml namelist:

&fms2_io_nml
    ncchksz = 4194304
...

which is 4MB. I think part of the goal in having that size quite small is that it avoids splitting the chunks during analysis as much as practical (and some other reason about cache sizes I guess?)

It's hard to imagine model output having a chunksizes in time of anything other than 1. Like either it needs:

keeping the model output in memory for multiple time averages (maybe possible as we don't seem to be very memory limited),
or writing the chunks "out of order". e.g. if the time chunksize is 31, to write output at the end of each model day would need the model to only write to every 31st place in the output ... which sounds slow.

So I think its a question of how much extra time do we want running the model, vs how much extra time is it in analysis ?

angus-g · 2024-08-15T07:42:55Z

I think that is a poorly-named parameter that refers only to the internal library chunking (and maybe even only for NetCDF classic files, rather than the HDF5-backed NetCDF4 files). The per-dimension chunking is defined in netcdf var_def calls, and needs an array of chunksizes, rather than figuring it out from an overall chunk size. I think it is indeed the case that it depends on the IO_LAYOUT in the case of diagnostic output.

anton-seaice · 2024-08-15T22:46:25Z

Thanks Angus! We might need to revisit ncchksz which is more of a cache size when we tune the IO_LAYOUT. And that makes sense the chunksize related to IO_LAYOUT in x/y

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Analysis-ready chunking of diagnostic output files #203

Analysis-ready chunking of diagnostic output files #203

aekiss commented Aug 15, 2024 •

edited

Loading

Thomas-Moore-Creative commented Aug 15, 2024

aekiss commented Aug 15, 2024

aekiss commented Aug 15, 2024 •

edited

Loading

anton-seaice commented Aug 15, 2024

angus-g commented Aug 15, 2024

anton-seaice commented Aug 15, 2024

Analysis-ready chunking of diagnostic output files #203

Analysis-ready chunking of diagnostic output files #203

Comments

aekiss commented Aug 15, 2024 • edited Loading

Thomas-Moore-Creative commented Aug 15, 2024

aekiss commented Aug 15, 2024

aekiss commented Aug 15, 2024 • edited Loading

anton-seaice commented Aug 15, 2024

angus-g commented Aug 15, 2024

anton-seaice commented Aug 15, 2024

aekiss commented Aug 15, 2024 •

edited

Loading

aekiss commented Aug 15, 2024 •

edited

Loading