slow open_dataset for large NetCDF files #460

Open
briochemc opened this issue Oct 25, 2024 · 14 comments
Labels: question (Further information is requested)

Comments

@briochemc
Contributor

Working with some climate model data, I have large (5 GB) NetCDF files for each year of simulation that contain many variables (about 50) at monthly intervals. Just "opening" one of these files with open_dataset takes on the order of 2 minutes and uses a lot of allocations and memory. In comparison, xarray's open_dataset takes about 2 seconds for the same file:

@time ds = open_dataset(first(files))
123.606544 seconds (60.67 M allocations: 2.608 GiB, 0.61% gc time, 5.99% compilation time)

What am I doing wrong?

@Balinus
Contributor

Balinus commented Oct 25, 2024

Do you have a link to the file so I could test? I open such files almost daily and 2 minutes seems high. Is that for the 1st call of the function or for the 2nd call?
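For reference, a minimal way to separate compilation time from the actual opening cost, assuming file holds the path to one of the NetCDF files (the path below is a hypothetical placeholder):

using YAXArrays

file = "/path/to/ocean_month.nc"  # hypothetical path; substitute one of your files
@time ds = open_dataset(file)     # 1st call: includes compilation time
@time ds = open_dataset(file)     # 2nd call: opening cost only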

@Balinus
Contributor

Balinus commented Oct 25, 2024

Tested on a 29 GB file: it took 9 seconds for the 1st call, and 1.3 seconds on a second call to open_dataset (using a different 29 GB file with similar chunking/structure), on a small VM (4 CPUs, 16 GB RAM).

@briochemc
Contributor Author

The times were similar on the first and second calls. These files are not "online"; they're on Gadi at NCI (the Australian cluster). Is there a place where I can upload one file (about 5 GB) for you to test?

@lazarusA added the question (Further information is requested) label on Oct 26, 2024
@Balinus
Contributor

Balinus commented Oct 28, 2024

Can you download them to your computer and test when the file is local? Perhaps the problem lies more with the "http/downloads" packages than with YAXArrays?

As for a file, I don't know; it might be hard on my side, behind my corporate firewall, to download your file from most hosting providers. Is the NCI URL "open"?

@briochemc
Contributor Author

I already work "locally" in the sense that I don't download the files and instead use a compute node that has direct access to them. These files on NCI are not accessible without an account there, which is why I was offering to upload one somewhere "open".

@Balinus
Contributor

Balinus commented Oct 29, 2024

ok, I understand.

Sometimes on clusters, the filesystem (e.g. GPFS or NFS) can be slow. If you have a lot of I/O, it can be worthwhile to copy the file(s) onto the compute node currently used for the calculations. For example, on our clusters this is something like /state/partition. Hence, when starting the Slurm job, I sometimes add:

  • cp /gpfs/folder/*.nc /state/partition/myusername before starting Julia. This transfers the file(s) from the shared filesystem to a local/scratch folder accessible only to the compute node used by the following Julia script. If there is a bottleneck in the filesystem, this should remove it. Having the same access time on the 1st and 2nd calls to open_dataset seems to point to a problem related to this bandwidth. Worth trying, I think (see the sketch after this list).
  • Then, in the script, I refer to this local folder (/state/partition).
  • Finally, rm /state/partition/myusername/*.nc to clean up.
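A minimal Julia sketch of that workflow; the /gpfs/folder and /state/partition paths are the hypothetical, site-specific examples from above, so adjust them to your cluster:

using YAXArrays

shared  = "/gpfs/folder"                            # shared filesystem (site-specific)
scratch = joinpath("/state/partition", ENV["USER"]) # node-local scratch (site-specific)
mkpath(scratch)

# Copy the NetCDF files from the shared filesystem to node-local scratch.
for f in filter(endswith(".nc"), readdir(shared; join=true))
    cp(f, joinpath(scratch, basename(f)); force=true)
end

# Open from local scratch instead of the shared filesystem.
files = filter(endswith(".nc"), readdir(scratch; join=true))
@time ds = open_dataset(first(files))

# Clean up the scratch copies when done.
foreach(rm, files)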

In parallel, I am not sure whether YAXArrays uses the .zmetadata sidecar, where all the information on the dataset is stored? Using that would make it quicker, I think, to open the file (well, to get the metadata and build the YAXArray).

@felixcremer
Member

Zarr.jl should use the consolidated metadata that is in .zmetadata if available, but this is not applicable here, because he is dealing with NetCDF data.
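For the Zarr case (so not for this NetCDF file), a rough sketch of what using consolidated metadata looks like; the store path is hypothetical, and I am assuming zopen's keyword is named consolidated and that the store already contains .zmetadata (written e.g. via Zarr.consolidate_metadata), so treat the exact names as assumptions:

using Zarr

store = "/path/to/dataset.zarr"       # hypothetical store that already has .zmetadata
g = zopen(store, consolidated = true) # read .zmetadata once instead of probing every array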

You should be able to upload an example file here:
https://nextcloud.bgc-jena.mpg.de/s/SN2HJHwALmkReQQ

@briochemc
Contributor Author

It does not seem to work, but it does not tell me why. The file is 5.42 GB, in case that's the issue:

(screenshot of the failed upload attempt, 2024-10-30)

@briochemc
Contributor Author

@felixcremer I put one such file on my Google Drive that I can share, if that works?

@felixcremer
Member

> @felixcremer I put one such file on my Google Drive that I can share, if that works?

That works.

@briochemc
Contributor Author

Sent an invite to your email (from your GitHub profile)

@felixcremer
Member

I managed to reproduce this locally on my laptop with your dataset, so this is not a file system issue but rather a YAXArrays issue.
I also tried RasterStack from Rasters.jl, and it is much faster.

julia> @time  RasterStack("ocean_month_19901231.nc", lazy=true)
┌ Warning: unsupported calendar `GREGORIAN`. Time units are ignored.
└ @ CommonDataModel ~/.julia/packages/CommonDataModel/G3moc/src/cfvariable.jl:203
┌ Warning: unsupported calendar `GREGORIAN`. Time units are ignored.
└ @ CommonDataModel ~/.julia/packages/CommonDataModel/G3moc/src/cfvariable.jl:203
 16.379549 seconds (15.43 M allocations: 928.044 MiB, 2.22% gc time, 98.40% compilation time)
(remaining RasterStack summary output truncated)

@meggart
Member

meggart commented Nov 4, 2024

The main problem here is that, in the current implementation, YAXArrays keeps opening and closing the file several times for every variable inside it, which becomes a bit costly. One way to speed this up would be to go back to a NetCDF backend that just maintains a handle to the open file, like we did in the past, e.g. by defining this:

import YAXArrayBase as YAB
using NetCDF

# Metadata accessors so an open NetCDF.NcFile handle can serve as a dataset backend.
YAB.get_var_dims(ds::NetCDF.NcFile, name) = map(i -> i.name, ds[name].dim)
YAB.get_varnames(ds::NetCDF.NcFile) = collect(keys(ds.vars))
YAB.get_var_attrs(ds::NetCDF.NcFile, name) = copy(ds[name].atts)
YAB.get_global_attrs(ds::NetCDF.NcFile) = copy(ds.gatts)

# Backend capability flags.
YAB.allow_parallel_write(::Type{<:NetCDF.NcFile}) = false
YAB.allow_parallel_write(::NetCDF.NcFile) = false

YAB.allow_missings(::Type{<:NetCDF.NcFile}) = false
YAB.allow_missings(::NetCDF.NcFile) = false

# Variable lookup by name.
Base.haskey(ds::NetCDF.NcFile, k) = haskey(ds.vars, k)

then opening the file is very fast:

using YAXArrays

nc = NetCDF.open(file)  # file is the path to the NetCDF file
@time open_dataset(nc);

@time open_dataset(nc);

This opens the dataset very quickly. However, it means that a handle to the NetCDF file is kept open, which does not scale well if you want to lazily concatenate thousands of NetCDF files into a big multi-file dataset; that was the main reason for us to move to lazy file opening. A solution to all problems would be to open the file only for the time the YAXArray is created and all metadata is parsed, and to switch to the lazy representation afterwards, which means we would need to add some kind of context concept in YAXArrayBase.
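A rough sketch of that context idea, using a hypothetical helper (not an existing YAXArrayBase API) together with the handle-based accessor methods defined above; the real API in YAXArrayBase would of course look different:

using NetCDF
import YAXArrayBase as YAB

# Hypothetical helper: the file handle only lives while the metadata is parsed;
# actual data reads would reopen the file lazily later on.
function with_metadata(f, path)
    nc = NetCDF.open(path)
    try
        return f(nc)
    finally
        NetCDF.close(nc)
    end
end

# Usage sketch: collect variable names, dimensions and attributes in one pass,
# then build the lazy dataset representation from `meta` afterwards.
meta = with_metadata("ocean_month_19901231.nc") do nc
    names = YAB.get_varnames(nc)
    (; names,
       dims  = Dict(n => YAB.get_var_dims(nc, n) for n in names),
       attrs = Dict(n => YAB.get_var_attrs(nc, n) for n in names))
end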

I am happy to implement this, but the question remains whether it is really worth the effort when the medium-term plan is to move file opening out of YAXArrays and instead rely on functionality implemented in Rasters.jl to open YAXArray datasets.

@briochemc
Contributor Author

Thanks! And congrats on figuring out the issue! I just wanted to say that in my case I worked around the problem by "preprocessing" the data in Python (essentially, I selected the variables I needed and saved them in separate files), so no pressure from me!
