slow open_dataset for large NetCDF files #460

Open
briochemc opened this issue Oct 25, 2024 · 14 comments
Labels: question (Further information is requested)

Comments

@briochemc
Contributor

Working with some climate model data, I have large (5 GB) NetCDF files for each year of simulation that contain many variables (about 50) at monthly intervals. Just "opening" one of these files with open_dataset takes on the order of 2 minutes and uses a lot of allocations and memory. In comparison, xarray's open_dataset takes about 2 seconds for the same file:

@time ds = open_dataset(first(files))
123.606544 seconds (60.67 M allocations: 2.608 GiB, 0.61% gc time, 5.99% compilation time)

What am I doing wrong?

@Balinus
Contributor

Balinus commented Oct 25, 2024

Do you have a link to the file so I could test? I open such files almost daily and 2 minutes seems high. Is that for the 1st call of the function or for the 2nd call?
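For reference, a minimal way to separate compilation time from the actual opening cost, assuming file holds the path to one of the NetCDF files (the path below is a hypothetical placeholder):

using YAXArrays

file = "/path/to/ocean_month.nc"  # hypothetical path; substitute one of your files
@time ds = open_dataset(file)     # 1st call: includes compilation time
@time ds = open_dataset(file)     # 2nd call: opening cost only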

@Balinus
Contributor

Balinus commented Oct 25, 2024

Tested on a 29 GB file: it took 9 seconds for the 1st call, and 1.3 seconds on a second call to open_dataset (using a different 29 GB file with similar chunking/structure), on a small VM (4 CPUs, 16 GB RAM).

@briochemc
Contributor Author

The times were similar on the first and second calls. These files are not "online"; they're on Gadi at NCI (the Australian cluster). Is there a place where I can upload one file (about 5 GB) for you to test?

@lazarusA added the question (Further information is requested) label on Oct 26, 2024
@Balinus
Contributor

Balinus commented Oct 28, 2024

Can you download them to your computer and test when the file is local? Perhaps the problem lies more with the "http/downloads" packages than with YAXArrays?

As for a file, I don't know; it might be hard on my side, behind my corporate firewall, to download your file from most hosting providers. Is the NCI URL "open"?

@briochemc
Contributor Author

I already work "locally" in the sense that I don't download the files and instead use a compute node that has direct access to them. These files on NCI are not accessible without an account there, which is why I was offering to upload one somewhere "open".

@Balinus
Contributor

Balinus commented Oct 29, 2024

ok, I understand.

Sometimes on clusters, the filesystem (e.g. GPFS or NFS) can be slow. If you have a lot of I/O, it can be worthwhile to copy the file(s) onto the compute node currently used for the calculations. For example, on our clusters this is something like /state/partition. Hence, when starting the Slurm job, I sometimes add:

  • cp /gpfs/folder/*.nc /state/partition/myusername before starting Julia. This transfers the file(s) from the shared filesystem to a local/scratch folder accessible only to the compute node used by the following Julia script. If there is a bottleneck in the filesystem, this should remove it. Having the same access time on the 1st and 2nd calls to open_dataset seems to point to a problem related to this bandwidth. Worth trying, I think (see the sketch after this list).
  • Then, in the script, I refer to this local folder (/state/partition).
  • Finally, rm /state/partition/myusername/*.nc to clean up.
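A minimal Julia sketch of that workflow; the /gpfs/folder and /state/partition paths are the hypothetical, site-specific examples from above, so adjust them to your cluster:

using YAXArrays

shared  = "/gpfs/folder"                            # shared filesystem (site-specific)
scratch = joinpath("/state/partition", ENV["USER"]) # node-local scratch (site-specific)
mkpath(scratch)

# Copy the NetCDF files from the shared filesystem to node-local scratch.
for f in filter(endswith(".nc"), readdir(shared; join=true))
    cp(f, joinpath(scratch, basename(f)); force=true)
end

# Open from local scratch instead of the shared filesystem.
files = filter(endswith(".nc"), readdir(scratch; join=true))
@time ds = open_dataset(first(files))

# Clean up the scratch copies when done.
foreach(rm, files)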

In parallel, I am not sure whether YAXArrays uses the .zmetadata sidecar, where all the information on the dataset is stored? Using that would make it quicker, I think, to open the file (well, to get the metadata and build the YAXArray).

@felixcremer
Member

Zarr.jl should use the consolidated metadata that is in .zmetadata if available, but this is not applicable here, because he is dealing with NetCDF data.
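For the Zarr case (so not for this NetCDF file), a rough sketch of what using consolidated metadata looks like; the store path is hypothetical, and I am assuming zopen's keyword is named consolidated and that the store already contains .zmetadata (written e.g. via Zarr.consolidate_metadata), so treat the exact names as assumptions:

using Zarr

store = "/path/to/dataset.zarr"       # hypothetical store that already has .zmetadata
g = zopen(store, consolidated = true) # read .zmetadata once instead of probing every array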

You should be able to upload an example file here:
https://nextcloud.bgc-jena.mpg.de/s/SN2HJHwALmkReQQ

@briochemc
Contributor Author

It does not seem to work, but it does not tell me why. The file is 5.42 GB, in case that's the issue:

(screenshot of the failed upload attempt, 2024-10-30)

@briochemc
Contributor Author

@felixcremer I put one such file on my Google Drive that I can share, if that works?

@felixcremer
Member

> @felixcremer I put one such file on my Google Drive that I can share, if that works?

That works.

@briochemc
Contributor Author

Sent an invite to your email (from your GitHub profile)

@felixcremer
Member

I managed to reproduce this locally on my laptop with your dataset, so this is not a file system issue but rather a YAXArrays issue.
I also tried RasterStack from Rasters.jl, and it is much faster.

julia> @time  RasterStack("ocean_month_19901231.nc", lazy=true)
┌ Warning: unsupported calendar `GREGORIAN`. Time units are ignored.
└ @ CommonDataModel ~/.julia/packages/CommonDataModel/G3moc/src/cfvariable.jl:203
┌ Warning: unsupported calendar `GREGORIAN`. Time units are ignored.
└ @ CommonDataModel ~/.julia/packages/CommonDataModel/G3moc/src/cfvariable.jl:203
 16.379549 seconds (15.43 M allocations: 928.044 MiB, 2.22% gc time, 98.40% compilation time)
(remaining RasterStack summary output truncated)

@meggart
Member

meggart commented Nov 4, 2024

The main problem here is that, in the current implementation, YAXArrays keeps opening and closing the file several times for every variable inside it, which becomes a bit costly. One way to speed this up would be to go back to a NetCDF backend that just maintains a handle to the open file, like we did in the past, e.g. by defining this:

import YAXArrayBase as YAB
using NetCDF

# Metadata accessors so an open NetCDF.NcFile handle can serve as a dataset backend.
YAB.get_var_dims(ds::NetCDF.NcFile, name) = map(i -> i.name, ds[name].dim)
YAB.get_varnames(ds::NetCDF.NcFile) = collect(keys(ds.vars))
YAB.get_var_attrs(ds::NetCDF.NcFile, name) = copy(ds[name].atts)
YAB.get_global_attrs(ds::NetCDF.NcFile) = copy(ds.gatts)

# Backend capability flags.
YAB.allow_parallel_write(::Type{<:NetCDF.NcFile}) = false
YAB.allow_parallel_write(::NetCDF.NcFile) = false

YAB.allow_missings(::Type{<:NetCDF.NcFile}) = false
YAB.allow_missings(::NetCDF.NcFile) = false

# Variable lookup by name.
Base.haskey(ds::NetCDF.NcFile, k) = haskey(ds.vars, k)

then opening the file is very fast:

using YAXArrays

nc = NetCDF.open(file)  # file is the path to the NetCDF file
@time open_dataset(nc);

@time open_dataset(nc);

This opens the dataset very quickly. However, it means that a handle to the NetCDF file is kept open, which does not scale well if you want to lazily concatenate thousands of NetCDF files into a big multi-file dataset; that was the main reason for us to move to lazy file opening. A solution to all problems would be to open the file only for the time the YAXArray is created and all metadata is parsed, and to switch to the lazy representation afterwards, which means we would need to add some kind of context concept in YAXArrayBase.
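A rough sketch of that context idea, using a hypothetical helper (not an existing YAXArrayBase API) together with the handle-based accessor methods defined above; the real API in YAXArrayBase would of course look different:

using NetCDF
import YAXArrayBase as YAB

# Hypothetical helper: the file handle only lives while the metadata is parsed;
# actual data reads would reopen the file lazily later on.
function with_metadata(f, path)
    nc = NetCDF.open(path)
    try
        return f(nc)
    finally
        NetCDF.close(nc)
    end
end

# Usage sketch: collect variable names, dimensions and attributes in one pass,
# then build the lazy dataset representation from `meta` afterwards.
meta = with_metadata("ocean_month_19901231.nc") do nc
    names = YAB.get_varnames(nc)
    (; names,
       dims  = Dict(n => YAB.get_var_dims(nc, n) for n in names),
       attrs = Dict(n => YAB.get_var_attrs(nc, n) for n in names))
end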

I am happy to implement this, but the question remains whether it is really worth the effort when the medium-term plan is to move file opening out of YAXArrays and instead rely on functionality implemented in Rasters.jl to open YAXArray datasets.

@briochemc
Contributor Author

Thanks! And congrats on figuring out the issue! I just wanted to say that in my case I worked around the problem by "preprocessing" the data in Python (essentially, I selected the variables I needed and saved them in separate files), so no pressure from me!
