Consolidated Zarr support could improve S3 data loading #2987
Comments
👍 for support of Zarr v2 "consolidated". A discussion on the zarr-python team yesterday touched on how to deal with possible differences in the definition of consolidated between the v2 and v3 formats. The decision was to add arguments that enable the v2 consolidated format to the v3 library, but potentially disallow those arguments when producing the v3 format (since the v3 library will need to support both the v2 and v3 formats).
Sorry, I apparently missed this Issue when it was first posted.
What kind of arguments are being considered?
No, that has not changed, but agreed that it is a difficulty for the v2 format.
Correction. For zarr-python library v2, I should have said "methods" or "API" for activating consolidated metadata. (Those don't yet exist for zarr-python library v3.) The method arguments I was thinking of are in xarray: https://docs.xarray.dev/en/stable/user-guide/io.html#consolidated-metadata
Ok, I see.
The discussion around V3 is currently ongoing. It's unlikely that there will be significant work on a V2 "spec". (I would certainly be for having an "upgrade guide" between the two which may be as close as we can come.)
Thanks for the discussion! In the meantime I've tried to just add a "caching layer" to the metadata functions that would GET the .z* files, to see what the difference would be [1]. I've opened #2992, but it's a draft and perhaps not useful in the long term. [1]
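The general idea of a caching layer in front of the metadata reads can be sketched as below. This is only an illustration in Python, not the code in #2992 and not netcdf-c's implementation: `MetadataCache`, `make_counting_fetch`, and the `fetch` callable are hypothetical names, and the sketch assumes that a missing key is signalled with `KeyError`. It remembers both found and missing keys, since repeated existence checks are part of the request overhead.

```python
class MetadataCache:
    """Memoise metadata lookups so each key is fetched at most once."""

    def __init__(self, fetch):
        # fetch: hypothetical callable, key -> bytes; raises KeyError if absent.
        self._fetch = fetch
        self._found = {}
        self._missing = set()

    def get(self, key):
        # Known-missing keys fail immediately without another request.
        if key in self._missing:
            raise KeyError(key)
        if key not in self._found:
            try:
                self._found[key] = self._fetch(key)
            except KeyError:
                self._missing.add(key)
                raise
        return self._found[key]


def make_counting_fetch(objects):
    """Return (fetch, calls); calls records every key actually requested."""
    calls = []

    def fetch(key):
        calls.append(key)
        return objects[key]

    return fetch, calls


if __name__ == "__main__":
    fetch, calls = make_counting_fetch({".zgroup": b'{"zarr_format": 2}'})
    cache = MetadataCache(fetch)
    cache.get(".zgroup")
    cache.get(".zgroup")  # served from the cache, no second request
    print("requests sent:", len(calls))
```

The sketch never invalidates entries, so it only fits read-only access to a dataset that does not change while it is open.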
Hello 👋
We've noticed the difference between reading a remote Zarr dataset [https://...#mode=s3,zarr] and a local one [file://....#mode=file,zarr]:
Network overhead is expected, especially if the service imposes rate limits. But such a difference motivated me to look at the implementation behaviour.
It seems that the approach used by netcdf is similar to the one used by Python Zarr: fetching all the metadata in advance. For this reason, the following requests are sent (in netcdf) for my example above:
There are 3x more HEAD than GET requests, which could be a tiny improvement, but overall this is not much different from what Python does:
Which produces:
Implementing a consolidated access mode could improve the situation. In Python, the example above can be simplified to a single request:
/.zmetadata
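To make the request-count argument concrete, here is a minimal, self-contained sketch in plain Python. It uses neither zarr-python nor netcdf, and `CountingStore`, `build_zmetadata`, `example_objects`, and the two reader functions are hypothetical names for illustration. The only thing taken from Zarr v2 is the layout of `.zmetadata`: a JSON object with a `zarr_consolidated_format` field and a `metadata` mapping keyed by the original `.z*` paths. The `.zarray` document is abbreviated.

```python
import json

METADATA_NAMES = {".zgroup", ".zarray", ".zattrs"}


class CountingStore:
    """In-memory stand-in for a remote bucket that counts GET requests."""

    def __init__(self, objects):
        self._objects = dict(objects)
        self.get_count = 0

    def get(self, key):
        self.get_count += 1
        return self._objects[key]


def is_metadata_key(key):
    # Zarr v2 metadata objects are named .zgroup, .zarray or .zattrs.
    return key.rsplit("/", 1)[-1] in METADATA_NAMES


def build_zmetadata(objects):
    """Bundle every .z* document into one consolidated JSON document."""
    metadata = {
        key: json.loads(value)
        for key, value in sorted(objects.items())
        if is_metadata_key(key)
    }
    return json.dumps({"zarr_consolidated_format": 1, "metadata": metadata})


def read_individually(store, metadata_keys):
    """One GET per metadata object."""
    return {key: json.loads(store.get(key)) for key in metadata_keys}


def read_consolidated(store):
    """A single GET returns all metadata."""
    return json.loads(store.get(".zmetadata"))["metadata"]


def example_objects():
    # Abbreviated documents; a real .zarray carries more fields.
    return {
        ".zgroup": json.dumps({"zarr_format": 2}),
        ".zattrs": json.dumps({"title": "example"}),
        "temperature/.zarray": json.dumps({"zarr_format": 2, "shape": [4], "chunks": [2]}),
        "temperature/.zattrs": json.dumps({"units": "K"}),
        "temperature/0": "chunk-bytes",
        "temperature/1": "chunk-bytes",
    }


if __name__ == "__main__":
    objects = example_objects()
    objects[".zmetadata"] = build_zmetadata(objects)
    keys = [k for k in objects if is_metadata_key(k)]

    plain = CountingStore(objects)
    read_individually(plain, keys)
    consolidated = CountingStore(objects)
    read_consolidated(consolidated)
    print("individual GETs:", plain.get_count)
    print("consolidated GETs:", consolidated.get_count)
```

The saving grows with the number of groups and arrays, since the individual approach needs at least one request per metadata object while the consolidated read stays at one.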
(Note that not even a HEAD request is done in advance.)
If this is desired, perhaps it could be supported by other modes as well, like file (or even zip!?). In that case I think it would be part of the Zarr API and not a specific zmap S3 implementation.
I will try to come up with a PR for this, but it would be great to have some feedback and, if positive, some pointers/draft on how to support it (via #mode=consolidated controls? Environment? Only when built --with-consolidated-zarr?)
Thanks!