Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add Zarr v3 dependency #182

Draft
wants to merge 6 commits into
base: main
Choose a base branch
from

Conversation

ghidalgo3
Copy link
Contributor

@ghidalgo3 ghidalgo3 commented Jul 10, 2024

This PR is a pre-cursor to #175 , and will be blocked on at least Kerchunk supporting ZarrV3.

TODOs:

Checklist:

  • Closes Zarr V2 vs V3 #17
  • Tests added
  • Tests passing
  • Full type hint coverage
  • Changes are documented in docs/releases.rst
  • New functions/methods are listed in api.rst
  • New functionality has documentation

docs/releases.rst Outdated Show resolved Hide resolved
@TomNicholas TomNicholas added Kerchunk Relating to the kerchunk library / specification itself upstream issue zarr-python Relevant to zarr-python upstream labels Jul 10, 2024
Slopy copy-paste, thank you Tom

Co-authored-by: Tom Nicholas <[email protected]>
@TomNicholas
Copy link
Member

Adding zarr as a dependency would also allow us to un-vendor this small bit of code

https://github.com/zarr-developers/VirtualiZarr/blob/main/virtualizarr/vendor/zarr/utils.py

@ghidalgo3
Copy link
Contributor Author

I ran into an issue with a behavior change in the call to zarr.group, made by Kerchunk when VirtualiZarr calls SingleHdf5ToZarr. Here's the problem:

  1. VirtualiZarr (either through unit tests tests or a open_virtual_dataset call) will call SingleHdf5ToZarr and pass a out=None argument.
  2. In kerchunk, that out is passed to zarr.group(store=out). In ZarrV2, that None would get converted to an in-memory store that happens to subclass MutableMapping. But in ZarrV3, that gets turned into a StoreLike which doesn't implement MutableMapping.
  3. This eventually causes a problem whenever kerchunk calls _encode_for_JSON on that store because ZarrV3 stores are not so trivial to turn into a dictionary, there's an async enumeration that needs to happen.

@TomNicholas maybe it might be less work globally if VirtualiZarr didn't depend on Kerchunk? I know there's #78 and #87 as prototypes, maybe they are prerequisites to making VirtualiZarr depend on ZarrV3.

The process of using Kerchunk to generate ChunkManifests can be delegated to some other package that can continue depending on ZarrV2.

@TomNicholas
Copy link
Member

@ghidalgo3 that sounds like a gnarly bug!

(It would be nice if python allowed you to have different versions imported for different packages, so that kerchunk could depend on zarr v2 whilst virtualizarr imported zarr v3.)

maybe it might be less work globally if VirtualiZarr didn't depend on Kerchunk?

@ghidalgo3 many things in this library would be less work if we didn't have to depend on kerchunk, and we're working towards that, but it's not going to be done quickly (especially not for other non-HDF filetypes that kerchunk can also read).

Making kerchunk work with zarr v3 would be the cleanest solution, but it's difficult to know how long that might take too (I know Joe tried it and found it to be a rabbit hole).

The process of using Kerchunk to generate ChunkManifests can be delegated to some other package that can continue depending on ZarrV2.

I'm not sure I understand this suggestion. Wouldn't that have exactly the same problem as now?

@ghidalgo3
Copy link
Contributor Author

I'm not sure I understand this suggestion. Wouldn't that have exactly the same problem as now?

Not if you have 2 Python environments:

  1. One with Kerchunk and Zarr V2 and it is responsible for reading files and producing ChunkManifests, and you write these to disk or blobs or whatever.
  2. One with VirtualiZarr and ZarrV3, which takes the ChunkManifests produced by the first step to create the virtual datasets and ultimately read/write data..

It's an ugly solution yes, but compared to making Kerchunk work with ZarrV3 maybe it's less work? I'm not in a rush so probably makes sense to make Kerchunk work with V3 and V2, and that's only going to happen by upgrading Kerchunk to Zarr>=3.0.0.

(It would be nice if python allowed you to have different versions imported for different packages, so that kerchunk could depend on zarr v2 whilst virtualizarr imported zarr v3.)

That would be nice! But even before then it would be better if Zarr v2 -> Zarr v3 had easy breaking changes :( This specific one is really subtle, there's no API change to zarr.group but an internal change to how Stores are handled inside Zarr, which causes a problem for Kerchunk, which causes a problem for VirtualiZarr.

I'll focus my effort on trying to make Kerchunk work with Zarr V3 then.

@TomNicholas
Copy link
Member

Not if you have 2 Python environments:

Ohhh right now I get it - you literally split your workflow up into two separate steps using different environments each time.

an internal change to how Stores are handled inside Zarr, which causes a problem for Kerchunk

I mean to me that sounds like it's kerchunk's fault for using internal zarr API...

I'll focus my effort on trying to make Kerchunk work with Zarr V3 then.

Amazing - let us know how that goes (good or bad) so we can re-evaluate here.

@@ -12,8 +12,7 @@
import ujson # type: ignore
import xarray as xr
from pydantic import BaseModel, ConfigDict, field_validator

from virtualizarr.vendor.zarr.utils import json_dumps
from zarr.v2.util import json_dumps
Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I would avoid this import if possible. We are likely to remove the v2 namespace before the 3.0 release.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I'll revert and continue to vendor this bit of code.

@ghidalgo3
Copy link
Contributor Author

@TomNicholas this work did lead down a pretty deep rabbit hole, probably following in Joe's footsteps. My conclusion is that VirtualiZarr should not take a code-level dependency on zarr-python until all of VirtualiZarr's dependencies explicitly support ZarrV3. VirtualiZarr should seek to read/write valid ZarrV3 metadata even if it means duplicating some logic between it and zarr-python.

Specifically here is the problem chain this lead me down:

  • Adding zarr==3.0.0a0 to VirtualiZarr breaks kerchunk, not just because of the module path name changes but also because Stores are no longer MutableMappings zarr-python#2023 means that internally kerchunk's usage of the Store API is broken.
  • Adding a compatibility layer to kerchunk to work with both zarrv2 and zarrv3 can be done Add Zarr compatibility functions fsspec/kerchunk#478 but if you actually try to use kerchunk with zarrv3 it is broken due to the Store changes in ZarrV3.
  • Even if kerchunk's normal usage of zarr supported v3, kerchunk's unit tests depend on xarray.to_zarr which doesn't work at least because in zarrv3 all the API functions require full keyword arguments, let alone type signature changes. That means xarray also needs to support zarrv3.
  • I stopped once I discovered that all of kerchunk, fsspec, and xarray all needed changes to work with ZarrV3 and the scope of this change became clear.

Does that make sense? If so, I I'll abandon both this PR and #175

@TomNicholas
Copy link
Member

That makes complete sense, and you've really done a great job here going down that rabbit hole and clearly communicating all the steps required.

I don't know that we need to close this PR, but it does sound like it will need to be paused for a while before it could be picked up again.

I also agree that in the meantime we can aim for v3 support without a dependency on zarr-python v3

cc @jhamman

@TomNicholas TomNicholas added the dependencies Updates a dependency label Aug 5, 2024
@TomNicholas
Copy link
Member

Coming back to this now as a lot of things have changed, so we might be able to make progress.

if you actually try to use kerchunk with zarrv3 it is broken

This is still the case.

xarray also needs to support zarrv3.

This is about to be the case, and there is a branch we can test it on. pydata/xarray#9552

it might be less work globally if VirtualiZarr didn't depend on Kerchunk

VirtualiZarr now does not depend on Kerchunk! #259 Though if you want to read netCDF files you do have to be able to import kerchunk, the dependency is just optional.

if you have 2 Python environments:

  1. One with Kerchunk and Zarr V2 and it is responsible for reading files and producing ChunkManifests, and you write these to disk or blobs or whatever.
  2. One with VirtualiZarr and ZarrV3, which takes the ChunkManifests produced by the first step to create the virtual datasets and ultimately read/write data..

This is now maybe within reach? The writing to disk could actually be done using the kerchunk format, as virtualizarr can write to and read from the kerchunk json/parquet format without ever importing kerchunk (reading is new, thanks to @norlandrhagen in #251).

At the very least now we should be able to see if new min-deps and upstream CI workflows (#263) pass.

we can aim for v3 support without a dependency on zarr-python v3

Adding a reader that can create virtual references from a standard zarr v3 store is tracked in #262.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
dependencies Updates a dependency Kerchunk Relating to the kerchunk library / specification itself upstream issue zarr-python Relevant to zarr-python upstream
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Zarr V2 vs V3
3 participants