Zarr V3 metadata fixes #248

Open
wants to merge 6 commits into
base: main

@LDeakin LDeakin commented Oct 6, 2024

This fixes several problems with zarr.json metadata that I noticed when implementing a chunk manifest storage transformer.

  • Chunk key encoding should be v2 with . separator to match manifest.json
    • the chunk-manifest-json storage transformer should not need to be aware of the chunk key encoding
  • Fix the fill value defaulting to NaN for integer arrays
  • Encode non-finite float fill values as strings
  • Default bytes codec "endian" to "little"
    • This needs to be addressed properly at some point
  • Fix tests using null fill value or nan fill value for integer data type
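
The fill value and bytes codec items above could look roughly like the following. This is a minimal sketch, not the code in this PR: `encode_fill_value` and `bytes_codec` are hypothetical helper names, the zero fallback for a missing fill value is an assumption, and only integer and float data types are handled. The string spellings "NaN", "Infinity" and "-Infinity" are the ones the Zarr V3 spec uses for non-finite floats.

```python
import json
import math


def encode_fill_value(fill_value, dtype_name):
    """Return a JSON-serialisable fill value for zarr.json (hypothetical helper)."""
    is_float = dtype_name.startswith("float")
    if fill_value is None:
        # Assumption: fall back to zero rather than NaN when no fill value is set.
        return 0.0 if is_float else 0
    if is_float:
        value = float(fill_value)
        # Non-finite floats have no JSON number form, so they are written as strings.
        if math.isnan(value):
            return "NaN"
        if math.isinf(value):
            return "Infinity" if value > 0 else "-Infinity"
        return value
    # Integer data types cannot hold NaN, infinity, or fractional values.
    if isinstance(fill_value, float) and not fill_value.is_integer():
        raise ValueError(f"{fill_value!r} is not a valid fill value for {dtype_name}")
    return int(fill_value)


def bytes_codec(endian="little"):
    """Build bytes codec metadata, defaulting "endian" to "little" (hypothetical helper)."""
    return {"name": "bytes", "configuration": {"endian": endian}}


# The encoded value survives a strict JSON round trip.
print(json.dumps({"fill_value": encode_fill_value(float("nan"), "float64")}, allow_nan=False))
```

Without the string encoding, `json.dumps` would emit a bare `NaN` token by default, which is not valid JSON.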

Author

LDeakin commented Oct 6, 2024

Type-checking failures seem unrelated

@LDeakin LDeakin marked this pull request as ready for review October 6, 2024 23:05
@@ -92,8 +92,8 @@ def zarr_v3_array_metadata(zarray: ZArray, dim_names: list[str], attrs: dict) ->
         "configuration": {"chunk_shape": metadata.pop("chunks")},
     }
     metadata["chunk_key_encoding"] = {
-        "name": "default",
-        "configuration": {"separator": "/"},
+        "name": "v2",
Member

This seems wrong? For writing v3 metadata?

In general if we're not planning to use this format any more (see #262 (comment)), how much of this PR do you want to keep @LDeakin ?

Member

Presumably all the rest of the fixes are still relevant?

Author

This seems wrong? For writing v3 metadata?

The chunk manifest example in zarr-developers/zarr-specs#287 and virtualizarr produces "0.0" style chunk key encoding, which is v2 with . separator. default with / would be "c/0/0".

If the chunk key encoding of the array and the chunk manifest matches, then the chunk-manifest-json storage transformer does not need to concern itself with chunk key encodings, which makes sense to me.
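
To make the difference concrete, here is a small sketch of the two encodings as the Zarr V3 spec defines them. `encode_chunk_key` is a hypothetical helper, and the manifest entries (paths, offsets, lengths) are made-up values in the style of the chunk manifest example, not real data.

```python
def encode_chunk_key(coords, name="default", separator=None):
    """Encode chunk grid coordinates as a store key (hypothetical helper)."""
    if name == "default":
        # "default" prefixes keys with "c" and uses "/" unless told otherwise.
        sep = "/" if separator is None else separator
        return sep.join(["c", *map(str, coords)])
    if name == "v2":
        # "v2" has no prefix and uses "." unless told otherwise.
        sep = "." if separator is None else separator
        return sep.join(map(str, coords)) if coords else "0"
    raise ValueError(f"unknown chunk key encoding: {name!r}")


# Array metadata matching the manifest's key style.
chunk_key_encoding = {"name": "v2", "configuration": {"separator": "."}}

# Illustrative manifest entries only; the values are assumptions.
manifest = {
    "0.0": {"path": "s3://bucket/foo.nc", "offset": 100, "length": 100},
    "0.1": {"path": "s3://bucket/foo.nc", "offset": 200, "length": 100},
}

key = encode_chunk_key(
    (0, 1),
    chunk_key_encoding["name"],
    chunk_key_encoding["configuration"]["separator"],
)
print(key, manifest[key])
print(encode_chunk_key((0, 1), "default", "/"))
```

With matching encodings, the key the array asks for is already a manifest key, so the lookup needs no translation step. With `default` and `/`, the same chunk would be requested as "c/0/1" and the transformer would have to map it back.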

In general if we're not planning to use this format any more (see #262 (comment)), how much of this PR do you want to keep @LDeakin ?

Not fussed, this PR was just the minimal changes I needed to use the chunk-manifest-json as currently spec'd and produced by virtualizarr. I'd hope most of these changes would be superseded by bringing in zarr-python V3 as a dependency anyway.

I haven't looked thoroughly at the spec for icechunk yet, but do you see it replacing chunk-manifest-json entirely? Can the time travel stuff be decoupled from the chunk manifests?

Member

The chunk manifest example in zarr-developers/zarr-specs#287 and virtualizarr produces "0.0" style chunk key encoding, which is v2 with . separator. default with / would be "c/0/0".

My intention was to test out writing to and reading from a v3-compatible json-based chunk manifest spec. If what I actually did looks more like v2 then that's my bad for not understanding the spec properly!

Not fussed, this PR was just the minimal changes I needed to use the chunk-manifest-json as currently spec'd and produced by virtualizarr. I'd hope most of these changes would be superseded by bringing in zarr-python V3 as a dependency anyway.

Okay thanks. Maybe we get virtualizarr working fully, then look at the updated diff, as I would expect @mpiannucci's efforts on icechunk compatibility should iron out similar concerns around fill values?

I'd hope most of these changes would be superseded by bringing in zarr-python V3 as a dependency anyway.

👍 We're close to being able to do that now that zarr-python v3 alpha (beta today actually) is out.

I haven't looked thoroughly at the spec for icechunk yet, but do you see it replacing chunk-manifest-json entirely?

I think that is Earthmover's intention.

Can the time travel stuff be decoupled from the chunk manifests?

In theory it probably could, but in practice, unless there is a strong use case for chunk manifests where you wouldn't also want all the other features of icechunk, I'm not really sure why you would bother separating them. All the features of icechunk are closely related in that they all involve/require adding a new layer of indirection into the store, i.e. the manifests + snapshots (which are kind of like time-stamped consolidated metadata IIUC). This question deserves discussion on that zarr spec proposal issue though.

Member

This question deserves discussion on that zarr spec proposal issue though.

I've asked in zarr-developers/zarr-specs#287 (comment)

2 participants