
Read with xtensor-zarr, support v3 #34

Merged

Conversation

@davidbrochart (Contributor) commented Apr 20, 2021

It looks like xtensor-zarr cannot read zarr v2 written by z5py using blosc compression. What I can see:

  • z5py writes "fill_value": 0.0, which is not a valid unsigned 8-bit integer value, although it can be cast to it.
  • all chunk data files written by z5py have 16 more bytes (zeros) at the end, compared to all other implementations. I'm not sure this is a problem though.
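
On the first point, a reader can be tolerant: a floating-point fill_value is usable for an unsigned 8-bit dtype whenever the cast is lossless. A stdlib-only sketch of such a check (a hypothetical helper for illustration, not code from either library):

```python
def coerce_fill_value(fill_value, lo=0, hi=255):
    """Accept a numeric fill_value for a u1 dtype if the cast is lossless."""
    as_int = int(fill_value)
    # Reject lossy values like 0.5 and out-of-range values like 256,
    # but accept 0.0 -> 0, which is what z5py wrote.
    if as_int != fill_value or not lo <= as_int <= hi:
        raise ValueError(f"fill_value {fill_value!r} is not a valid u1 value")
    return as_int

print(coerce_fill_value(0.0))  # 0
```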

@grlee77 (Contributor) commented Apr 21, 2021

Thanks for working on this! I see the same behavior locally as on the CI here.

I could be wrong (I haven't worked with the BLOSC library previously), but it might be related to the addition of BLOSC_MAX_OVERHEAD in this line. (I think it was correct to include BLOSC_MAX_OVERHEAD in the sizeOut computation several lines above, but it does not need to also be added to sizeCompressed.)

Apparently BLOSC_MAX_OVERHEAD is 16
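
For comparison, here is the pattern in miniature, using zlib from the standard library as a stand-in for Blosc (the 16-byte constant and both function names are illustrative, not z5's actual code): the compressor needs a worst-case output buffer of input size plus overhead, but only the size it actually reports should be persisted.

```python
import zlib

OVERHEAD = 16  # stands in for BLOSC_MAX_OVERHEAD

def write_chunk_buggy(data: bytes) -> bytes:
    # The output buffer must be sized for the worst case
    # (len(data) + OVERHEAD), but persisting that worst-case size
    # keeps 16 trailing zero bytes that are not part of the stream.
    out = zlib.compress(data)
    return out + b"\x00" * OVERHEAD

def write_chunk_fixed(data: bytes) -> bytes:
    # Persist exactly the number of bytes the compressor produced.
    return zlib.compress(data)

data = b"zarr chunk " * 100
assert len(write_chunk_buggy(data)) - len(write_chunk_fixed(data)) == 16
assert zlib.decompress(write_chunk_fixed(data)) == data
```

A strict reader that checks the compressed size recorded in the header against the file length would reject the padded variant, which matches the 16 extra bytes observed in the z5py chunks.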

@grlee77 (Contributor) left a review comment:

This looks good to me. We can either wait for a fix of the failing z5py case or mark it as a known failure for now and enable the test again later.

Comment on lines 81 to 123
def read_with_xtensor_zarr(fpath, ds_name):
    if ds_name == "blosc":
        ds_name = "blosc/lz4"
    fname = "a.npz"
    if os.path.exists(fname):
        os.remove(fname)
    subprocess.check_call(["generate_data/xtensor_zarr/build/run_xtensor_zarr", fpath, ds_name])
    return np.load(fname)["a"]

Contributor:

This is a nice approach. I see that you built this .npz writer into the main.cpp program whenever two command line arguments are provided.

Contributor Author:

Yes, we use the same executable for writing and reading.

@davidbrochart (Contributor, author) commented Apr 21, 2021

> I could be wrong (I haven't worked with the BLOSC library previously), but it might be related to the addition of BLOSC_MAX_OVERHEAD in this line. (I think it was correct to include BLOSC_MAX_OVERHEAD in the sizeOut computation several lines above, but it does not need to also be added to sizeCompressed.)

Thanks for looking into it @grlee77. zarr-python seems to be more tolerant, as it can read z5py-zarr-blosc successfully, but that doesn't mean these trailing 16 bytes are valid. I'll investigate before we merge this PR.

@davidbrochart davidbrochart marked this pull request as draft April 21, 2021 06:41
@constantinpape (Collaborator) commented:

> Thanks for looking into it @grlee77. zarr-python seems to be more tolerant, as it can read z5py-zarr-blosc successfully, but that doesn't mean these trailing 16 bytes are valid. I'll investigate before we merge this PR.

Let me know what you find, should hopefully be a simple fix in z5. Maybe I am accidentally padding something when writing blosc.

@davidbrochart (Contributor, author) commented:

python-blosc doesn't seem to be able to read the chunks either:

>>> import blosc
>>> with open("data/z5py.zr/blosc/lz4/0.0.0", "rb") as f:
...     data = f.read()
... 
>>> d = blosc.decompress(data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/david/mambaforge/envs/zarr_implementations-dev/lib/python3.8/site-packages/blosc/toplevel.py", line 594, in decompress
    return _ext.decompress(bytes_like, as_bytearray)
blosc_extension.error: Error 10032 : not a Blosc buffer or header info is corrupted

@constantinpape I think you should not store the maximum possible size of the compressed data. Blosc gives you the actual size after it has compressed; you can see how we do it in xtensor-io.

@constantinpape (Collaborator) commented:

Thanks, I will have a look.

@grlee77 (Contributor) commented Apr 21, 2021

> Let me know what you find, should hopefully be a simple fix in z5. Maybe I am accidentally padding something when writing blosc.

Hopefully it is as simple as removing BLOSC_MAX_OVERHEAD from this line.

@constantinpape (Collaborator) commented:

I had an initial look and I can reproduce this locally. Unfortunately, it looks like just removing the overhead here does not fix the issue.
But maybe I missed something; I'll take a closer look when I have time (not sure if I'll manage before the weekend).

@constantinpape (Collaborator) commented:

I had another look, and removing the BLOSC_MAX_OVERHEAD fixes the issue; I just didn't see it locally for some other reason.
I am trying to fix the Windows CI in z5 (constantinpape/z5#181) and will then draft a release to fix the tests here as well.

@constantinpape (Collaborator) commented:

Ok, I drafted a new release.
(The Windows build issues seem to be unrelated: ilammy/msvc-dev-cmd#34.)
It should be on conda-forge in the next few days; then we can update the dependencies here.

@constantinpape (Collaborator) commented:

I updated the env so that the correct z5py version is installed. That seems to work, but now a bunch of other tests fail.
I am not quite sure what's going on.

@davidbrochart (Contributor, author) commented:

It looks like there is a new nested parameter for the read functions? I will need to implement that.

@constantinpape (Collaborator) commented:

> It looks like there is a new nested parameter for the read functions? I will need to implement that.

I see, that's probably due to the zarr release that happened in the meantime. This might be solved by #33 already and it would be enough to rebase onto master.
Maybe @grlee77 or @joshmoore could comment.

@davidbrochart davidbrochart force-pushed the read_with_xtensor_zarr branch from 366b0b9 to 676f7d1 Compare April 26, 2021 18:17
@davidbrochart (Contributor, author) commented:

I just rebased but that is not enough. I guess that this nested argument is related to the dimension separator, I'm going to look into it.

@grlee77 (Contributor) commented Apr 26, 2021

@davidbrochart, I can help take a look at this. I think the issue is that the nested vs. flat here was implemented prior to dimension_separator, so I was relying on parsing the filenames to determine what the separator was.

Basically, there are v2 files where 'dimension_separator' is '/', but I don't think the .zr files currently have that attribute in the JSON; it is just being manually specified at read time, like here, based on the filename having "nested" or "flat" in it:

if nested:
    if 'FSStore' in str(fpath):
        store = zarr.storage.FSStore(
            os.fspath(fpath), key_separator='/', mode='r'
        )
    else:
        store = zarr.storage.NestedDirectoryStore(os.fspath(fpath))
else:
    if 'FSStore' in str(fpath):
        store = zarr.storage.FSStore(os.fspath(fpath))
    else:
        store = zarr.storage.DirectoryStore(fpath)

I think Josh said that the key_separator argument may be going away, so I should update the zarr-python generators/tests to rely on the dimension_separator key instead.
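
With zarr-python >= 2.8 the separator is recorded in the array metadata itself, so a reader can consult .zarray instead of parsing paths. A stdlib-only sketch (the metadata document below is illustrative, not taken from this repo):

```python
import json

# Illustrative v2 .zarray document as zarr-python >= 2.8 would write it
zarray = json.loads("""
{
  "zarr_format": 2,
  "shape": [4, 4],
  "chunks": [2, 2],
  "dtype": "|u1",
  "compressor": null,
  "fill_value": 0,
  "order": "C",
  "filters": null,
  "dimension_separator": "/"
}
""")

# Older writers omit the key; "." is the v2 default (flat layout)
separator = zarray.get("dimension_separator", ".")
chunk_key = separator.join(["0", "0"])  # "0/0" for nested chunks
```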

@grlee77 (Contributor) commented Apr 27, 2021

This PR will require zarr-python >= 2.8 so that the dimension_separator metadata key gets used. The conda-forge package for 2.8 was just uploaded, so it should work now. See: davidbrochart#1

Commit: "zarr 2.8 introduces the dimension_separator metadata key"
@davidbrochart (Contributor, author) commented:

Thanks @grlee77, all green now!

@davidbrochart davidbrochart marked this pull request as ready for review April 27, 2021 06:38
@joshmoore (Member) commented:

🚀 Merging so I can rebase the jzarr PR.

@joshmoore joshmoore merged commit 0bffc16 into zarr-developers:master Apr 27, 2021
@davidbrochart davidbrochart deleted the read_with_xtensor_zarr branch April 27, 2021 06:41