
Read with xtensor-zarr, support v3 #34

Merged

Conversation

@davidbrochart (Contributor) commented Apr 20, 2021

It looks like xtensor-zarr cannot read zarr v2 written by z5py using blosc compression. What I can see:

  • z5py writes "fill_value": 0.0, which is not a valid unsigned 8-bit integer value, although it can be cast to it.
  • all chunk data files written by z5py have 16 more bytes (zeros) at the end, compared to all other implementations. I'm not sure this is a problem though.
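
On the first point, a reader can be tolerant: a floating-point fill_value is usable for an unsigned 8-bit dtype whenever the cast is lossless. A stdlib-only sketch of such a check (a hypothetical helper for illustration, not code from either library):

```python
def coerce_fill_value(fill_value, lo=0, hi=255):
    """Accept a numeric fill_value for a u1 dtype if the cast is lossless."""
    as_int = int(fill_value)
    # Reject lossy values like 0.5 and out-of-range values like 256,
    # but accept 0.0 -> 0, which is what z5py wrote.
    if as_int != fill_value or not lo <= as_int <= hi:
        raise ValueError(f"fill_value {fill_value!r} is not a valid u1 value")
    return as_int

print(coerce_fill_value(0.0))  # 0
```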

@grlee77 (Contributor) commented Apr 21, 2021

Thanks for working on this! I see the same behavior locally as on the CI here.

I could be wrong (I haven't worked with the BLOSC library previously), but it might be related to the addition of BLOSC_MAX_OVERHEAD in this line. (I think it was correct to include BLOSC_MAX_OVERHEAD in the sizeOut computation several lines above, but it does not need to also be added to sizeCompressed.)

Apparently BLOSC_MAX_OVERHEAD is 16
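
For comparison, here is the pattern in miniature, using zlib from the standard library as a stand-in for Blosc (the 16-byte constant and both function names are illustrative, not z5's actual code): the compressor needs a worst-case output buffer of input size plus overhead, but only the size it actually reports should be persisted.

```python
import zlib

OVERHEAD = 16  # stands in for BLOSC_MAX_OVERHEAD

def write_chunk_buggy(data: bytes) -> bytes:
    # The output buffer must be sized for the worst case
    # (len(data) + OVERHEAD), but persisting that worst-case size
    # keeps 16 trailing zero bytes that are not part of the stream.
    out = zlib.compress(data)
    return out + b"\x00" * OVERHEAD

def write_chunk_fixed(data: bytes) -> bytes:
    # Persist exactly the number of bytes the compressor produced.
    return zlib.compress(data)

data = b"zarr chunk " * 100
assert len(write_chunk_buggy(data)) - len(write_chunk_fixed(data)) == 16
assert zlib.decompress(write_chunk_fixed(data)) == data
```

A strict reader that checks the compressed size recorded in the header against the file length would reject the padded variant, which matches the 16 extra bytes observed in the z5py chunks.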

@grlee77 (Contributor) left a review comment:

This looks good to me. We can either wait for a fix of the failing z5py case or mark it as a known failure for now and enable the test again later.

Comment on lines 81 to 123
def read_with_xtensor_zarr(fpath, ds_name):
    if ds_name == "blosc":
        ds_name = "blosc/lz4"
    fname = "a.npz"
    if os.path.exists(fname):
        os.remove(fname)
    subprocess.check_call(["generate_data/xtensor_zarr/build/run_xtensor_zarr", fpath, ds_name])
    return np.load(fname)["a"]

Contributor:

This is a nice approach. I see that you built this .npz writer into the main.cpp program whenever two command line arguments are provided.

Contributor Author:

Yes, we use the same executable for writing and reading.

@davidbrochart (Contributor, author) commented Apr 21, 2021

> I could be wrong (I haven't worked with the BLOSC library previously), but it might be related to the addition of BLOSC_MAX_OVERHEAD in this line. (I think it was correct to include BLOSC_MAX_OVERHEAD in the sizeOut computation several lines above, but it does not need to also be added to sizeCompressed.)

Thanks for looking into it @grlee77. zarr-python seems to be more tolerant, as it can read z5py-zarr-blosc successfully, but that doesn't mean these trailing 16 bytes are valid. I'll investigate before we merge this PR.

@davidbrochart davidbrochart marked this pull request as draft April 21, 2021 06:41
@constantinpape (Collaborator) commented:

> Thanks for looking into it @grlee77. zarr-python seems to be more tolerant, as it can read z5py-zarr-blosc successfully, but that doesn't mean these trailing 16 bytes are valid. I'll investigate before we merge this PR.

Let me know what you find, should hopefully be a simple fix in z5. Maybe I am accidentally padding something when writing blosc.

@davidbrochart (Contributor, author) commented:

python-blosc doesn't seem to be able to read the chunks either:

>>> import blosc
>>> with open("data/z5py.zr/blosc/lz4/0.0.0", "rb") as f:
...     data = f.read()
... 
>>> d = blosc.decompress(data)
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "/home/david/mambaforge/envs/zarr_implementations-dev/lib/python3.8/site-packages/blosc/toplevel.py", line 594, in decompress
    return _ext.decompress(bytes_like, as_bytearray)
blosc_extension.error: Error 10032 : not a Blosc buffer or header info is corrupted

@constantinpape I think you should not store the maximum possible size of the compressed data. Blosc gives you the actual size after it has compressed; you can see how we do it in xtensor-io.

@constantinpape (Collaborator) commented:

Thanks, I will have a look.

@grlee77 (Contributor) commented Apr 21, 2021

> Let me know what you find, should hopefully be a simple fix in z5. Maybe I am accidentally padding something when writing blosc.

Hopefully it is as simple as removing BLOSC_MAX_OVERHEAD from this line.

@constantinpape (Collaborator) commented:

I had an initial look and I can reproduce this locally. Unfortunately, it looks like just removing the overhead here does not fix the issue.
But maybe I missed something; I'll take a closer look when I have time (not sure if I'll manage before the weekend).

@constantinpape (Collaborator) commented:

I had another look, and removing the BLOSC_MAX_OVERHEAD fixes the issue; I just didn't see it locally for some other reason.
I am trying to fix the Windows CI in z5 (constantinpape/z5#181) and will then draft a release to fix the tests here as well.

@constantinpape (Collaborator) commented:

Ok, I drafted a new release.
(The Windows build issues seem to be unrelated: ilammy/msvc-dev-cmd#34.)
It should be on conda-forge in the next few days; then we can update the dependencies here.

@constantinpape (Collaborator) commented:

I updated the env so that the correct z5py version is installed. That seems to work, but now a bunch of other tests fail.
I am not quite sure what's going on.

@davidbrochart (Contributor, author) commented:

It looks like there is a new nested parameter for the read functions? I will need to implement that.

@constantinpape (Collaborator) commented:

> It looks like there is a new nested parameter for the read functions? I will need to implement that.

I see, that's probably due to the zarr release that happened in the meantime. This might be solved by #33 already and it would be enough to rebase onto master.
Maybe @grlee77 or @joshmoore could comment.

@davidbrochart davidbrochart force-pushed the read_with_xtensor_zarr branch from 366b0b9 to 676f7d1 Compare April 26, 2021 18:17
@davidbrochart (Contributor, author) commented:

I just rebased but that is not enough. I guess that this nested argument is related to the dimension separator, I'm going to look into it.

@grlee77 (Contributor) commented Apr 26, 2021

@davidbrochart, I can help take a look at this. I think the issue is that the nested vs. flat here was implemented prior to dimension_separator, so I was relying on parsing the filenames to determine what the separator was.

Basically, there are v2 files where 'dimension_separator' is '/', but I don't think the .zr files currently have that attribute in the JSON; it is just being manually specified at read time, like here, based on the filename having "nested" or "flat" in it:

if nested:
    if 'FSStore' in str(fpath):
        store = zarr.storage.FSStore(
            os.fspath(fpath), key_separator='/', mode='r'
        )
    else:
        store = zarr.storage.NestedDirectoryStore(os.fspath(fpath))
else:
    if 'FSStore' in str(fpath):
        store = zarr.storage.FSStore(os.fspath(fpath))
    else:
        store = zarr.storage.DirectoryStore(fpath)

I think Josh said that the key_separator argument may be going away, so I should update the zarr-python generators/tests to rely on the dimension_separator key instead.
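
With zarr-python >= 2.8 the separator is recorded in the array metadata itself, so a reader can consult .zarray instead of parsing paths. A stdlib-only sketch (the metadata document below is illustrative, not taken from this repo):

```python
import json

# Illustrative v2 .zarray document as zarr-python >= 2.8 would write it
zarray = json.loads("""
{
  "zarr_format": 2,
  "shape": [4, 4],
  "chunks": [2, 2],
  "dtype": "|u1",
  "compressor": null,
  "fill_value": 0,
  "order": "C",
  "filters": null,
  "dimension_separator": "/"
}
""")

# Older writers omit the key; "." is the v2 default (flat layout)
separator = zarray.get("dimension_separator", ".")
chunk_key = separator.join(["0", "0"])  # "0/0" for nested chunks
```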

@grlee77 (Contributor) commented Apr 27, 2021

This PR will require zarr-python >= 2.8 so that the dimension_separator metadata key gets used. The conda-forge package for 2.8 was just uploaded, so it should work now. See: davidbrochart#1

Commit: "zarr 2.8 introduces the dimension_separator metadata key"
@davidbrochart (Contributor, author) commented:

Thanks @grlee77, all green now!

@davidbrochart davidbrochart marked this pull request as ready for review April 27, 2021 06:38
@joshmoore (Member) commented:

🚀 Merging so I can rebase the jzarr PR.

@joshmoore joshmoore merged commit 0bffc16 into zarr-developers:master Apr 27, 2021
@davidbrochart davidbrochart deleted the read_with_xtensor_zarr branch April 27, 2021 06:41