monkey patching h5py #58

magland · 2024-04-25T13:58:16Z

magland
Apr 25, 2024
Maintainer

I made some PRs, but they are not ready for review.

I'm trying to get spyglass to work with .lindi.json files, but I am realizing that it's going to be pretty ugly to modify the code to use either h5py or lindi depending on the file type. For example, right now the path (string) is passed in to pynwb.NWBHDF5IO, so that would need to be refactored in a messy way throughout.

As a possible solution, I'm working on monkey-patching h5py so that it will automatically use LindiH5pyFile in the case where the file has extension .lindi.json. This would simplify things a lot as spyglass could essentially be left as is. More generally, this would make lindi much easier to use with existing code bases that use h5py.
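A minimal sketch of the dispatch idea, not lindi's actual implementation. A stand-in namespace plays the role of the h5py module so the example runs without h5py or lindi installed; `FakeH5File`, `FakeLindiFile`, `fake_h5py`, and `apply_patch` are all hypothetical names.

```python
import types


class FakeH5File:
    """Stand-in for h5py.File (hypothetical, for illustration only)."""

    def __init__(self, name, mode="r"):
        self.name, self.mode = name, mode


class FakeLindiFile:
    """Stand-in for LindiH5pyFile (hypothetical, for illustration only)."""

    def __init__(self, name, mode="r"):
        self.name, self.mode = name, mode


# Stand-in for the h5py module, so nothing real gets patched here.
fake_h5py = types.SimpleNamespace(File=FakeH5File)


def apply_patch(module):
    """Replace module.File with a factory that dispatches on the extension."""
    original = module.File

    def patched_file(name, mode="r", **kwargs):
        if isinstance(name, str) and name.endswith(".lindi.json"):
            return FakeLindiFile(name, mode)
        return original(name, mode, **kwargs)

    module.File = patched_file
    return original  # returned so the caller can undo the patch


original_file = apply_patch(fake_h5py)
print(type(fake_h5py.File("a.lindi.json")).__name__)  # FakeLindiFile
print(type(fake_h5py.File("a.nwb")).__name__)  # FakeH5File
```

One catch with a plain factory function: `isinstance(f, module.File)` raises a TypeError once `File` is no longer a class, so a real patch would probably need a class-based replacement.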

But this means that lindi must support the write modes (r+, w, w-, x, a) in a way that actually modifies the underlying .json file. I was hesitant about this, since rewriting a large-ish JSON file on every write operation is not ideal (as a side note, I remember you commented in one of the PRs that the existing approach of modifying only the in-memory rfs is not intuitive). But h5py.File has a flush() function that can be called at any time during writing and is also called on close. I think a good solution could be to write the .json file on every flush. Chunks can be consolidated at that point too.
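A rough sketch of the write-on-flush idea, assuming the reference file system is just a dict held in memory. `JsonBackedRefs` is a hypothetical class, not part of lindi, and the temp-file-plus-rename step is my own addition to keep readers from seeing a half-written file.

```python
import json
import os
import tempfile


class JsonBackedRefs:
    """Hypothetical sketch: in-memory refs dict written to disk only on flush."""

    def __init__(self, path):
        self.path = path
        self.refs = {}
        self._dirty = False

    def set(self, key, value):
        # Writes only touch memory; the .json file is left alone here.
        self.refs[key] = value
        self._dirty = True

    def flush(self):
        if not self._dirty:
            return
        # Write to a temp file in the same directory, then rename into place,
        # so a reader never sees a half-written .json file.
        directory = os.path.dirname(os.path.abspath(self.path))
        fd, tmp = tempfile.mkstemp(dir=directory, suffix=".tmp")
        with os.fdopen(fd, "w") as f:
            json.dump({"refs": self.refs}, f, indent=2, sort_keys=True)
        os.replace(tmp, self.path)
        self._dirty = False

    def close(self):
        self.flush()  # close flushes, as the post describes for h5py.File

    def __enter__(self):
        return self

    def __exit__(self, *exc):
        self.close()
```

With this shape, many small writes cost nothing on disk until the caller flushes or leaves the `with` block.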

The other consideration is that the h5py interface has no mechanism for setting a staging area for chunks. So I'm trying out a system where the staging area is by default [filename].lindi.json.d, a directory that is adjacent to the json file.
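The default staging location can be sketched as a pair of path helpers. Both function names are hypothetical, and the per-dataset subdirectory layout is an assumption on my part, not a documented lindi convention.

```python
import os


def staging_dir_for(json_path):
    """Return the default staging directory: the file path plus a '.d' suffix."""
    if not json_path.endswith(".lindi.json"):
        raise ValueError(f"expected a .lindi.json path, got {json_path!r}")
    return json_path + ".d"


def staged_chunk_path(json_path, dataset, chunk_file):
    """Path of a staged chunk file for a dataset (layout is an assumption)."""
    return os.path.join(staging_dir_for(json_path), dataset, chunk_file)


print(staging_dir_for("test.lindi.json"))  # test.lindi.json.d
```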

So the patched h5py would be used like this:

import lindi
import numpy as np
import h5py

# Proposed API: patch h5py so that .lindi.json paths open as LindiH5pyFile
lindi.apply_h5py_patch()


def test_patch():
    fname = 'test.lindi.json'
    # Write through the patched h5py.File; five chunks of 100000 values each
    with h5py.File(fname, 'w') as f:
        f.create_dataset('data', data=np.arange(500000, dtype=np.uint32), chunks=(100000,))

    # Read back through the same h5py interface and check the round trip
    with h5py.File(fname, 'r') as f:
        ds = f['data']
        assert isinstance(ds, h5py.Dataset)
        assert ds.shape == (500000,)
        assert ds.chunks == (100000,)
        assert ds.dtype == np.uint32
        assert np.all(ds[:] == np.arange(500000, dtype=np.uint32))


if __name__ == '__main__':
    test_patch()

And the resulting file, test.lindi.json, would be:

{
  "refs": {
    ".zgroup": {
      "zarr_format": 2
    },
    "data/.zarray": {
      "chunks": [
        100000
      ],
      "compressor": {
        "blocksize": 0,
        "clevel": 5,
        "cname": "lz4",
        "id": "blosc",
        "shuffle": 1
      },
      "dtype": "<u4",
      "fill_value": 0,
      "filters": null,
      "order": "C",
      "shape": [
        500000
      ],
      "zarr_format": 2
    },
    "data/0": [
      "{{u1}}",
      0,
      3464
    ],
    "data/1": [
      "{{u1}}",
      3464,
      3476
    ],
    "data/2": [
      "{{u1}}",
      6940,
      3467
    ],
    "data/3": [
      "{{u1}}",
      10407,
      3476
    ],
    "data/4": [
      "{{u1}}",
      13883,
      3472
    ]
  },
  "templates": {
    "u1": "/home/magland/src/lindi/test.lindi.json.d/data/consolidated.nekx99xc.0"
  }
}
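To make the reference file above easier to read: as I understand the format, each chunk entry is `[url, offset, length]`, and `{{u1}}` is expanded from the `templates` section. The sketch below uses a trimmed copy of that JSON; `resolve_ref` and `read_chunk_bytes` are hypothetical helpers, and the plain string replacement is a simplification of real template handling.

```python
import json

# A trimmed copy of the reference file shown above.
LINDI_JSON = """
{
  "refs": {
    "data/0": ["{{u1}}", 0, 3464],
    "data/1": ["{{u1}}", 3464, 3476],
    "data/2": ["{{u1}}", 6940, 3467],
    "data/3": ["{{u1}}", 10407, 3476],
    "data/4": ["{{u1}}", 13883, 3472]
  },
  "templates": {
    "u1": "/home/magland/src/lindi/test.lindi.json.d/data/consolidated.nekx99xc.0"
  }
}
"""


def resolve_ref(doc, key):
    """Return (path, offset, length) for a chunk key, expanding {{name}} templates."""
    url, offset, length = doc["refs"][key]
    for name, value in doc["templates"].items():
        url = url.replace("{{" + name + "}}", value)
    return url, offset, length


def read_chunk_bytes(doc, key):
    """Read the raw (still compressed) bytes of one chunk from a local file."""
    path, offset, length = resolve_ref(doc, key)
    with open(path, "rb") as f:
        f.seek(offset)
        return f.read(length)


doc = json.loads(LINDI_JSON)
path, offset, length = resolve_ref(doc, "data/2")
print(offset, length)  # 6940 3467

# Each chunk starts where the previous one ends in the consolidated file.
end = 0
for i in range(5):
    _, off, n = resolve_ref(doc, f"data/{i}")
    assert off == end
    end = off + n
```

The contiguity check passes for the offsets in the file above, which fits the idea that the five chunks were consolidated into a single staged file.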

rly · 2024-04-25T19:30:16Z

rly
Apr 25, 2024
Maintainer

> For example, right now the path (string) is passed in to pynwb.NWBHDF5IO, so that would need to be refactored in a messy way throughout.

This refactoring could be messy, but I am imagining this would take the form of a helper function get_file that takes an .nwb path string or a nwb.lindi.json path string and returns either the string for an .nwb path or a LindiH5pyFile. That would get passed to NWBHDF5IO. get_file would then have to be called before every NWBHDF5IO call. That doesn't seem too bad of a refactoring to me, but maybe I am missing something?

I agree that monkey-patching is the most straightforward approach and eases adoption, but I worry that it may be confusing to downstream users who see h5py and expect normal h5py behavior. As an analogy, if I wanted my library to use numpy with extra features, I think creating adapter/wrapper classes around numpy that add those features, and refactoring my library to use those classes, would be clearer than monkey-patching.

Wikipedia also lists some other pitfalls with monkey-patching regarding name clashes and future-proofing as the patched library changes behavior. I think those are less of an issue here, but worth considering.


magland · 2024-04-25T19:41:51Z

magland
Apr 25, 2024
Maintainer Author

> This refactoring could be messy, but I am imagining this would take the form of a helper function get_file that takes an .nwb path string or a nwb.lindi.json path string and returns either the string for an .nwb path or a LindiH5pyFile. That would get passed to NWBHDF5IO. get_file would then have to be called before every NWBHDF5IO call. That doesn't seem too bad of a refactoring to me, but maybe I am missing something?

I believe that NWBHDF5IO now accepts exactly one of path or file kwarg... so you can't just pass in a string or a file depending on the case. But the real annoying thing is that you need to close the file once the io goes out of context.

There are also other places in spyglass where h5py is used without pynwb.
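One reading of the close-the-file concern above: when an already opened file is handed to the io object, the caller has to close both, in the right order. A minimal sketch with a context manager follows; `open_io`, `make_file`, and `make_io` are hypothetical names, and the callables stand in for a LINDI file opener and an io constructor.

```python
from contextlib import contextmanager


@contextmanager
def open_io(make_file, make_io):
    """Open a file and an io object on top of it; close both on exit.

    make_file and make_io are caller-supplied callables (hypothetical),
    e.g. a LINDI file opener and a function wrapping an io constructor.
    """
    f = make_file()
    try:
        io = make_io(f)
        try:
            yield io
        finally:
            io.close()
    finally:
        f.close()  # the extra step that is easy to forget
```

The nesting means the file is closed even if constructing the io object fails.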

4 replies

oruebel Apr 25, 2024
Maintainer

> But the real annoying thing is that you need to close the file once the io goes out of context.

https://github.com/hdmf-dev/hdmf/pull/882/files may be relevant here. With that PR, the io object used for reading is automatically added as an attribute on the main Container being read, which in PyNWB is the NWBFile. This does two things: 1) the io is not garbage collected as long as the NWBFile is still around, and 2) you can easily get to the io object used for reading through the NWBFile.read_io property.
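A toy illustration of that mechanism, using stand-in classes and not hdmf's actual code. `FakeIO`, `FakeContainer`, `read`, and `load` are hypothetical; the weak reference is only there to show the io object outliving the local variable that created it.

```python
import weakref


class FakeIO:
    """Stand-in for an io object (hypothetical)."""

    def __init__(self):
        self.closed = False

    def close(self):
        self.closed = True


class FakeContainer:
    """Stand-in for the container that was read, e.g. an NWBFile."""

    def __init__(self):
        self._read_io = None

    @property
    def read_io(self):
        return self._read_io


def read(io):
    container = FakeContainer()
    container._read_io = io  # the reader attaches itself to what it read
    return container


def load():
    io = FakeIO()
    return read(io), weakref.ref(io)  # the local name `io` is dropped on return


nwbfile, io_ref = load()
assert io_ref() is nwbfile.read_io  # io is still alive, held by the container
nwbfile.read_io.close()  # and the caller can reach it to close it
```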

oruebel Apr 25, 2024
Maintainer

> I believe that NWBHDF5IO now accepts exactly one of path or file kwarg... so you can't just pass in a string or a file depending on the case.

I believe this is correct. It is up to the caller to decide whether they want to hand in an already opened file or a path.

rly Apr 25, 2024
Maintainer

> I believe that NWBHDF5IO now accepts exactly one of path or file kwarg... so you can't just pass in a string or a file depending on the case.

We could pass in a different dictionary of kwargs depending on the case. I think the main change point is:
https://github.com/LorenFrankLab/spyglass/blob/master/src/spyglass/utils/nwb_helper_fn.py#L65-L68

It seems like there are some cases in Spyglass where that function should be called instead of calling NWBHDF5IO directly for read. For appending data, that's probably more of a case-by-case refactoring...

Or you could wrap NWBHDF5IO and replace calls to NWBHDF5IO with a new NWBLINDIIO (terrible name) or something. (Ideally, this would all be handled within PyNWB itself, but I am hesitant to incorporate LINDI into PyNWB until after we do more testing.)
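The kwargs-per-case idea above could look roughly like this. `nwb_io_kwargs` and `open_lindi` are hypothetical names; `open_lindi` is whatever callable the caller uses to turn a .lindi.json path into an h5py-like file object, so the sketch does not depend on lindi or pynwb being installed.

```python
def nwb_io_kwargs(path, open_lindi, mode="r"):
    """Build keyword arguments for an NWBHDF5IO-style constructor.

    Exactly one of 'path' or 'file' is set, depending on the extension.
    """
    if path.endswith(".lindi.json"):
        return {"file": open_lindi(path), "mode": mode}
    return {"path": path, "mode": mode}


# Example with a stub opener that just tags the path.
print(nwb_io_kwargs("a.nwb", open_lindi=lambda p: ("opened", p)))
print(nwb_io_kwargs("a.nwb.lindi.json", open_lindi=lambda p: ("opened", p)))
```

The result would then be splatted into the constructor, e.g. `NWBHDF5IO(**kwargs)`. The caller still owns the opened file in the LINDI case and has to close it.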

> But the real annoying thing is that you need to close the file once the io goes out of context.

Could you elaborate?

rly Apr 25, 2024
Maintainer

^ that solution does not help with the three calls to h5py that I see though.

magland · 2024-04-25T19:46:13Z

magland
Apr 25, 2024
Maintainer Author

@rly yeah there are some definite pitfalls. I'll think about this some more.


oruebel · 2024-04-25T20:07:43Z

oruebel
Apr 25, 2024
Maintainer

> I agree that monkey-patching is the most straightforward approach and eases adoption, but I worry that it may be confusing to downstream users who see h5py and expect normal h5py behavior.

Part of the concern is that you want LINDI to be seen as a "good citizen" in the broader ecosystem, i.e., we should avoid modifying the behavior of other libraries without making it explicit that those changes are happening. Creating adapter/wrapper classes is one way to make this explicit, because a user clearly sees that they are using a different type. If that is possible, then I think it is a good choice. In general, monkey-patching is often a convenience, but I try to avoid it where possible because it is not transparent. If we really decide that monkey-patching is the way to go, then I think the question is whether we can make it explicit to the user that this is happening. For example, would it be possible to require the user to explicitly initiate the monkey-patch by importing a particular name, e.g., from lindi import patch_h5py, so that the user can decide whether to patch or not?


magland · 2024-04-25T20:39:12Z

magland
Apr 25, 2024
Maintainer Author

Okay makes sense. I'll move away from the monkey patch idea, and provide a more involved but more explicit PR to spyglass.


magland · 2024-04-25T21:40:49Z

magland
Apr 25, 2024
Maintainer Author

See the proposed spyglass changes here

LorenFrankLab/spyglass#947

