H5MD format rewrite of Zarrtraj #34

ljwoods2 · 2024-06-23T23:59:34Z

Fixes #26, #27, #33

Changes made in this Pull Request:

Complete rewrite of zarrtraj
Kerchunk patched temporarily before zarr 3.0 is released OR kerchunk supports links
Tests should run faster now- all s3 tests run through one server and one bucket

Won't delete old zarrtraj files until PR is ready

Handling of H5MD flexibility

A ton has changed in this PR so I'll summarize how I'm handling H5MD flexibility here:

Though H5MD says time data is optional, this could lead to problems in the MDAnalysis frame-based model- say an H5MD element is the only element which contains data at a certain integration step. Then MDAnalysis would be forced to produce a timestep without time data. This time can't be inferred since you could have

positions/step: [1, 2, 3]
positions/time: [1.0, 2.0, 3.0]
velocities/step: [1, 2, 3, 100]

Therefore, I think the best thing to do is to just require that all time-dependent data has fixed or explicit time.

Links don't yet exist in Zarr, but in H5MD, it is required that positions/step is a hard link to box/edges/step. I handle this by just making sure the step datasets are all equal. In the kerchunk patch, linked datasets are simply duplicated, so this is the closest to the intended spec I can get on this requirement.
Since there are both "global" and "per-particle group" observables, the "per-particle group" ones are loaded into ts.data as "<group_name>/<dataset_name>" and global ones are loaded as "<dataset_name>" to prevent name mangling
In frames where time-dependent observables don't have any data, the ts.data key for that observable is deleted
ALL datasets that have data of a certain kind (i.e. box and positions both have "length" data) have to have the same unit. You can't have "positions" and "velocities" with different time units though this may be allowed in H5MD.
Though time-independent positions, velocities, and forces are allowed in H5MD, zarrtraj will fail out.
If there is only one group in "/particles", zarrtraj will select it to read. However, if there is more than one, the "group" kwarg must be provided (since H5MD allows any number of groups) else zarrtraj will fail out rather than assume which group the user wanted.

PR Checklist

Tests?
Docs?
CHANGELOG updated?
Issue raised/referenced?

hmacdope

Only nitpicks mostly and some gentle encouragement to clean up some cruft. But overall this is looking fantastic. I say go ahead and merge and work from there. We can clean up later.

I also agree with the small liberties taken with the format. We can always re-asses if we come across files that break this.

hmacdope · 2024-06-25T09:11:40Z

PEG_1chain/PEG.prmtop

Is this cruft?

hmacdope · 2024-06-25T09:15:43Z

devtools/conda-envs/test_env.yaml

  - pip

  ### Core requirements ###
  - MDAnalysis>=2.7.0
  - zarr>=2.11.0
+  - kerchunk>=0.2.5


Is your kerchunk patch on a branch somewhere? If so we can do a bleeding edge pip based install with

- pip: - git+git://github.com/lawson/kerchnk@my_patch

removing the need for a patch vendored with the code. Alternately you can keep the patch if you feel its easier.

hmacdope · 2024-06-25T09:21:52Z

zarrtraj/ZARR.py

+        else:
+            raise ValueError(
+                "Time units do not match across all time-dependent datasets."
+            )


Very consistent checking, great!

hmacdope · 2024-06-25T09:22:30Z

zarrtraj/ZARR.py

+        """Prepares the correct method for managing the file
+        given the protocol"""
+        self._protocol = get_protocol(self.filename)
+        if self._protocol == "s3":


Other cloud storage? What happens? Is this easy to abstract over?

hmacdope · 2024-06-25T09:23:38Z

zarrtraj/ZARR.py

+            if not self._elements[elem].is_time_independent():
+                if not self._elements[elem].has_time:
+                    raise NoDataError(
+                        "MDAnalysis requires that time data is available "


I think I am fine with this restriction. If we find people are making files without this then we can re-evaluate.

hmacdope · 2024-06-25T09:24:36Z

zarrtraj/ZARR.py

+        self._open_trajectory()
+
+    def Writer(self, filename, n_atoms=None, **kwargs):
+        """Return writer for trajectory format


Is docstring accurate? Versionchanged is old, unsure if just missing that

Tons of the docstrings are just copied from H5MD in the core codebase, haven't even touched them. I'll make sure to update all of these once I'm more confident I won't make massive changes

hmacdope · 2024-06-25T09:26:34Z

zarrtraj/ZARR.py

+        # nb, filename strings can still get passed through if
+        # format='H5MD' is used
+        pass


I assume we need this?

Yes, this is actually something I want your thoughts on- should this reader take precedence over the core H5MDReader for reading H5MD files from disk if the user imports it? If so, I can just catch all files with the .zarrmd or .h5md file extension and don't need a format hint. If not, when and how should the user specify they want to use this reader to read their h5md file?

hmacdope · 2024-06-25T09:27:56Z

zarrtraj/ZARR.py

+        )
+
+
+class H5MDWriter(base.WriterBase):


Is this the same as the H5MDWriter from core verbatim? Can we not just import it?

Yes, it is just copied for now to have something to look at while I write the zarr writer, will delete this later

hmacdope · 2024-06-25T09:28:32Z

zarrtraj/ZARRTRAJ.py

-        return self._read_frame(self._frame + 1)
+        if self._frame_seq is None:
+            self._frame_seq = collections.deque(range(self._n_frames))
+        print(self._frame_seq)


remove print when ready

hmacdope · 2024-06-25T09:31:04Z

zarrtraj/data/create_COORDINATES_MISSING_H5MD_GROUP.py

+
+def create_COORDINATES_MISSING_H5MD_GROUP(root):
+    del root["h5md"]
+
+
+def main():
+    z = zarr.open_group(COORDINATES_SYNTHETIC_ZARRMD, mode="r")
+    outz = zarr.open_group(
+        "COORDINATES_MISSING_H5MD_GROUP_ZARRMD.zarrmd", mode="a"
+    )
+    zarr.convenience.copy_all(z, outz)
+    create_COORDINATES_MISSING_H5MD_GROUP(outz)
+


You can make these and the other scripts testing fixtures.

If its easier for you, that way you can construct the h5md "file" in place.

…to decoupled-cache

ljwoods2 added 12 commits May 24, 2024 18:11

Merge remote-tracking branch 'origin/main' into develop

c707dac

initial commit

46f6cd7

initial attempt cont.

89643f4

added chunks to reader.writer

05969b9

kerchunk experiments

50d6aa8

testfiles

7f1af79

testing h5mdreader

40d7dbe

h5md streaming working, no tests

ee33518

passing reader api

abb9769

testing updates

f445305

reader test coverage

fcbf4d6

docs

5539c54

ljwoods2 marked this pull request as draft June 24, 2024 00:00

hmacdope approved these changes Jun 25, 2024

View reviewed changes

ljwoods2 and others added 4 commits June 25, 2024 09:46

delete unused testfile

bbc7977

prep for writer

a20ba19

Merge branch 'decoupled-cache' of github.com:Becksteinlab/zarrtraj in…

11bd000

…to decoupled-cache

Merge branch 'main' into decoupled-cache

1354a62

ljwoods2 marked this pull request as ready for review June 29, 2024 22:15

ljwoods2 merged commit 1082ef9 into main Jun 29, 2024
2 of 19 checks passed

ljwoods2 deleted the decoupled-cache branch July 10, 2024 03:05

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

H5MD format rewrite of Zarrtraj #34

H5MD format rewrite of Zarrtraj #34

ljwoods2 commented Jun 23, 2024 •

edited

Loading

hmacdope left a comment

hmacdope Jun 25, 2024

hmacdope Jun 25, 2024

hmacdope Jun 25, 2024

hmacdope Jun 25, 2024

hmacdope Jun 25, 2024

hmacdope Jun 25, 2024

ljwoods2 Jun 25, 2024

hmacdope Jun 25, 2024

ljwoods2 Jun 25, 2024

hmacdope Jun 25, 2024

ljwoods2 Jun 25, 2024

hmacdope Jun 25, 2024

hmacdope Jun 25, 2024

hmacdope Jun 25, 2024

		)


		class H5MDWriter(base.WriterBase):

H5MD format rewrite of Zarrtraj #34

H5MD format rewrite of Zarrtraj #34

Conversation

ljwoods2 commented Jun 23, 2024 • edited Loading

Handling of H5MD flexibility

PR Checklist

hmacdope left a comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

Choose a reason for hiding this comment

ljwoods2 commented Jun 23, 2024 •

edited

Loading