Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

H5MD format rewrite of Zarrtraj #34

Merged
merged 16 commits into from
Jun 29, 2024
Merged

H5MD format rewrite of Zarrtraj #34

merged 16 commits into from
Jun 29, 2024

Conversation

ljwoods2
Copy link
Collaborator

@ljwoods2 ljwoods2 commented Jun 23, 2024

Fixes #26, #27, #33

Changes made in this Pull Request:

  • Complete rewrite of zarrtraj
  • Kerchunk patched temporarily before zarr 3.0 is released OR kerchunk supports links
  • Tests should run faster now- all s3 tests run through one server and one bucket

Won't delete old zarrtraj files until PR is ready

Handling of H5MD flexibility

A ton has changed in this PR so I'll summarize how I'm handling H5MD flexibility here:

  1. Though H5MD says time data is optional, this could lead to problems in the MDAnalysis frame-based model- say an H5MD element is the only element which contains data at a certain integration step. Then MDAnalysis would be forced to produce a timestep without time data. This time can't be inferred since you could have

positions/step: [1, 2, 3]
positions/time: [1.0, 2.0, 3.0]
velocities/step: [1, 2, 3, 100]

Therefore, I think the best thing to do is to just require that all time-dependent data has fixed or explicit time.

  1. Links don't yet exist in Zarr, but in H5MD, it is required that positions/step is a hard link to box/edges/step. I handle this by just making sure the step datasets are all equal. In the kerchunk patch, linked datasets are simply duplicated, so this is the closest to the intended spec I can get on this requirement.

  2. Since there are both "global" and "per-particle group" observables, the "per-particle group" ones are loaded into ts.data as "<group_name>/<dataset_name>" and global ones are loaded as "<dataset_name>" to prevent name mangling

  3. In frames where time-dependent observables don't have any data, the ts.data key for that observable is deleted

  4. ALL datasets that have data of a certain kind (i.e. box and positions both have "length" data) have to have the same unit. You can't have "positions" and "velocities" with different time units though this may be allowed in H5MD.

  5. Though time-independent positions, velocities, and forces are allowed in H5MD, zarrtraj will fail out.

  6. If there is only one group in "/particles", zarrtraj will select it to read. However, if there is more than one, the "group" kwarg must be provided (since H5MD allows any number of groups) else zarrtraj will fail out rather than assume which group the user wanted.

PR Checklist

  • Tests?
  • Docs?
  • CHANGELOG updated?
  • Issue raised/referenced?

@ljwoods2 ljwoods2 marked this pull request as draft June 24, 2024 00:00
Copy link
Collaborator

@hmacdope hmacdope left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Only nitpicks mostly and some gentle encouragement to clean up some cruft. But overall this is looking fantastic. I say go ahead and merge and work from there. We can clean up later.

I also agree with the small liberties taken with the format. We can always re-asses if we come across files that break this.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this cruft?

- pip

### Core requirements ###
- MDAnalysis>=2.7.0
- zarr>=2.11.0
- kerchunk>=0.2.5
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is your kerchunk patch on a branch somewhere? If so we can do a bleeding edge pip based install with

 - pip:
    - git+git://github.com/lawson/kerchnk@my_patch

removing the need for a patch vendored with the code. Alternately you can keep the patch if you feel its easier.

else:
raise ValueError(
"Time units do not match across all time-dependent datasets."
)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Very consistent checking, great!

"""Prepares the correct method for managing the file
given the protocol"""
self._protocol = get_protocol(self.filename)
if self._protocol == "s3":
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Other cloud storage? What happens? Is this easy to abstract over?

if not self._elements[elem].is_time_independent():
if not self._elements[elem].has_time:
raise NoDataError(
"MDAnalysis requires that time data is available "
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think I am fine with this restriction. If we find people are making files without this then we can re-evaluate.

self._open_trajectory()

def Writer(self, filename, n_atoms=None, **kwargs):
"""Return writer for trajectory format
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is docstring accurate? Versionchanged is old, unsure if just missing that

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Tons of the docstrings are just copied from H5MD in the core codebase, haven't even touched them. I'll make sure to update all of these once I'm more confident I won't make massive changes

Comment on lines +712 to +714
# nb, filename strings can still get passed through if
# format='H5MD' is used
pass
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I assume we need this?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, this is actually something I want your thoughts on- should this reader take precedence over the core H5MDReader for reading H5MD files from disk if the user imports it? If so, I can just catch all files with the .zarrmd or .h5md file extension and don't need a format hint. If not, when and how should the user specify they want to use this reader to read their h5md file?

zarrtraj/ZARR.py Outdated
)


class H5MDWriter(base.WriterBase):
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Is this the same as the H5MDWriter from core verbatim? Can we not just import it?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Yes, it is just copied for now to have something to look at while I write the zarr writer, will delete this later

return self._read_frame(self._frame + 1)
if self._frame_seq is None:
self._frame_seq = collections.deque(range(self._n_frames))
print(self._frame_seq)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

remove print when ready

Comment on lines +11 to +23

def create_COORDINATES_MISSING_H5MD_GROUP(root):
del root["h5md"]


def main():
z = zarr.open_group(COORDINATES_SYNTHETIC_ZARRMD, mode="r")
outz = zarr.open_group(
"COORDINATES_MISSING_H5MD_GROUP_ZARRMD.zarrmd", mode="a"
)
zarr.convenience.copy_all(z, outz)
create_COORDINATES_MISSING_H5MD_GROUP(outz)

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can make these and the other scripts testing fixtures.

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If its easier for you, that way you can construct the h5md "file" in place.

@ljwoods2 ljwoods2 marked this pull request as ready for review June 29, 2024 22:15
@ljwoods2 ljwoods2 merged commit 1082ef9 into main Jun 29, 2024
2 of 19 checks passed
@ljwoods2 ljwoods2 deleted the decoupled-cache branch July 10, 2024 03:05
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Tests failing due to misuse of "rmdir"
2 participants