-
Notifications
You must be signed in to change notification settings - Fork 2
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
H5MD format rewrite of Zarrtraj #34
Conversation
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Only nitpicks mostly and some gentle encouragement to clean up some cruft. But overall this is looking fantastic. I say go ahead and merge and work from there. We can clean up later.
I also agree with the small liberties taken with the format. We can always re-asses if we come across files that break this.
PEG_1chain/PEG.prmtop
Outdated
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this cruft?
- pip | ||
|
||
### Core requirements ### | ||
- MDAnalysis>=2.7.0 | ||
- zarr>=2.11.0 | ||
- kerchunk>=0.2.5 |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is your kerchunk patch on a branch somewhere? If so we can do a bleeding edge pip based install with
- pip:
- git+git://github.com/lawson/kerchnk@my_patch
removing the need for a patch vendored with the code. Alternately you can keep the patch if you feel its easier.
else: | ||
raise ValueError( | ||
"Time units do not match across all time-dependent datasets." | ||
) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Very consistent checking, great!
"""Prepares the correct method for managing the file | ||
given the protocol""" | ||
self._protocol = get_protocol(self.filename) | ||
if self._protocol == "s3": |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Other cloud storage? What happens? Is this easy to abstract over?
if not self._elements[elem].is_time_independent(): | ||
if not self._elements[elem].has_time: | ||
raise NoDataError( | ||
"MDAnalysis requires that time data is available " |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think I am fine with this restriction. If we find people are making files without this then we can re-evaluate.
self._open_trajectory() | ||
|
||
def Writer(self, filename, n_atoms=None, **kwargs): | ||
"""Return writer for trajectory format |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is docstring accurate? Versionchanged is old, unsure if just missing that
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Tons of the docstrings are just copied from H5MD in the core codebase, haven't even touched them. I'll make sure to update all of these once I'm more confident I won't make massive changes
# nb, filename strings can still get passed through if | ||
# format='H5MD' is used | ||
pass |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I assume we need this?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, this is actually something I want your thoughts on- should this reader take precedence over the core H5MDReader for reading H5MD files from disk if the user imports it? If so, I can just catch all files with the .zarrmd or .h5md file extension and don't need a format hint. If not, when and how should the user specify they want to use this reader to read their h5md file?
zarrtraj/ZARR.py
Outdated
) | ||
|
||
|
||
class H5MDWriter(base.WriterBase): |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Is this the same as the H5MDWriter from core
verbatim? Can we not just import it?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, it is just copied for now to have something to look at while I write the zarr writer, will delete this later
return self._read_frame(self._frame + 1) | ||
if self._frame_seq is None: | ||
self._frame_seq = collections.deque(range(self._n_frames)) | ||
print(self._frame_seq) |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
remove print when ready
|
||
def create_COORDINATES_MISSING_H5MD_GROUP(root): | ||
del root["h5md"] | ||
|
||
|
||
def main(): | ||
z = zarr.open_group(COORDINATES_SYNTHETIC_ZARRMD, mode="r") | ||
outz = zarr.open_group( | ||
"COORDINATES_MISSING_H5MD_GROUP_ZARRMD.zarrmd", mode="a" | ||
) | ||
zarr.convenience.copy_all(z, outz) | ||
create_COORDINATES_MISSING_H5MD_GROUP(outz) | ||
|
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
You can make these and the other scripts testing fixtures.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
If its easier for you, that way you can construct the h5md "file" in place.
Fixes #26, #27, #33
Changes made in this Pull Request:
Won't delete old zarrtraj files until PR is ready
Handling of H5MD flexibility
A ton has changed in this PR so I'll summarize how I'm handling H5MD flexibility here:
positions/step: [1, 2, 3]
positions/time: [1.0, 2.0, 3.0]
velocities/step: [1, 2, 3, 100]
Therefore, I think the best thing to do is to just require that all time-dependent data has fixed or explicit time.
Links don't yet exist in Zarr, but in H5MD, it is required that positions/step is a hard link to box/edges/step. I handle this by just making sure the step datasets are all equal. In the kerchunk patch, linked datasets are simply duplicated, so this is the closest to the intended spec I can get on this requirement.
Since there are both "global" and "per-particle group" observables, the "per-particle group" ones are loaded into ts.data as "<group_name>/<dataset_name>" and global ones are loaded as "<dataset_name>" to prevent name mangling
In frames where time-dependent observables don't have any data, the ts.data key for that observable is deleted
ALL datasets that have data of a certain kind (i.e. box and positions both have "length" data) have to have the same unit. You can't have "positions" and "velocities" with different time units though this may be allowed in H5MD.
Though time-independent positions, velocities, and forces are allowed in H5MD, zarrtraj will fail out.
If there is only one group in "/particles", zarrtraj will select it to read. However, if there is more than one, the "group" kwarg must be provided (since H5MD allows any number of groups) else zarrtraj will fail out rather than assume which group the user wanted.
PR Checklist