Dataset api #334
Conversation
…urning a bool of the status (making it easier to fix errors in an interactive setting).
…stem in the jupyter notebook
…jupyter notebook.
…ization of a Record. Added tests.
…ystem rather than having to worry about passing around units to every function. This might be slightly non-ideal (I generally do not like global variables), but I think for the case of curation it might make the most sense given the hierarchical nature of construction and the need to validate at the level of properties.
The NPZ file that is generated from the dataset is somewhat dynamic (so we can't just look at the md5 checksum). To decide whether the code needs to regenerate that npz file or can use the one that already exists, we look at a metadata file that is written out (as a JSON file). This includes the md5 checksum of the hdf5 file and also the properties of interest that were used to generate the npz file. We need to also put the element filter in there and check it: if the element filter has changed, we need to regenerate the npz file (see the sketch below).
… from partial charges (and rescaling)
…fixed small bugs in hdf5 writer. The hdf5 writer now also includes the property type, which should make it easier upon reading to know what a property represents and how to convert it to a desired set of units.
…t sufficiently unique due to saturation of rings in some cases.
…ords are actually strings.
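To make the npz cache check described in the notes above concrete, here is a minimal sketch of that logic; the metadata filename, JSON keys, and function names are hypothetical, not necessarily what this PR implements.

```python
# Hypothetical sketch of the npz cache-validity check: regenerate the
# npz if the source hdf5, the properties of interest, or the element
# filter have changed since the metadata JSON was written.
import hashlib
import json
import os


def _md5_checksum(path: str) -> str:
    """Compute the md5 checksum of a file, reading in chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            md5.update(chunk)
    return md5.hexdigest()


def npz_is_current(hdf5_path, metadata_path, properties_of_interest, element_filter):
    """Return True if the cached npz file can be reused as-is."""
    if not os.path.exists(metadata_path):
        return False
    with open(metadata_path) as f:
        meta = json.load(f)
    return (
        meta.get("hdf5_md5") == _md5_checksum(hdf5_path)
        and meta.get("properties_of_interest") == properties_of_interest
        and meta.get("element_filter") == element_filter
    )
```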
Pull Request Summary
This adds a new module called "curate" that provides an API for dataset curation.
The general hierarchy is that each dataset contains records, and each record contains properties. Properties are defined using pydantic models (e.g., AtomicNumbers, Energy, Forces, PartialCharge, etc.). The pydantic property models enforce, for each value, units, a property type (e.g., length, energy, force), and a classification (per_atom, per_system), and also validate that the shape of the input array matches the expectation implied by the classification. Records collect the properties and validate that a consistent number of configurations and atoms exists across properties. The dataset class provides functions to write to an hdf5 file, converting units to the specified unit system. The dataset can also be initialized in an "append" mode, whereby properties associated with individual configurations can be added separately; the code will automatically append to the internal numpy array (performing any unit conversion necessary).
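As a rough illustration of this pattern (not the PR's actual API), the sketch below defines a single property model that carries units, a property type, and a classification, validates the array shape against that classification, converts values to a module-level unit system like the one mentioned in the commit notes, and writes the result to hdf5 with the property type stored as an attribute. All class, field, and file names here are assumptions.

```python
# Hypothetical sketch of a pydantic property model with units, property
# type, classification, and shape validation. Names are illustrative.
import h5py
import numpy as np
from openff.units import unit
from pydantic import BaseModel, ConfigDict, model_validator

# A module-level unit system, so curation functions do not need units
# passed to every call (the "global" approach noted in the commits).
GLOBAL_UNIT_SYSTEM = {
    "length": unit.nanometer,
    "energy": unit.kilojoule_per_mole,
    "force": unit.kilojoule_per_mole / unit.nanometer,
}


class Forces(BaseModel):
    """A per-atom force property with shape (n_configs, n_atoms, 3)."""

    model_config = ConfigDict(arbitrary_types_allowed=True)

    value: np.ndarray
    units: unit.Unit = unit.kilojoule_per_mole / unit.nanometer
    property_type: str = "force"
    classification: str = "per_atom"

    @model_validator(mode="after")
    def _check_shape(self):
        # per_atom properties carry one row per configuration and one
        # entry per atom: (configurations, atoms, components).
        if self.classification == "per_atom" and self.value.ndim != 3:
            raise ValueError(
                f"per_atom property expects a 3d array, got {self.value.ndim}d"
            )
        return self


# Hypothetical usage: build the property, convert to the global unit
# system, and store to hdf5 with units and property type as attributes.
forces = Forces(value=np.zeros((2, 5, 3)))
converted = (forces.value * forces.units).to(GLOBAL_UNIT_SYSTEM[forces.property_type])

with h5py.File("dataset.hdf5", "w") as f:
    ds = f.create_dataset("record_0/forces", data=converted.m)
    ds.attrs["units"] = str(converted.units)
    ds.attrs["property_type"] = forces.property_type
```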
This PR is still a WIP, as it will also require converting the old curation scripts to use the new API, and revising the dataset class in modelforge to accept the new format (minor changes to the terminology we use for classification).
Key changes
Notable points that this PR has either accomplished or will accomplish.
Questions
Associated Issue(s)
Pull Request Checklist