Dataset api #334
Conversation
…urning a bool of the status (making it easier to fix errors in an interactive setting).
…stem in the jupyter notebook
…jupyter notebook.
…ization of a Record. Added tests.
…ystem rather than having to worry about passing around units to every function. This might be slightly non-ideal (I generally do not like global variables), but I think for the case of curation it might make the most sense given the hierarchical nature of construction and the need to validate at the level of properties.
The NPZ file that is generated from the dataset is somewhat dynamic (so we can't just look at the md5 checksum). To decide whether the code needs to regenerate that npz file or can use the one that already exists, we look at a metadata file that is written out (as a JSON file). This includes the md5 checksum of the hdf5 file and also the properties of interest that were used to generate the npz file. We need to also put the element filter in there and check it: if the element filter has changed, we need to regenerate the npz file (see the sketch below).
… from partial charges (and rescaling)
…fixed small bugs in hdf5 writer. The hdf5 writer now also includes the property type, which should make it easier upon reading to know what a property represents and how to convert it to a desired set of units.
…t sufficiently unique due to saturation of rings in some cases.
…ords are actually strings.
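To make the npz cache check described in the notes above concrete, here is a minimal sketch of that logic; the metadata filename, JSON keys, and function names are hypothetical, not necessarily what this PR implements.

```python
# Hypothetical sketch of the npz cache-validity check: regenerate the
# npz if the source hdf5, the properties of interest, or the element
# filter have changed since the metadata JSON was written.
import hashlib
import json
import os


def _md5_checksum(path: str) -> str:
    """Compute the md5 checksum of a file, reading in chunks."""
    md5 = hashlib.md5()
    with open(path, "rb") as f:
        for chunk in iter(lambda: f.read(8192), b""):
            md5.update(chunk)
    return md5.hexdigest()


def npz_is_current(hdf5_path, metadata_path, properties_of_interest, element_filter):
    """Return True if the cached npz file can be reused as-is."""
    if not os.path.exists(metadata_path):
        return False
    with open(metadata_path) as f:
        meta = json.load(f)
    return (
        meta.get("hdf5_md5") == _md5_checksum(hdf5_path)
        and meta.get("properties_of_interest") == properties_of_interest
        and meta.get("element_filter") == element_filter
    )
```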
Pull Request Summary
This adds a new module called "curate" that provides an API for dataset curation.
The general hierarchy is that each dataset contains records, and each record contains properties. Properties are defined using pydantic models (e.g., AtomicNumbers, Energy, Forces, PartialCharge, etc.). The pydantic property models enforce, for each value, units, a property type (e.g., length, energy, force), and a classification (per_atom, per_system), and also validate that the shape of the input array matches the expectation implied by the classification. Records collect the properties and validate that a consistent number of configurations and atoms exists across properties. The dataset class provides functions to write to an hdf5 file, converting units to the specified unit system. The dataset can also be initialized in an "append" mode, whereby properties associated with individual configurations can be added separately; the code will automatically append to the internal numpy array (performing any unit conversion necessary).
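As a rough illustration of this pattern (not the PR's actual API), the sketch below defines a single property model that carries units, a property type, and a classification, validates the array shape against that classification, converts values to a module-level unit system like the one mentioned in the commit notes, and writes the result to hdf5 with the property type stored as an attribute. All class, field, and file names here are assumptions.

```python
# Hypothetical sketch of a pydantic property model with units, property
# type, classification, and shape validation. Names are illustrative.
import h5py
import numpy as np
from openff.units import unit
from pydantic import BaseModel, ConfigDict, model_validator

# A module-level unit system, so curation functions do not need units
# passed to every call (the "global" approach noted in the commits).
GLOBAL_UNIT_SYSTEM = {
    "length": unit.nanometer,
    "energy": unit.kilojoule_per_mole,
    "force": unit.kilojoule_per_mole / unit.nanometer,
}


class Forces(BaseModel):
    """A per-atom force property with shape (n_configs, n_atoms, 3)."""

    model_config = ConfigDict(arbitrary_types_allowed=True)

    value: np.ndarray
    units: unit.Unit = unit.kilojoule_per_mole / unit.nanometer
    property_type: str = "force"
    classification: str = "per_atom"

    @model_validator(mode="after")
    def _check_shape(self):
        # per_atom properties carry one row per configuration and one
        # entry per atom: (configurations, atoms, components).
        if self.classification == "per_atom" and self.value.ndim != 3:
            raise ValueError(
                f"per_atom property expects a 3d array, got {self.value.ndim}d"
            )
        return self


# Hypothetical usage: build the property, convert to the global unit
# system, and store to hdf5 with units and property type as attributes.
forces = Forces(value=np.zeros((2, 5, 3)))
converted = (forces.value * forces.units).to(GLOBAL_UNIT_SYSTEM[forces.property_type])

with h5py.File("dataset.hdf5", "w") as f:
    ds = f.create_dataset("record_0/forces", data=converted.m)
    ds.attrs["units"] = str(converted.units)
    ds.attrs["property_type"] = forces.property_type
```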
This PR is still a WIP, as it will also require converting the old curation scripts to use the new API, and revising the dataset class in modelforge to accept the new format (minor changes to the terminology we use for classification).
Key changes
Notable points that this PR has either accomplished or will accomplish.
Questions
Associated Issue(s)
Pull Request Checklist