Moving away from pickles #18

NeoChaos12 · 2023-01-13T16:33:41Z

Pickling our performance datasets and surrogate models was a convenient choice during the development phase of the repo, but it is inherently problematic for long-term maintenance and support. In particular, pickles make the shared datasets and models very sensitive to the exact dependency versions and system configuration used when they were originally pickled. Case in point: #6. A future release should therefore focus on moving away from pickles. Specifically:

Performance Datasets: This should be relatively straightforward, since Pandas DataFrames are quite flexible and support a variety of I/O options. Viable formats include HDF5, Feather and CSV. Each has its own advantages and disadvantages, which need to be weighed carefully. It may be fine to keep using pickles for the interim data (checkpoints, metrics) generated during model training and to switch to a different format only for the final dataset.
Surrogate Models: This is more nuanced, since it involves a large number of moving parts. A different serialization scheme will need to be chosen based on the interoperability of scikit-learn, XGBoost and jahs_bench.surrogates.model.XGBSurrogate. Ultimately, it may be necessary to write a custom JSON encoder for XGBSurrogate that captures all relevant parameters of the trained model object and can save and load them.
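
To make the custom JSON encoder idea concrete, here is a minimal sketch using only the standard-library `json` module. Everything in it is an assumption for illustration: `ToySurrogate` is a hypothetical stand-in and does not reflect the actual attributes of `XGBSurrogate`, and the field names, the `__type__` tag and `FORMAT_VERSION` are invented. XGBoost can already write a trained model to JSON through its own `save_model` method, so an encoder along these lines would mainly need to cover the wrapper's own state around that.

```python
import json
from dataclasses import dataclass

# Hypothetical on-disk format version, so that future loaders can detect
# and reject (or migrate) files written by an older layout.
FORMAT_VERSION = 1


@dataclass
class ToySurrogate:
    """Hypothetical stand-in for a trained surrogate (NOT the real XGBSurrogate)."""
    hyperparams: dict
    feature_names: list
    label_names: list
    booster_json: str = ""  # placeholder for the underlying model's own JSON dump


class SurrogateEncoder(json.JSONEncoder):
    """Encode ToySurrogate objects as plain, tagged JSON objects."""

    def default(self, o):
        if isinstance(o, ToySurrogate):
            return {
                "__type__": "ToySurrogate",
                "format_version": FORMAT_VERSION,
                "hyperparams": o.hyperparams,
                "feature_names": o.feature_names,
                "label_names": o.label_names,
                "booster_json": o.booster_json,
            }
        # Defer to the base class, which raises TypeError for unknown types.
        return super().default(o)


def surrogate_hook(d):
    """Rebuild a ToySurrogate from a tagged dict; pass other dicts through."""
    if d.get("__type__") == "ToySurrogate":
        if d["format_version"] != FORMAT_VERSION:
            raise ValueError(f"unsupported format version: {d['format_version']}")
        return ToySurrogate(
            hyperparams=d["hyperparams"],
            feature_names=d["feature_names"],
            label_names=d["label_names"],
            booster_json=d["booster_json"],
        )
    return d


def dumps_surrogate(model):
    """Serialize a surrogate to a JSON string."""
    return json.dumps(model, cls=SurrogateEncoder, sort_keys=True)


def loads_surrogate(text):
    """Deserialize a surrogate from a JSON string."""
    return json.loads(text, object_hook=surrogate_hook)


if __name__ == "__main__":
    model = ToySurrogate({"max_depth": 6}, ["N", "W"], ["valid-acc"])
    restored = loads_surrogate(dumps_surrogate(model))
    print(restored == model)
```

Unlike a pickle, the resulting file holds only plain data and an explicit version tag, so loading it does not depend on the class layout or library versions present when it was written.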

NeoChaos12 added the enhancement New feature or request label Jan 13, 2023
