Database format #7

shyuep · 2020-03-10T19:21:18Z

A database is a good idea, but I think we should use something widely supported. We can even support a few options. Any recommendations? The obvious ones are HDF5, JSON, and MySQL. MongoDB is probably too heavyweight, though it can be an option since translating to JSON is easy.

chc273 · 2020-03-10T23:46:36Z

For model saving, we currently use JSON, HDF5, and pickle.

JSON is mainly for saving the configuration parameters, which can be obtained via as_dict or get_params (the sklearn method).
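
To illustrate the JSON part, here is a minimal sketch using only the standard library. `ToyModel` and its `from_dict` helper are hypothetical and are not maml classes; the only assumption taken from the thread is that `as_dict` returns the constructor arguments.

```python
import json


class ToyModel:
    """Hypothetical model used for illustration; not part of maml."""

    def __init__(self, alpha=1.0, n_layers=2):
        self.alpha = alpha
        self.n_layers = n_layers

    def as_dict(self):
        # Only the constructor arguments, i.e. the configuration.
        return {"alpha": self.alpha, "n_layers": self.n_layers}

    @classmethod
    def from_dict(cls, d):
        # Rebuild an unfitted model from its saved configuration.
        return cls(**d)


model = ToyModel(alpha=0.5, n_layers=3)
config_json = json.dumps(model.as_dict(), sort_keys=True)
restored = ToyModel.from_dict(json.loads(config_json))
```

The restored object has the same configuration but no trained state, which is why a second format is needed for the weights.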

pickle and HDF5 are used to save the state of the model. State here means, for example, model weights that are not passed to __init__ but are computed from training data. So far we support two types of models/packages: sklearn and keras. For sklearn models, the official way to save weights is pickle, and sklearn models provide the __getstate__ and __setstate__ API for working with the pickle format. keras/tensorflow deep learning models, on the other hand, use HDF5. If more model types/packages (e.g., lightgbm, xgboost, pytorch) are used, I think pickle and HDF5 can be adapted to work with them. We will find out as we go.
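
As a rough sketch of the configuration/state split on the pickle side, the example below uses a hypothetical `ToyRegressor` in place of a real sklearn estimator. It assumes nothing about maml's actual classes; it only shows that an attribute computed in `fit` survives a pickle round trip through `__getstate__` and `__setstate__`.

```python
import pickle


class ToyRegressor:
    """Hypothetical stand-in for an sklearn-style estimator."""

    def __init__(self, fit_intercept=True):
        self.fit_intercept = fit_intercept  # configuration, known at __init__
        self.mean_ = None                   # state, only known after fit

    def fit(self, y):
        # The "weight" here is just the mean of the training targets.
        self.mean_ = sum(y) / len(y)
        return self

    def predict(self, n):
        return [self.mean_] * n

    def __getstate__(self):
        # pickle calls this to obtain the object's state.
        return self.__dict__.copy()

    def __setstate__(self, state):
        # pickle calls this to restore the state on load.
        self.__dict__.update(state)


fitted = ToyRegressor().fit([1.0, 2.0, 3.0])
blob = pickle.dumps(fitted)
loaded = pickle.loads(blob)
```

Unlike the JSON configuration, the pickled bytes carry the fitted `mean_`, so `loaded` can predict without retraining.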

The database part is meant to ease model training and improve the reproducibility of the models. So far, I think MongoDB is better suited to this task than MySQL, since the data and model results will be highly heterogeneous. We also have prior experience with MongoDB. Unless we find something better, we can use MongoDB for now. This is not a core function of maml, though. It is more of a tool to help curate data, build and save models, and store model results.
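
To make the heterogeneity point concrete, here is a small standard-library sketch. The records and field names are made up for illustration, and no MongoDB calls are used; JSON strings stand in for the documents a document store would hold.

```python
import json

# Hypothetical result records; field names are illustrative only.
records = [
    {"model": "sklearn_ridge", "params": {"alpha": 0.1}, "mae": 0.12},
    {"model": "keras_mlp", "params": {"layers": [64, 32]},
     "history": {"loss": [0.9, 0.4]}},
]

# A fixed-schema table would need a column for every key that appears anywhere.
all_keys = sorted({key for rec in records for key in rec})

# A document store can take each record as-is; JSON shows the equivalent form.
documents = [json.dumps(rec, sort_keys=True) for rec in records]
```

Each record keeps its own shape, including nested fields, which would be awkward to fit into a single relational table.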
