ML Tracking Ops is an MLOps Python library/platform for tracking machine learning projects. It enables users to track individual training runs as well as complete hyperparameter sweeps.
- Exposes an API which enables users to log Machine Learning/Data Science metrics during training
- Enables users to initiate a hyperparameter sweep and log the sweep artifacts
- Enables users to start an interactive web app for visualizing experiment results, in which they can compare different experiments and visualize different metrics
- Enables users to compare different training runs executed within the same hyperparameter sweep
- Simplest form of tracking runs
- Hyperparameter Sweeps
- ML Tracking Ops Web App
- An Important Note
- Licence
Below we can see a PyTorch example of how to track an experiment using ML-Tracking-Ops. ML Tracking Ops is library agnostic, i.e. you do not have to use PyTorch. As long as ExperimentLogger.add_scalar is provided with a simple float, the experiment logging process will work.
from ml_tracking_ops.experiment.logger import ExperimentLogger

...
# Dataset setup, model instantiation etc.
...

writer = ExperimentLogger(logdir="runs")

max_epochs = 10
train_step = 0

for epoch in range(max_epochs):
    print("Epoch:", epoch)

    for x, y_true in dataloader:
        train_step += 1

        optimizer.zero_grad()
        y_pred = model(x)
        loss = loss_fcn(y_pred, y_true)
        loss.backward()
        optimizer.step()

        # The scalar value must be passed as a plain Python 'float'
        writer.add_scalar("Loss", loss.item(), train_step)
When an instance of ExperimentLogger is created, a directory with the name corresponding to the logdir argument is created (if it did not previously exist). Inside this logdir directory, a new directory is created whose name corresponds to the time the ExperimentLogger instance was created. This directory contains the logs of the training run started at the time indicated by the directory name. Each of these directories contains a single .dat file which holds the time-series logs for every metric logged during that particular training run. See the image below for an example.
Each of these folders represents a different training run (possibly after changing some hyperparameters). The logdir directory should be used to group different training runs so they can easily be compared using the ML Tracking Ops web app.
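For illustration only, a logdir holding three separate training runs might look roughly like this (the timestamp format and .dat file names are placeholders, not the library's exact naming):

runs/
├── 2023-01-15_10-24-37/
│   └── metrics.dat
├── 2023-01-15_11-02-11/
│   └── metrics.dat
└── 2023-01-16_09-48-05/
    └── metrics.dat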
ML Tracking Ops enables users to run a hyperparameter sweep for their machine learning pipeline. This is relatively easy to do: all you need is a simple configuration file and an argument parser. After defining those two things, we can start the hyperparameter sweep with a simple command:
ml-tracking-ops --run_sweep=True --logdir=runs
- Passing the logdir argument is not mandatory, since it will default to the string runs.
- When the sweep is started, a directory with the name corresponding to the logdir argument is created (if it did not previously exist). Inside this logdir directory a new directory is created whose name corresponds to the time the sweep was started. This directory contains an experiment_description.json file which is created automatically and describes the configuration of the sweep (this file is used by the web app and SHOULD NOT be deleted).
- Besides this file, a separate .dat file is created for each hyperparameter combination tried. These files contain the time-series logs produced by the ExperimentLogger instances created inside the training script specified in the configuration file, every time a new training run is started. Each file is named by the timestamp at which the training process for its hyperparameter combination was started. (A hypothetical layout of the resulting directory is sketched after this list.)
- On the other hand, specifying --run_sweep=True is necessary, since not passing this argument will result in the value False, which would start the ML Tracking Ops web app instead.
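Putting the points above together, a sweep logdir might look roughly like this (timestamps and file names are hypothetical placeholders):

runs/
└── 2023-01-20_14-05-33/                 (created when the sweep was started)
    ├── experiment_description.json      (sweep configuration, used by the web app)
    ├── 2023-01-20_14-05-40.dat          (logs for the first hyperparameter combination)
    ├── 2023-01-20_14-12-02.dat          (logs for the second hyperparameter combination)
    └── ...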
This configuration file is used to specify:
- Which hyperparameters you wish to explore and how to sample them
- The entry point for training your model
- How many different hyperparameter combinations you wish to try. NOTE: The hyperparameter search is not exhaustive, and is therefore limited by the specified maximum number of training runs.
- Whether to apply early stopping to each of the training runs
- If yes, which metric to pay attention to when trying to optimize the model
- Whether the optimization process is maximizing or minimizing the optimization_metric
This file must be named "experiment_cfg.json"
Below we can see an example of the configuration file. The JSON object keys main_script_name, max_runs, hyperparameters and early_stopping must be present.
{
    "main_script_name": "train_script.py",
    "hyperparameters": {
        "learning_rate": {
            "type": "uniform",
            "min": 1e-5,
            "max": 1e-2
        },
        "batch_size": {
            "type": "choice",
            "candidates": [32, 64, 128]
        },
        "train_steps": {
            "type": "choice",
            "candidates": [700, 850, 1000]
        }
    },
    "max_runs": 100,
    "early_stopping": true,
    "early_stopping_patience": 5,
    "optimization_metric": "Accuracy",
    "optimization_goal": "max"
}
- In the example above we can see that the hyperparameters we wish to explore must be defined in a specific format. Each hyperparameter must have a key type, which can take the value uniform, representing a continuous parameter, or choice, representing a discrete parameter. The other keys (min, max, candidates) are required depending on the hyperparameter type: min and max are required for uniform sampling, while candidates is required for choice sampling. Hyperparameters can have any name the user wants, but these names must match the hyperparameter names expected by the script specified with main_script_name. (A conceptual sketch of the two sampling types is given after this list.)
- We should specify whether we wish to apply the EarlyStopping strategy to each of the training runs. If we set the property early_stopping to true, then we must specify the other properties as well:
  - optimization_metric is the metric we need to track in order to decide whether the EarlyStopping event should occur.
  - early_stopping_patience represents the maximum number of steps (during which the metric was logged) for which the metric specified by the optimization_metric parameter is allowed not to improve. When this threshold is reached, the EarlyStopping event triggers and the training process (for the current hyperparameter combination) terminates.
  - optimization_goal serves as a way to keep track of whether the metric has improved or not. It can take the values max and min, which correspond to maximization and minimization of the optimization_metric, respectively.
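To make the two sampling types concrete, below is a conceptual sketch of how a uniform and a choice hyperparameter could be drawn. This is only an illustration of the semantics described above, not the library's actual sampling code:

import random

# Hypothetical sampling helpers, mirroring the "uniform" and "choice" types from experiment_cfg.json
hyperparameters = {
    "learning_rate": {"type": "uniform", "min": 1e-5, "max": 1e-2},
    "batch_size": {"type": "choice", "candidates": [32, 64, 128]},
}

def sample(spec):
    if spec["type"] == "uniform":
        # Continuous parameter: draw a value between "min" and "max"
        return random.uniform(spec["min"], spec["max"])
    # Discrete parameter: pick one of the listed candidates
    return random.choice(spec["candidates"])

combination = {name: sample(spec) for name, spec in hyperparameters.items()}
print(combination)  # e.g. {'learning_rate': 0.0037, 'batch_size': 64}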
Note
Both the configuration file experiment_cfg.json and the training script specified by main_script_name must be present in the current working directory where the ml-tracking-ops --run_sweep=True --logdir=runs command will be run.
In each training run a hyperparameter combination is sampled according to the previously specified sampling preferences. After this step, the training script specified by main_script_name in the experiment_cfg.json file is started as a separate subprocess, and the sampled hyperparameters are passed to it as command line arguments.
This means that, in order to use the exact sampled values of these hyperparameters, we need an argument parser instance inside our training script. This argument parser needs to accept arguments whose names are equal to the ones defined in the hyperparameters section of experiment_cfg.json.
Below we can see an example of this argument parser, designed to accept the hyperparameters defined in the experiment_cfg.json example above.
from argparse import ArgumentParser

parser = ArgumentParser()

# Having this argument is really important since you need to pass it when creating the ExperimentLogger instance
parser.add_argument("--logdir", type=str, default="runs")

parser.add_argument("--learning_rate", type=float, default=1e-3)
parser.add_argument("--batch_size", type=int, default=32)
parser.add_argument("--train_steps", type=int, default=1000)

# Collect the arguments (sampled hyperparameter values) that were passed when the training script was started
config = parser.parse_args()
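A short sketch of how the parsed values might then be used inside the training script, continuing directly from the parser above (the training loop itself is omitted; only the ExperimentLogger call shown earlier is assumed):

from ml_tracking_ops.experiment.logger import ExperimentLogger

# Pass the received logdir on to the logger, as noted in the comment above
writer = ExperimentLogger(logdir=config.logdir)

# Use the sampled hyperparameter values to configure the run
learning_rate = config.learning_rate
batch_size = config.batch_size
train_steps = config.train_steps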
We can start the ML Tracking Ops web app by running a simple command:
ml-tracking-ops --logdir=runs
The logdir argument represents the directory containing the experiment and sweep logs we would like to observe and analyze.
Passing the logdir argument is optional, since omitting it will default to the string runs, but be aware of this behavior: the runs directory may not contain the logs you are interested in, or may not exist at all!
After running the previous command, the app starts on a local server at 127.0.0.1:5000 or localhost:5000. Visiting either of these two addresses results in an immediate redirect to a page where the different experiment runs are visualized. An example of a page you would see when you start the app is given below.
As we can see on the Experiments tab below, the sidebar contains the list of all experiments present in the specified logdir directory. This does not include logs which correspond to hyperparameter sweeps.
When an experiment is selected, all of the metrics logged in its corresponding log file (directory) are displayed on separate graphs.
We can also select multiple experiments at once and compare them. In this case the graphs for the different experiments are drawn on top of one another, making it easier to compare different training runs. An example of such a case is shown below.
As we can see on the Sweeps tab below, the sidebar contains the list of all hyperparameter sweeps present in the specified logdir directory. This does not include logs which correspond to regular training runs that are not part of a sweep.
Below we can see an example of what this tab can look like.
When a sweep is selected all of the data relevant for that sweep is displayed.
This section describes the content of the experiment_cfg.json file in a structured and visually appealing way. It is created automatically.
This table contains a description of every training run started during the sweep. The description consists of the exact hyperparameter values which correspond to that particular run and the best value of the metric specified in the optimization_metric field. If no value was given for that field, this column will not be present in the table.
This chart shows the selected metric for every run present on the current page of the table. As we can see below, the EarlyStopping event was triggered for some of the runs on the current page.
- Here is a short demo of the Experiments tab in use
experiments_demo.mp4
- Here is a short demo of the Sweeps tab in use
sweeps_demo.mp4
This tool was created as a part of my learning process and therefore is provided "as is".
Use this tool at your own risk.