To install this repo as a Python library, just run the following command:

```shell
pip install git+https://github.com/VolodymyrVozniak/universal-trainer
```
```
torch>=1.13.0
scikit-learn>=1.2.0
pandas>=1.5.2
plotly>=5.7.0
optuna>=2.10.0
```
- For the binary problem, check this tutorial
- For the regression problem, check this tutorial
- For the multiclassification problem, check this tutorial
There are 3 main classes for preprocessing:

- `BinaryPreproc`: for preprocessing binary data;
- `RegressionPreproc`: for preprocessing regression data;
- `MulticlassPreproc`: for preprocessing multiclassification data.
Preprocessing pipeline
[MAIN]
- Prepare data for initializing the preproc class.
  - You will need a dictionary with unique ids as keys (unique ids can be just unique int numbers defined by the `np.arange()` function) and features as values. You can preprocess these features (one entry at a time) in the Dataset class when training the model, but if you want to use the default `CroatoanDataset`, define your features as final lists for each entry in advance.
  - Also, you will need a dictionary with unique ids as keys (these unique ids must match the ids defined for features, meaning the target with a specific unique id corresponds to the features with that same id) and targets as values.
- Prepare targets for training.
- Plot input targets histogram (distribution) and define if we need to reverse targets (for binary problems) or log data and cut tails by quantiles (for regression problems).
- Prepare targets with already defined arguments.
- Plot prepared targets histogram (distribution) to check the difference and correctness.
- Split data.
- Split data with random splitting type.
- For each splitting type you can get main info and plot targets histograms (distributions) for all sets and folds.
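To make the required input format concrete, here is a minimal, library-independent sketch of the two dictionaries described above (toy values standing in for real features and targets; no croatoan_trainer needed):

```python
# Features: one final list of numbers per unique id
# (hypothetical toy values, standing in for real feature vectors).
ids_to_features = {i: [0.1 * i, 0.2 * i, 0.3 * i] for i in range(5)}

# Targets: one value per unique id; keys must match ids_to_features.
ids_to_targets = {i: i % 2 for i in range(5)}

# The two dictionaries must cover exactly the same unique ids.
assert ids_to_features.keys() == ids_to_targets.keys()
```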
[EXTRA]
- Oversampling (for binary and multiclassification problems only).
  - Oversample each class label to reach `min_count` by adding extra ids to `self.split` for train.
- Feature scaling.
- Scale features using scaler from sklearn (fit scaler on train data got from splitting, transform all features using this scaler and save the scaler to class attribute).
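The oversampling step described above simply repeats ids of under-represented classes until each reaches `min_count`; in the library this happens inside the preproc class by adding extra ids to `self.split` for train. A minimal, library-independent sketch of the idea (`oversample_ids` is a hypothetical helper, not part of croatoan_trainer):

```python
from collections import Counter

def oversample_ids(ids, labels, min_count):
    """Repeat ids of under-represented classes until each class reaches min_count."""
    by_class = {}
    for uid, label in zip(ids, labels):
        by_class.setdefault(label, []).append(uid)
    result = list(ids)
    for class_ids in by_class.values():
        k = 0
        while len(class_ids) + k < min_count:
            # Cycle through the class's ids to add duplicates.
            result.append(class_ids[k % len(class_ids)])
            k += 1
    return result

# Class 1 has only one id, so it gets duplicated up to min_count=3.
id_to_label = {0: 0, 1: 0, 2: 0, 3: 1}
train_ids = oversample_ids(list(id_to_label), list(id_to_label.values()), min_count=3)
label_counts = Counter(id_to_label[uid] for uid in train_ids)
```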
Examples
- Binary problem
```python
import numpy as np
from sklearn.datasets import load_breast_cancer

from croatoan_trainer.preprocess import BinaryPreproc

# Load example data
data = load_breast_cancer()
x = data['data']
y = data['target']

# Make dict with unique ids as keys and features as values
ids_to_features = dict(zip(np.arange(len(y)), x))

# Make dict with unique ids as keys and targets as values
ids_to_targets = dict(zip(np.arange(len(y)), y))

# Initialize preproc class
preproc = BinaryPreproc(ids_to_features, ids_to_targets)

# Plot input targets histogram
preproc.plot_targets(prepared=False)

# Define if we need to reverse our targets
preproc.prepare_targets(reverse=True)

# Plot prepared targets histogram
preproc.plot_targets(prepared=True)

# Split data
preproc.random_split(
    test_size=0.2,
    n_folds=5,
    val_size=None,
    seed=51983
)

# Plot input targets histograms
preproc.plot_split_targets(prepared=False)

# Plot prepared targets histograms
preproc.plot_split_targets(prepared=True)

# Get info about splitting
split_info = preproc.get_split_info()

# Scale features
preproc.scale_features("Standard")
```

For more details, check this tutorial.
- Regression problem
```python
import numpy as np
from sklearn.datasets import load_diabetes

from croatoan_trainer.preprocess import RegressionPreproc

# Load example data
data = load_diabetes()
x = data['data']
y = data['target']

# Make dict with unique ids as keys and features as values
ids_to_features = dict(zip(np.arange(len(y)), x))

# Make dict with unique ids as keys and targets as values
ids_to_targets = dict(zip(np.arange(len(y)), y))

# Initialize preproc class
preproc = RegressionPreproc(ids_to_features, ids_to_targets)

# Plot input targets histogram
preproc.plot_targets(prepared=False)

# Prepare targets (no log transform, no quantile tail cutting here)
preproc.prepare_targets(log=False, quantiles=None)

# Plot prepared targets histogram
preproc.plot_targets(prepared=True)

# Split data
preproc.random_split(
    test_size=0.2,
    n_folds=5,
    val_size=None,
    seed=51983
)

# Plot input targets histograms
preproc.plot_split_targets(prepared=False)

# Plot prepared targets histograms
preproc.plot_split_targets(prepared=True)

# Get info about splitting
split_info = preproc.get_split_info()
```

For more details, check this tutorial.
- Multiclassification problem
```python
import numpy as np
from sklearn.datasets import load_iris

from croatoan_trainer.preprocess import MulticlassPreproc

# Load example data
data = load_iris()
x = data['data']
y = data['target']

# Make dict with unique ids as keys and features as values
ids_to_features = dict(zip(np.arange(len(y)), x))

# Make dict with unique ids as keys and targets as values
ids_to_targets = dict(zip(np.arange(len(y)), y))

# Initialize preproc class
preproc = MulticlassPreproc(ids_to_features, ids_to_targets)

# Plot input targets histogram
preproc.plot_targets(prepared=False)

# Prepare targets
preproc.prepare_targets()

# Plot prepared targets histogram
preproc.plot_targets(prepared=True)

# Split data
preproc.random_split(
    test_size=0.2,
    n_folds=5,
    val_size=None,
    seed=51983
)

# Plot input targets histograms
preproc.plot_split_targets(prepared=False)

# Plot prepared targets histograms
preproc.plot_split_targets(prepared=True)

# Get info about splitting
split_info = preproc.get_split_info()
```

For more details, check this tutorial.
There is 1 main class for training:

- `Trainer`: for training binary, regression, or multiclassification problems.
Training pipeline
- Trains in CV mode: trains the model on the train set of each fold and checks its performance on that fold's val set for the passed number of epochs, then gets the average performance on each epoch by averaging the scores over all folds, chooses the best epoch, and saves all results (losses, metrics on each epoch for train and val sets, best result, training time, unique ids, true values and predicted values on each epoch for the val set). Results for each fold are also saved.
- Trains in test mode: trains the model on the train set and checks its performance on the test set with the number of epochs chosen at the CV stage, and saves all results (losses, metrics on each epoch for train and test sets, best result, training time, unique ids, true values and predicted values on each epoch for the test set).
- Trains in final mode: trains the model on all data with the number of epochs chosen at the CV stage, and saves all results (losses, metrics on each epoch for train and test sets, best result, training time, unique ids, true values and predicted values on each epoch for the test set). Here train and test are the same: all possible data. The metrics can still differ, because the train set is always shuffled, while the test set isn't. You can skip this step by passing `include_final=False` when calling the `train()` method.
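The best-epoch selection in the CV stage can be sketched as follows: average each epoch's validation score over the folds, then pick the epoch with the best average (here for a metric that is maximized):

```python
# Validation scores per fold, one value per epoch (toy numbers).
fold_scores = [
    [0.60, 0.70, 0.72],  # fold 0
    [0.58, 0.74, 0.70],  # fold 1
    [0.62, 0.72, 0.68],  # fold 2
]

# Average the folds' scores epoch by epoch.
mean_per_epoch = [sum(scores) / len(scores) for scores in zip(*fold_scores)]

# For direction="maximize", the best epoch has the highest average score.
best_epoch = max(range(len(mean_per_epoch)), key=mean_per_epoch.__getitem__)
```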
Examples
- Binary problem
```python
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam

from croatoan_trainer.train import Trainer
from croatoan_trainer.train.dataset import CroatoanDataset
from croatoan_trainer.train.model import BinarySimpleMLP
from croatoan_trainer.train.metrics import get_metrics_binary

trainer = Trainer(
    preprocessed_data=preproc,
    dataset_class=CroatoanDataset,
    loader_class=DataLoader,
    model_class=BinarySimpleMLP,
    optimizer_class=Adam,
    criterion=torch.nn.BCELoss(),
    get_metrics=get_metrics_binary,
    main_metric="f1",
    direction="maximize",
    include_compile=False
)

params = {
    "model": {
        "in_features": x.shape[1],
        "hidden_features": 20,
        "dropout": 0.25
    },
    "optimizer": {
        "lr": 1e-3,
        "weight_decay": 5e-5
    },
    "batch_size": 32,
}

results, model_weights = trainer.train(
    params=params,
    epochs=100,
    include_final=True,
    include_epochs_pred=True
)
```

For more details, check this tutorial.
- Regression problem
```python
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam

from croatoan_trainer.train import Trainer
from croatoan_trainer.train.dataset import CroatoanDataset
from croatoan_trainer.train.model import RegressionSimpleMLP
from croatoan_trainer.train.metrics import get_metrics_regression

trainer = Trainer(
    preprocessed_data=preproc,
    dataset_class=CroatoanDataset,
    loader_class=DataLoader,
    model_class=RegressionSimpleMLP,
    optimizer_class=Adam,
    criterion=torch.nn.MSELoss(),
    get_metrics=get_metrics_regression,
    main_metric="r2",
    direction="maximize",
    include_compile=False
)

params = {
    "model": {
        "in_features": x.shape[1],
        "hidden_features": 20,
        "dropout": 0.25
    },
    "optimizer": {
        "lr": 1e-3,
        "weight_decay": 5e-5
    },
    "batch_size": 32,
}

results, model_weights = trainer.train(
    params=params,
    epochs=100,
    include_final=True,
    include_epochs_pred=True
)
```

For more details, check this tutorial.
- Multiclassification problem
```python
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam

from croatoan_trainer.train import Trainer
from croatoan_trainer.train.dataset import CroatoanDataset
from croatoan_trainer.train.model import MulticlassSimpleMLP
from croatoan_trainer.train.metrics import get_metrics_multiclass

trainer = Trainer(
    preprocessed_data=preproc,
    dataset_class=CroatoanDataset,
    loader_class=DataLoader,
    model_class=MulticlassSimpleMLP,
    optimizer_class=Adam,
    criterion=torch.nn.CrossEntropyLoss(),
    get_metrics=get_metrics_multiclass,
    main_metric="f1",
    direction="maximize",
    include_compile=False
)

params = {
    "model": {
        "in_features": x.shape[1],
        "hidden_features": 20,
        "output_features": 3,
        "dropout": 0.25
    },
    "optimizer": {
        "lr": 1e-3,
        "weight_decay": 5e-5
    },
    "batch_size": 32,
}

results, model_weights = trainer.train(
    params=params,
    epochs=100,
    include_final=True,
    include_epochs_pred=True
)
```

For more details, check this tutorial.
There are 3 main classes for defining a tuner:

- `TPETuner`: for tuning parameters using the TPE (Tree-structured Parzen Estimator) algorithm (the optuna default). For more details check this link;
- `RandomTuner`: for tuning parameters using random sampling. For more details check this link;
- `GridTuner`: for tuning parameters using grid search. For more details check this link.
Tuning pipeline
- Initialize the tuner class with params for tuning.
- Tune parameters using the trainer class and the `tune()` method.
- Get the best params and pass them to the `train()` method.
Examples
- Binary problem + TPETuner (with function params)
```python
import optuna
import torch
import torch.nn as nn
from torch.utils.data import DataLoader
from torch.optim import Adam

from croatoan_trainer.tune import TPETuner
from croatoan_trainer.train import Trainer
from croatoan_trainer.train.dataset import CroatoanDataset
from croatoan_trainer.train.metrics import get_metrics_binary


class CustomModel(nn.Module):
    def __init__(self, **kwargs):
        super(CustomModel, self).__init__()
        in_features = kwargs["in_features"]
        fc_layers = [nn.BatchNorm1d(in_features)]
        activation = getattr(nn, kwargs["activation"])()
        for i in range(kwargs["n_layers"]):
            out_features = kwargs[f"n_units_l{i}"]
            fc_layers.append(nn.Linear(in_features, out_features))
            fc_layers.append(activation)
            fc_layers.append(nn.Dropout(kwargs[f"dropout_l{i}"]))
            in_features = out_features
        fc_layers.append(nn.Linear(in_features, 1))
        fc_layers.append(nn.Sigmoid())
        self.layers = nn.Sequential(*fc_layers)

    def forward(self, data: torch.Tensor) -> torch.Tensor:
        return self.layers(data).reshape(-1)


def get_tune_params(trial: optuna.trial.Trial):
    model_params, optimizer_params = {}, {}
    model_params['in_features'] = 512  # some constant value
    model_params['activation'] = trial.suggest_categorical(
        'activation', ['ReLU', 'GELU', 'ELU', 'LeakyReLU']
    )
    n_layers = trial.suggest_int('n_layers', 2, 4)
    model_params['n_layers'] = n_layers
    for i in range(n_layers):
        n_units = trial.suggest_categorical(
            f'n_units_l{i}', (512, 1024, 2048)
        )
        dropout = trial.suggest_float(
            f'dropout_l{i}', 0.1, 0.5
        )
        model_params[f'n_units_l{i}'] = n_units
        model_params[f'dropout_l{i}'] = dropout
    optimizer_params['lr'] = trial.suggest_float(
        'lr', 1e-5, 1e-1, log=True
    )
    optimizer_params['weight_decay'] = trial.suggest_float(
        'weight_decay', 5e-5, 5e-3, log=True
    )
    batch_size = trial.suggest_categorical(
        "batch_size", (16, 32, 64, 128, 256, 512, 1024, 2048)
    )
    return model_params, optimizer_params, batch_size


tuner = TPETuner(
    params=get_tune_params,
    storage=None,
    study_name="binary",
    direction="maximize",
    load_if_exists=False
)

trainer = Trainer(
    preprocessed_data=preproc,
    dataset_class=CroatoanDataset,
    loader_class=DataLoader,
    model_class=CustomModel,
    optimizer_class=Adam,
    criterion=torch.nn.BCELoss(),
    get_metrics=get_metrics_binary,
    main_metric="f1",
    direction="maximize",
    include_compile=False
)

params = trainer.tune(
    tuner=tuner,
    epochs=20,
    n_trials=2,
    timeout=None,
    catch=(),
    early_stopping_rounds=None
)

results, model_weights = trainer.train(
    params=params,
    epochs=100,
    include_final=True,
    include_epochs_pred=True
)
```

For more details, check this tutorial.
- Regression problem + RandomTuner (with dict params)
```python
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam

from croatoan_trainer.tune import RandomTuner
from croatoan_trainer.train import Trainer
from croatoan_trainer.train.dataset import CroatoanDataset
from croatoan_trainer.train.model import RegressionSimpleMLP
from croatoan_trainer.train.metrics import get_metrics_regression

tune_params = {
    "model": {
        "in_features": ("constant", x.shape[1]),
        "hidden_features": ("int", (18, 22, 2, False)),
        "dropout": ("float", (0.1, 0.5, False))
    },
    "optimizer": {
        "lr": ("constant", 1e-3),
        "weight_decay": ("constant", 5e-5)
    },
    "batch_size": ("categorical", (32, 64))
}

tuner = RandomTuner(
    params=tune_params,
    storage=None,
    study_name="regression",
    direction="minimize",
    load_if_exists=False
)

trainer = Trainer(
    preprocessed_data=preproc,
    dataset_class=CroatoanDataset,
    loader_class=DataLoader,
    model_class=RegressionSimpleMLP,
    optimizer_class=Adam,
    criterion=torch.nn.MSELoss(),
    get_metrics=get_metrics_regression,
    main_metric="mae",
    direction="minimize",
    include_compile=False
)

params = trainer.tune(
    tuner=tuner,
    epochs=20,
    n_trials=2,
    timeout=None,
    catch=(),
    early_stopping_rounds=None
)

results, model_weights = trainer.train(
    params=params,
    epochs=100,
    include_final=True,
    include_epochs_pred=True
)
```

For more details, check this tutorial.
- Multiclassification problem + GridTuner (with dict params)
```python
import torch
from torch.utils.data import DataLoader
from torch.optim import Adam

from croatoan_trainer.tune import GridTuner
from croatoan_trainer.train import Trainer
from croatoan_trainer.train.dataset import CroatoanDataset
from croatoan_trainer.train.model import MulticlassSimpleMLP
from croatoan_trainer.train.metrics import get_metrics_multiclass

tune_params = {
    "model": {
        "in_features": ("constant", x.shape[1]),
        "hidden_features": ("categorical", (18, 20, 22)),
        "output_features": ("constant", 3),
        "dropout": ("categorical", (0.1, 0.25, 0.5)),
    },
    "optimizer": {
        "lr": ("constant", 1e-3),
        "weight_decay": ("constant", 5e-5)
    },
    "batch_size": ("categorical", (32, 64))
}

tuner = GridTuner(
    params=tune_params,
    storage=None,
    study_name="multiclass",
    direction="maximize",
    load_if_exists=False
)

trainer = Trainer(
    preprocessed_data=preproc,
    dataset_class=CroatoanDataset,
    loader_class=DataLoader,
    model_class=MulticlassSimpleMLP,
    optimizer_class=Adam,
    criterion=torch.nn.CrossEntropyLoss(),
    get_metrics=get_metrics_multiclass,
    main_metric="f1",
    direction="maximize",
    include_compile=False
)

params = trainer.tune(
    tuner=tuner,
    epochs=20,
    n_trials=2,
    timeout=None,
    catch=(),
    early_stopping_rounds=None
)

results, model_weights = trainer.train(
    params=params,
    epochs=100,
    include_final=True,
    include_epochs_pred=True
)
```

For more details, check this tutorial.
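The dict-params format above can be read as a cartesian product: each `categorical` entry contributes its listed values, while `constant` entries are fixed. A library-independent sketch of that expansion (GridTuner itself delegates the search to optuna):

```python
from itertools import product

# A hypothetical, trimmed-down spec in the same (kind, values) tuple format.
spec = {
    "hidden_features": ("categorical", (18, 20, 22)),
    "dropout": ("categorical", (0.1, 0.25, 0.5)),
    "lr": ("constant", 1e-3),
}

# Constants contribute one option; categoricals contribute each listed value.
options = {
    name: values if kind == "categorical" else (values,)
    for name, (kind, values) in spec.items()
}

# Full grid: 3 * 3 * 1 = 9 parameter combinations.
grid = [dict(zip(options, combo)) for combo in product(*options.values())]
```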
There are 3 main classes for analyzing:

- `BinaryAnalyzer`: for analyzing binary problem's training results;
- `RegressionAnalyzer`: for analyzing regression problem's training results;
- `MulticlassAnalyzer`: for analyzing multiclassification problem's training results.
Analyzing pipeline
- Initialize the analyze class with the results got after training.
- Plot different charts and analyze results.
- Get a dataframe with final metrics for each stage (`cv`, `test` or `final`).
REMINDER! The main stage is always `test`, not `final` (`test` is how your model performs on data that it didn't see; `final` is how your model performs on data that it used for training).
Examples
- Binary problem
```python
from croatoan_trainer.analyze import BinaryAnalyzer

analyzer = BinaryAnalyzer(results)

analyzer.plot_all("cv")
analyzer.plot_all("test")
analyzer.plot_all("final")

metrics = analyzer.get_df_metrics()
```

For more details, check this tutorial.
- Regression problem
```python
from croatoan_trainer.analyze import RegressionAnalyzer

analyzer = RegressionAnalyzer(results)

analyzer.plot_all("cv")
analyzer.plot_all("test")
analyzer.plot_all("final")

metrics = analyzer.get_df_metrics()
```

For more details, check this tutorial.
- Multiclassification problem
```python
import numpy as np

from croatoan_trainer.analyze import MulticlassAnalyzer

def postprocess_fn(model_output):
    return np.argmax(model_output, axis=1)

analyzer = MulticlassAnalyzer(results, postprocess_fn)

analyzer.plot_all("cv")
analyzer.plot_all("test")
analyzer.plot_all("final")

metrics = analyzer.get_df_metrics()
```

For more details, check this tutorial.