Auto-sklearn baseline v1 #32

Open · wants to merge 66 commits into base: main

Commits (66)
107714e
create requirements.txt
MorrisNein Feb 26, 2023
e67bde8
move to FEDOT 0.7.0
MorrisNein Feb 26, 2023
2458654
create Dockerfile
MorrisNein Feb 26, 2023
e8fee30
prepare experiment demo
MorrisNein Feb 26, 2023
5fb00f0
adapt to FEDOT 0.7.0 again
MorrisNein Mar 3, 2023
310a578
fix similarity assessors
MorrisNein Mar 3, 2023
ae9c909
allow PymfeExtractor fill values with median
MorrisNein Mar 4, 2023
60dc77a
use FEDOT version with fixed initial assumptions
MorrisNein Mar 24, 2023
cf25066
optional cache usage for MFE extractor
MorrisNein Mar 30, 2023
a5a0c8a
allow to advise only the n best models
MorrisNein Mar 30, 2023
3bfaf50
finalize experiment
MorrisNein Mar 30, 2023
75ea275
finalize experiment [2]
MorrisNein Apr 7, 2023
8f29cf7
wrap & log exceptions; log progress to file
MorrisNein Apr 8, 2023
168a4dd
update requirements.txt
MorrisNein Apr 8, 2023
1ae8511
update timeouts
MorrisNein Apr 8, 2023
cc71c47
remove GOLEM from requirements.txt to inherit version required by FEDOT
MorrisNein Apr 18, 2023
1e7be91
clean openml cache
MorrisNein Apr 18, 2023
a10174c
update Dockerfile
MorrisNein Apr 18, 2023
a309eef
make experiment safer
MorrisNein Apr 20, 2023
066cd3e
add .dockerignore
MorrisNein Apr 20, 2023
69b4915
fix save path
MorrisNein Apr 20, 2023
b490f05
Making code more reusable and qualitative
AxiomAlive May 2, 2023
e7e4bf8
Adding auto-sklearn run script with an example
AxiomAlive May 2, 2023
1dead9c
Merge branch 'dont_download_cached_dataset_qualities' of github.com:I…
AxiomAlive May 2, 2023
7f74e70
move to FEDOT 0.7.0
MorrisNein Feb 26, 2023
94e0afa
create Dockerfile
MorrisNein Feb 26, 2023
d24247f
prepare experiment demo
MorrisNein Feb 26, 2023
c4b3f91
fix similarity assessors
MorrisNein Mar 3, 2023
e0661f3
allow PymfeExtractor fill values with median
MorrisNein Mar 4, 2023
4f10b03
use FEDOT version with fixed initial assumptions
MorrisNein Mar 24, 2023
a78be30
optional cache usage for MFE extractor
MorrisNein Mar 30, 2023
9bf6d97
allow to advise only the n best models
MorrisNein Mar 30, 2023
fdee481
finalize experiment
MorrisNein Mar 30, 2023
169ab3e
finalize experiment [2]
MorrisNein Apr 7, 2023
1270d80
wrap & log exceptions; log progress to file
MorrisNein Apr 8, 2023
a796ea7
update timeouts
MorrisNein Apr 8, 2023
8665204
remove GOLEM from requirements.txt to inherit version required by FEDOT
MorrisNein Apr 18, 2023
0f5ac53
clean openml cache
MorrisNein Apr 18, 2023
6eddbb1
update Dockerfile
MorrisNein Apr 18, 2023
d8bd536
make experiment safer
MorrisNein Apr 20, 2023
36c1d01
add .dockerignore
MorrisNein Apr 20, 2023
29b8cb9
fix save path
MorrisNein Apr 20, 2023
e7b7861
Merging remote
AxiomAlive May 15, 2023
fbe04ea
Resolving conflict
AxiomAlive May 15, 2023
ac060ee
add logging in PymfeExtractor
MorrisNein May 17, 2023
7c42e79
add intelligent datasets train/test split
MorrisNein May 31, 2023
cb11a3c
Refactor data storage (#15)
MorrisNein Jun 30, 2023
0b9ed49
Auto-sklearn baseline in a progress
AxiomAlive Jul 3, 2023
42e343b
WIP: auto-sklearn baseline
AxiomAlive Jul 3, 2023
6c5e4b8
examples/4_advising_models conflict resolving
AxiomAlive Jul 3, 2023
26d57b8
Implemented Auto-sklearn baseline.
AxiomAlive Jul 5, 2023
5c10658
fix inner components
MorrisNein Jul 6, 2023
e2c1b89
separate framework cache from other data
MorrisNein Jul 6, 2023
20fb439
use yaml config for the experiment
MorrisNein Jul 6, 2023
d4d50ce
refactor run.py
MorrisNein Jul 6, 2023
e581c9e
update requirements
MorrisNein Jul 7, 2023
2f8b409
Removing IDE configuration files.
AxiomAlive Jul 8, 2023
fc105d2
Conflict resolving
AxiomAlive Jul 8, 2023
67812b7
make absolute path to config.yaml
MorrisNein Jul 16, 2023
4a0b144
fix train test split
MorrisNein Jul 16, 2023
44857b0
refactor for frequent results saving
MorrisNein Jul 16, 2023
68a2443
fix logging
MorrisNein Jul 16, 2023
b4c714f
Adding an AutoML baseline class
AxiomAlive Jul 19, 2023
645a98f
Reflecting API changes in an asklearn baseline
AxiomAlive Jul 19, 2023
6b91ca9
Merge pull request #37 from ITMO-NSS-team/automl_baseline
AxiomAlive Jul 19, 2023
359a4ce
Merge branch 'docker_and_experiments' into auto-sklearn_baseline
AxiomAlive Jul 19, 2023
13 changes: 13 additions & 0 deletions .dockerignore
@@ -0,0 +1,13 @@
# Config & info files
.pep8speaks.yml
Dockerfile
LICENSE
README.md

# Unnecessary files
examples
notebooks
test

# User data
data/cache
4 changes: 3 additions & 1 deletion .gitignore
@@ -1,3 +1,5 @@
.idea

# Byte-compiled / optimized / DLL files
__pycache__/
*.py[cod]
@@ -129,4 +131,4 @@ dmypy.json
.pyre/

# User data
data/
/data/cache
30 changes: 30 additions & 0 deletions Dockerfile
@@ -0,0 +1,30 @@
# Download base image ubuntu 20.04
FROM ubuntu:20.04

# For apt to be noninteractive
ENV DEBIAN_FRONTEND noninteractive
ENV DEBCONF_NONINTERACTIVE_SEEN true

# Preseed tzdata, update package index, upgrade packages and install needed software
RUN truncate -s0 /tmp/preseed.cfg; \
echo "tzdata tzdata/Areas select Europe" >> /tmp/preseed.cfg; \
echo "tzdata tzdata/Zones/Europe select Berlin" >> /tmp/preseed.cfg; \
debconf-set-selections /tmp/preseed.cfg && \
rm -f /etc/timezone /etc/localtime && \
apt-get update && \
apt-get install -y nano && \
apt-get install -y mc && \
apt-get install -y python3.9 python3-pip && \
apt-get install -y git && \
rm -rf /var/lib/apt/lists/*

# Set the workdir
ENV WORKDIR /home/meta-automl-research
WORKDIR $WORKDIR
COPY . $WORKDIR

RUN pip3 install --upgrade pip && \
pip install wheel && \
pip install --trusted-host pypi.python.org -r ${WORKDIR}/requirements.txt

ENV PYTHONPATH $WORKDIR
Empty file added baselines/__init__.py
Empty file.
Empty file.
166 changes: 166 additions & 0 deletions baselines/auto-sklearn/auto-sklearn_baseline.py
@@ -0,0 +1,166 @@
import csv
import time

from typing import Any, Tuple, Dict

import numpy as np
import logging

import autosklearn.classification
import autosklearn.ensembles

from sklearn import model_selection, metrics

from baselines.automl_baseline import AutoMLBaseline
from meta_automl.data_preparation.datasets_loaders import OpenMLDatasetsLoader
from meta_automl.data_preparation.models_loaders import KnowledgeBaseModelsLoader
from autosklearn.classification import AutoSklearnClassifier


class AutoSklearnBaseline(AutoMLBaseline):
def __init__(self, ensemble_type, time_limit):
self.estimator = AutoSklearnClassifier(
ensemble_class=ensemble_type,
time_left_for_this_task=time_limit,
)
self.knowledge_base_loader = KnowledgeBaseModelsLoader()

@staticmethod
def make_quality_metric_estimates(y, predictions, prediction_proba, is_multi_label):
""" Compute roc_auc, f1, accuracy, log_loss and precision scores. """
results = {
'roc_auc': -1 * float(
"{:.3f}".format(
metrics.roc_auc_score(
y,
prediction_proba if is_multi_label else predictions,
multi_class='ovr'
)
)
),
'f1': -1 * float(
"{:.3f}".format(
metrics.f1_score(
y,
predictions,
average='macro' if is_multi_label else 'binary'
)
)
),
'accuracy': -1 * float(
"{:.3f}".format(
metrics.accuracy_score(
y,
predictions
)
)
),
'logloss': float(
"{:.3f}".format(
metrics.log_loss(
y,
prediction_proba if is_multi_label else predictions
)
)
),
'precision': -1 * float(
"{:.3f}".format(
metrics.precision_score(
y,
predictions,
average='macro' if is_multi_label else 'binary',
labels=np.unique(predictions)
)
)
)
}
return results

def run(self):
""" Fit auto-sklearn meta-optimizer to knowledge base datasets and output a single best model. """
dataset_ids_to_load = [
dataset_id for dataset_id in self.knowledge_base_loader
.parse_datasets('test')
.loc[:, 'dataset_id']
]
# dataset_ids_to_load = [dataset_ids_to_load[dataset_ids_to_load.index(41166)]]
Collaborator review comment: This commented-out line can be removed.


loaded_datasets = OpenMLDatasetsLoader().load(dataset_ids_to_load)

for iteration, dataset in enumerate(loaded_datasets):
logging.log(logging.INFO, f"Loaded dataset name: {dataset.name}")
dataset_data = dataset.get_data()

X_train, X_test, y_train, y_test = model_selection.train_test_split(
dataset_data.x,
dataset_data.y,
test_size=0.2,
random_state=42,
stratify=dataset_data.y
)

fitting_start_time = time.time()
ensemble = self.estimator.fit(X_train, y_train)
fitting_time = time.time() - fitting_start_time
logging.log(logging.INFO, f"Fitting time is {fitting_time}sec")

inference_start_time = time.time()
predicted_results = self.estimator.predict(X_test)
inference_time = time.time() - inference_start_time
logging.log(logging.INFO, f"Inference time is {inference_time}sec")

predicted_probabilities = self.estimator.predict_proba(X_test)

best_single_model = list(ensemble.show_models().values())[0].get('sklearn_classifier')

# autosklearn_ensemble = pipeline.show_models()
# formatted_ensemble = {
# model_id: {
# 'rank': autosklearn_ensemble[model_id].get('rank'),
# 'cost': float(f"{autosklearn_ensemble[model_id].get('cost'):.3f}"),
# 'ensemble_weight': autosklearn_ensemble[model_id].get('ensemble_weight'),
# 'model': autosklearn_ensemble[model_id].get('sklearn_classifier')
# } for model_id in autosklearn_ensemble.keys()
# }
Collaborator review comment on lines +116 to +124: Commented-out code is better left out of git unless it helps in understanding the rest of the code. You can stash these changes if you need to keep them.

The same applies to the rest of the commented-out code.


general_run_info = {
'dataset_id': dataset.id_,
'dataset_name': dataset.name,
'run_label': 'Auto-sklearn',
}

is_multilabel_classification = len(set(predicted_results)) > 2
quality_metric_estimates = AutoSklearnBaseline.make_quality_metric_estimates(
y_test,
predicted_results,
predicted_probabilities,
is_multilabel_classification
)

model_dependent_run_info = {
'fit_time': float(f'{fitting_time:.1f}'),
'inference_time': float(f'{inference_time:.1f}'),
'model_str': repr(best_single_model)
}

results = {**general_run_info, **quality_metric_estimates, **model_dependent_run_info}

# for key in autosklearn_ensemble.keys():
# ensemble_model = autosklearn_ensemble[key]
# formatted_ensemble = results['ensemble']
# for model_id in formatted_ensemble.keys():
# formatted_ensemble[model_id] = ensemble_model.get("rank", None)

AutoSklearnBaseline.save_on_disk(results.values())

return results

@staticmethod
def save_on_disk(data):
with open('data/experimental_data.csv', 'a', newline='') as file:
writer = csv.writer(file, delimiter=',')
writer.writerow(data)


if __name__ == '__main__':
AutoSklearnBaseline(autosklearn.ensembles.SingleBest, 600).run()
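
For context, here is a minimal sketch (not part of the PR) of the sign convention produced by make_quality_metric_estimates: every score except logloss is negated, so all result columns can be compared as lower-is-better. Direct use of the class is an assumption — the hyphenated file name baselines/auto-sklearn/auto-sklearn_baseline.py is not importable as a module and would need importlib or a rename.

import numpy as np

# Hypothetical usage; assumes AutoSklearnBaseline has been made importable.
y_true = np.array([0, 1, 1, 0])
y_pred = np.array([0, 1, 0, 0])
proba = np.array([[0.9, 0.1], [0.2, 0.8], [0.6, 0.4], [0.7, 0.3]])
scores = AutoSklearnBaseline.make_quality_metric_estimates(
    y_true, y_pred, proba, is_multi_label=False)
print(scores)  # e.g. accuracy -> -0.75, f1 -> -0.667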
57 changes: 57 additions & 0 deletions baselines/auto-sklearn/data/experimental_data.csv
@@ -0,0 +1,57 @@
1461,bank-marketing,Auto-sklearn,-0.711,-0.535,-0.907,3.34,-0.648,598.0,0.1,"HistGradientBoostingClassifier(early_stopping=True,
l2_regularization=1.7108930238344161e-10,
learning_rate=0.010827728124541558, loss='auto',
max_iter=512, max_leaf_nodes=25,
min_samples_leaf=4, n_iter_no_change=19,
random_state=1,
validation_fraction=0.1759114608225653,
warm_start=True)"
179,adult,Auto-sklearn,-0.774,-0.91,-0.859,5.077,-0.885,595.3,0.1,"HistGradientBoostingClassifier(early_stopping=True,
l2_regularization=1.7108930238344161e-10,
learning_rate=0.010827728124541558, loss='auto',
max_iter=512, max_leaf_nodes=25,
min_samples_leaf=4, n_iter_no_change=19,
random_state=1,
validation_fraction=0.1759114608225653,
warm_start=True)"
1464,blood-transfusion-service-center,Auto-sklearn,-0.669,-0.5,-0.8,7.209,-0.625,597.6,0.0,"PassiveAggressiveClassifier(C=0.253246830865058, average=True, max_iter=16,
random_state=1, tol=0.01676578241454229,
warm_start=True)"
991,car,Auto-sklearn,-1.0,-1.0,-1.0,0.0,-1.0,596.8,0.0,"HistGradientBoostingClassifier(early_stopping=True,
l2_regularization=1.9280388598217333e-10,
learning_rate=0.24233932723531437, loss='auto',
max_iter=128, max_leaf_nodes=35,
min_samples_leaf=17, n_iter_no_change=1,
random_state=1, validation_fraction=None,
warm_start=True)"
1489,phoneme,Auto-sklearn,-0.848,-0.797,-0.887,4.068,-0.845,600.4,0.1,"AdaBoostClassifier(algorithm='SAMME',
base_estimator=DecisionTreeClassifier(max_depth=10),
learning_rate=1.1377640450285444, n_estimators=352,
random_state=1)"
41027,jungle_chess_2pcs_raw_endgame_complete,Auto-sklearn,-0.975,-0.816,-0.865,0.271,-0.824,595.1,0.2,"HistGradientBoostingClassifier(early_stopping=True,
l2_regularization=9.674948183980905e-09,
learning_rate=0.014247987845444413, loss='auto',
max_iter=512, max_leaf_nodes=55,
min_samples_leaf=164, n_iter_no_change=1,
random_state=1,
validation_fraction=0.11770489601182355,
warm_start=True)"
41166,volkert,Auto-sklearn,-0.874,-0.586,-0.644,1.829,-0.587,595.8,0.3,"LinearDiscriminantAnalysis(shrinkage='auto', solver='lsqr',
tol=0.018821286956948503)"
54,vehicle,Auto-sklearn,-0.964,-0.86,-0.859,0.408,-0.861,595.5,0.0,"MLPClassifier(activation='tanh', alpha=0.0002060405669905105, beta_1=0.999,
beta_2=0.9, hidden_layer_sizes=(87, 87, 87),
learning_rate_init=0.00040205833939989724, max_iter=256,
n_iter_no_change=32, random_state=1, validation_fraction=0.0,
verbose=0, warm_start=True)"
40996,fashion-mnist,Auto-sklearn,-0.968,-0.864,-0.865,1.913,-0.866,296.1,1.2,"KNeighborsClassifier(n_neighbors=4, weights='distance')"
40996,fashion-mnist,Auto-sklearn,-0.968,-0.864,-0.865,1.913,-0.866,595.5,0.8,"KNeighborsClassifier(n_neighbors=4, weights='distance')"
42344,sf-police-incidents,Auto-sklearn,-0.574,-0.589,-0.574,15.367,-0.569,594.8,0.5,"HistGradientBoostingClassifier(early_stopping=True,
l2_regularization=3.609412172481434e-10,
learning_rate=0.05972079854295879, loss='auto',
max_iter=512, max_leaf_nodes=4,
min_samples_leaf=2, n_iter_no_change=14,
random_state=1, validation_fraction=None,
warm_start=True)"
1240,airlinescodrnaadult,Auto-sklearn,-0.62,-0.683,-0.631,13.306,-0.658,594.3,0.1,"SGDClassifier(alpha=1.6992296128865824e-07, average=True, eta0=0.01, loss='log',
max_iter=512, penalty='l1', random_state=1,
tol=1.535384699341134e-05, warm_start=True)"
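
The CSV above is written without a header row. Judging from the order in which the results dict is assembled in auto-sklearn_baseline.py, the columns should be dataset_id, dataset_name, run_label, roc_auc, f1, accuracy, logloss, precision, fit_time, inference_time and model_str. A loading sketch (the column names are inferred, not stated in the PR):

import pandas as pd

columns = ['dataset_id', 'dataset_name', 'run_label',
           'roc_auc', 'f1', 'accuracy', 'logloss', 'precision',
           'fit_time', 'inference_time', 'model_str']
# Quoted multi-line model strings are handled by the default CSV parser.
df = pd.read_csv('baselines/auto-sklearn/data/experimental_data.csv',
                 header=None, names=columns)
print(df[['dataset_name', 'roc_auc', 'fit_time']])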
11 changes: 11 additions & 0 deletions baselines/automl_baseline.py
@@ -0,0 +1,11 @@
from abc import ABC


class AutoMLBaseline(ABC):
def run(self):
raise NotImplementedError

@staticmethod
def save_on_disk(data):
raise NotImplementedError
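
A minimal sketch of how a new baseline would plug into this interface; the DummyBaseline name and body are illustrative only:

from baselines.automl_baseline import AutoMLBaseline


class DummyBaseline(AutoMLBaseline):
    def run(self):
        # Gather results, persist them, and return them,
        # mirroring AutoSklearnBaseline.run().
        results = {'run_label': 'Dummy', 'accuracy': -1.0}
        DummyBaseline.save_on_disk(results.values())
        return results

    @staticmethod
    def save_on_disk(data):
        # Stand-in for the CSV writer used by AutoSklearnBaseline.
        print(*data, sep=',')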

5 changes: 2 additions & 3 deletions examples/0_loading_data/load_list_of_datasests.py
@@ -6,9 +6,8 @@ def get_datasets():
'nomao', 'sylvine', 'kc1', 'jungle_chess_2pcs_raw_endgame_complete', 'credit-g', 'delta_ailerons', 'pol'
]
datasets_loader = OpenMLDatasetsLoader()
datasets = datasets_loader.load(dataset_names)
print(f'Datasets "{", ".join(dataset_names)}" are available at the paths:')
print('\n'.join(str(d) for d in datasets))
datasets = datasets_loader.load(dataset_names, allow_names=True)
print(f'Datasets "{", ".join(dataset_names)}" are downloaded.')
return datasets


@@ -1,3 +1,5 @@
import openml

from meta_automl.data_preparation.datasets_loaders import OpenMLDatasetsLoader
from meta_automl.data_preparation.meta_features_extractors import PymfeExtractor

@@ -6,8 +8,9 @@ def main():
dataset_names = [
'nomao', 'sylvine'
]
dataset_ids = [openml.datasets.get_dataset(name, download_data=False, download_qualities=False).dataset_id for name in dataset_names]
extractor = PymfeExtractor(extractor_params={'groups': 'general'}, datasets_loader=OpenMLDatasetsLoader())
meta_features = extractor.extract(dataset_names)
meta_features = extractor.extract(dataset_ids)
return meta_features


@@ -9,8 +9,8 @@ def main():
loader = OpenMLDatasetsLoader()
extractor = PymfeExtractor(extractor_params={'groups': 'general'})

cached_datasets = loader.load(dataset_names)
meta_features = extractor.extract(cached_datasets)
datasets = loader.load(dataset_names, allow_names=True)
meta_features = extractor.extract(datasets)
return meta_features


@@ -2,24 +2,25 @@

from meta_automl.data_preparation.datasets_loaders import OpenMLDatasetsLoader
from meta_automl.data_preparation.meta_features_extractors import PymfeExtractor
from meta_automl.meta_algorithm.datasets_similarity_assessors import KNNSimilarityAssessor
from meta_automl.meta_algorithm.datasets_similarity_assessors import KNeighborsBasedSimilarityAssessor


def main():
# Define datasets.
dataset_names = ['monks-problems-1', 'apsfailure', 'australian', 'bank-marketing']
datasets = OpenMLDatasetsLoader().load(dataset_names, allow_names=True)
# Extract meta-features and load on demand.
extractor = PymfeExtractor(extractor_params={'groups': 'general'}, datasets_loader=OpenMLDatasetsLoader())
meta_features = extractor.extract(dataset_names)
extractor = PymfeExtractor(extractor_params={'groups': 'general'})
meta_features = extractor.extract(datasets)
# Preprocess meta-features, as KNN does not support NaNs.
meta_features = meta_features.dropna(axis=1, how='any')
# Split datasets to train (preprocessing) and test (actual meta-algorithm objects).
x_train, x_test = train_test_split(meta_features, train_size=0.75, random_state=42)
y_train = x_train.index
assessor = KNNSimilarityAssessor({'n_neighbors': 1}, n_best=2)
assessor = KNeighborsBasedSimilarityAssessor(n_neighbors=3)
assessor.fit(x_train, y_train)
# Get models for the best fitting datasets from train.
return x_test.index, assessor.predict(x_test)
return x_test.index, assessor.predict(x_test, return_distance=True)


if __name__ == '__main__':
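
A sketch of consuming this example's output; the exact structure returned by predict(..., return_distance=True) is an assumption based on scikit-learn's kneighbors convention of returning distances alongside neighbor identifiers:

test_dataset_names, predictions = main()
print(test_dataset_names)
# Assumed to pair each test dataset with its nearest train datasets
# and the corresponding distances in meta-feature space.
print(predictions)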