2. System Architecture
This page gives an overview of the architecture of this project. After reading it you should have a solid basis for understanding and contributing to the project. As a supplementary resource you can also read this publication. Most of the screenshots are taken from there.
The architecture is divided into layers:
- Presentation layer: the user interface (frontend) is built with Blazor and lets the user interact with OMA-ML.
- Logic layer: the central component, which responds to incoming frontend requests and executes new training sessions using the AutoML adapters and the integrated Blackboard.
- AutoML Libraries: realized as adapters that implement the AutoML solutions used to generate new ML models.
- Data layer: the persistence layer, which provides the ability to reason using the ML Ontology and to retrieve or add records in MongoDB.
The data schema used within MongoDB can be seen below.
This is an example:
{
"_id": {"633db7c62d3f570e97e76338"},
"name": "Titanic",
"type": ":tabular",
"analysis": {
"size_bytes": 64794,
"number_of_columns": 12,
"number_of_rows": 891,
"missings_per_column": {
"PassengerId": 0,
"Age": 177,.....
},
"missings_per_row": [],
"outliers": { "Pclass": [], "Age": [630, 851] },
"duplicate_columns": ["Survived"],
"duplicate_rows": [100,230],
"columns_datatype": {
"PassengerId": ":integer",
"Survived": ":boolean"
},
"plots": [DATASET ADVANCED ANALYSIS SEE BELOW]
},
"model_ids": ["6321a7d0483607178754bcbf"],
"path": "\\app-data\\datasets\\USER_IDENTIFIER\\DATASET_IDENTIFIER\\titanic_train.csv",
"creation_time": 1663146314.8633738,
"file_name": "titanic_train.csv",
"file_configuration": "{\"use_header\":true, \"start_row\":1, \"delimiter\":\"comma\", \"escape_character\":\"\\\\\", \"decimal_character\":\".\"}",
"online_predictions": [{
"file_name": "titanic_test.csv",
"path": "\\app-data\\datasets\\USER_IDENTIFIER\\DATASET_IDENTIFIER\\test_data\\titanic_test.csv",
"upload_datetime": 1663146314.8633738,
"prediction_datasets": ["PREDICTION_DATASET_IDENTIFIER"]
}]
}
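As an illustration of how the analysis block of such a dataset record can be consumed, the following minimal Python sketch lists the columns that contain missing values together with their share of all rows. The function name columns_with_missing is hypothetical and not part of the project, and the record is a shortened copy of the Titanic example.

```python
def columns_with_missing(dataset):
    """Map each column that has missing values to its share of missing rows."""
    analysis = dataset["analysis"]
    rows = analysis["number_of_rows"]
    return {
        column: count / rows
        for column, count in analysis["missings_per_column"].items()
        if count > 0
    }


# Shortened copy of the Titanic example record (only the fields used here).
titanic = {
    "name": "Titanic",
    "analysis": {
        "number_of_rows": 891,
        "missings_per_column": {"PassengerId": 0, "Age": 177},
    },
}

for column, share in columns_with_missing(titanic).items():
    print(f"{column}: {share:.1%} missing")
```

Columns without missing values are dropped from the result, so an empty dictionary means the dataset is complete.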
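The plots array of the analysis (the advanced dataset analysis) contains entries of the following form:
{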
"title": "Correlation Matrix",
"items": [{
"type": "correlation_matrix",
"title": "Correlation matrix",
"description": "Higher values indicate greater correlation between features",
"path": "\\app-data\\datasets\\USER_IDENTIFIER\\DATASET_IDENTIFIER\\plots\\correlation_matrix.svg"
}]
}, {
"title": "Correlation analysis",
"items": [{
"type": "feature_imbalance_plot",
"title": "Feature imbalance plot of [Pclass, Cabin]",
"description": "This plot shows the 100 most common combinations of ....",
"path": "\\app-data\\datasets\\USER_IDENTIFIER\\DATASET_IDENTIFIER\\plots\\feature_imbalance_Pclass_vs_Cabin.svg"
}]
}, {
"title": "Column analysis",
"items": [{
"type": "column_plot",
"title": "PassengerId",
"description": "This plot shows the PassengerId column",
"path": "\\app-data\\datasets\\USER_IDENTIFIER\\DATASET_IDENTIFIER\\plots\\PassengerId_column_plot.svg"
}]
}
A training record has the following form:
{
"_id": { "TRAINING_IDENTIFIER"},
"dataset_id": "DATASET_IDENTIFIER",
"configuration": {
"task": ":tabular_classification",
"target": "Survived",
"enabled_strategies": [":data_preparation_ignore_redundant_features"],
"runtime_limit": 3,
"metric": ":accuracy",
"selected_auto_ml_solutions": [":autokeras"],
"selected_ml_libraries": [":keras_lib"]
},
"dataset_configuration": {
"column_datatypes": {
"PassengerId": ":integer",.......
},
"file_configuration": {
"use_header": true,
"start_row": 1,
"delimiter": "comma",
"escape_character": "\\",
"decimal_character": "."
}
},
"status": "completed",
"model_ids": ["MODEL_IDENTIFIER"],
"runtime_profile": {
"start_time": "2022-09-15T11:21:37.988+00:00",
"events": [SEE BELOW TRAINING EVENT],
"end_time": "2022-09-15T11:23:37.088+00:00"
}
}
The events of a runtime profile are training events of the following form:
{
"type": "phase_updated",
"meta": {
"old_phase": null,
"new_phase": "started"
},
"timestamp": "2022-09-15T11:23:37.088+00:00"
}, {
"type": "phase_updated",
"meta": {
"old_phase": "started",
"new_phase": "preprocessing"
},
"timestamp": {
"$date": {
"$numberLong": "1663157231689"
}
}
}, {
"type": "strategy_action",
"meta": {
"rule_name": "data_preparation.finish_preprocessing",
"result": null
},
"timestamp": {
"$date": {
"$numberLong": "1663157231764"
}
}
}, {
"type": "phase_updated",
"meta": {
"old_phase": "preprocessing",
"new_phase": "running"
},
"timestamp": {
"$date": {
"$numberLong": "1663157232768"
}
}
}, {
"type": "automl_run_finished",
"meta": {
"name": ":autokeras",
"run_metrics": {
"status": "completed",
"test_score": 0,
"validation_score": 0,
"runtime": 72,
"prediction_time": 18.47035789489746,
"model": ":artificial_neural_network",
"library": ":keras_lib"
}
},
"timestamp": {
"$date": {
"$numberLong": "1663157308621"
}
}
}, {
"type": "phase_updated",
"meta": {
"old_phase": "running",
"new_phase": "stopped"
},
"timestamp": {
"$date": {
"$numberLong": "1663157308624"
}
}
}
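The events above store their timestamps in two shapes: as an ISO 8601 string and as a MongoDB extended JSON mapping with milliseconds since the epoch. The following Python sketch, under the assumption that only these two shapes occur, normalizes both and derives how long each phase lasted. The helper names parse_timestamp and phase_durations are hypothetical and not part of the project.

```python
from datetime import datetime, timedelta, timezone

EPOCH = datetime(1970, 1, 1, tzinfo=timezone.utc)


def parse_timestamp(value):
    """Accept an ISO 8601 string or a {"$date": {"$numberLong": "<ms>"}} mapping."""
    if isinstance(value, str):
        return datetime.fromisoformat(value)
    millis = int(value["$date"]["$numberLong"])
    return EPOCH + timedelta(milliseconds=millis)


def phase_durations(events):
    """Return the seconds spent in each phase, based on phase_updated events."""
    changes = [
        (event["meta"]["new_phase"], parse_timestamp(event["timestamp"]))
        for event in events
        if event["type"] == "phase_updated"
    ]
    durations = {}
    # A phase lasts from its own event until the next phase change.
    for (phase, start), (_, end) in zip(changes, changes[1:]):
        durations[phase] = durations.get(phase, 0.0) + (end - start).total_seconds()
    return durations


def _ms(value):
    """Build the extended JSON timestamp shape used in the example events."""
    return {"$date": {"$numberLong": str(value)}}


events = [
    {"type": "phase_updated",
     "meta": {"old_phase": "started", "new_phase": "preprocessing"},
     "timestamp": _ms(1663157231689)},
    {"type": "strategy_action",
     "meta": {"rule_name": "data_preparation.finish_preprocessing", "result": None},
     "timestamp": _ms(1663157231764)},
    {"type": "phase_updated",
     "meta": {"old_phase": "preprocessing", "new_phase": "running"},
     "timestamp": _ms(1663157232768)},
    {"type": "phase_updated",
     "meta": {"old_phase": "running", "new_phase": "stopped"},
     "timestamp": _ms(1663157308624)},
]

print(phase_durations(events))
```

The last phase has no following phase change, so it does not appear in the result.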
A model record has the following form:
{
"_id": "MODEL_IDENTIFIER",
"training_id": "TRAINING_ID",
"prediction_ids": ["PREDICTION_ID"],
"auto_ml_solution": ":autokeras",
"path": "/app-data/training\\USER_IDENTIFIER\\TRAINING_IDENTIFIER\\export\\keras-export.zip",
"test_score": 0.8193041682243347,
"runtime_profile": {
"start_time": "2022-09-15T11:23:37.088+00:00",
"end_time": "2022-09-15T11:23:37.088+00:00",
"hardware_configuration": //TODO
"carbon_footprint": //TODO
},
"ml_model_type": ":artificial_neural_network",
"ml_library": ":keras_lib",
"status_messages": ["\n", "Search: Running Trial #1\n", "\n", "Value\n", "32 ......"],
"prediction_time": 18.47035789489746,
"explanation": {
"status": "finished",
"detail": "5 plots created",
"content": [{
"title": "SHAP Explanation",
"items": [{
"type": "waterfall_plot",
"title": "Waterfall plot of Survived = False",
"description": "The waterfall plot shows the significance.....",
"path": "\\app-data\\training\\autokeras\\USER_IDENTIFIER\\TRAINING_IDENTIFIER\\result\\plots\\waterfall_Survived_False.svg"
}]
}]
}
}
A prediction record has the following form:
{
"_id": "PREDICTION_ID",
"model_id": "MODEL_ID",
"live_dataset_path": "\\app-data\\datasets\\USER_ID\\DATASET_ID\\predictions\\PREDICTION_ID\\titanic_test.csv",
"prediction_path": "\\app-data\\datasets\\USER_ID\\DATASET_ID\\predictions\\PREDICTION_ID\\PREDICTION_ID_flaml.csv",
"status": "completed",
"runtime_profile": {
"start_time": {
"$date": {
"$numberLong": "1665652871877"
}
},
"end_time": {
"$date": {
"$numberLong": "1665652878111"
}
}
}
}
The docker-compose setup is defined in the following files located in the root of the meta-repository:
- docker-compose.yml
- docker-compose-frontend.yml
docker-compose.yml is the base docker-compose file, which is used in the frontend docker-compose file. It defines how to start the controller and the adapters and how to create the volumes. The most interesting parts here are the port mappings and the environment variables. The controller is the only container that needs a port mapping to the host machine, because it has to communicate with the frontend if the developer decides to start the frontend with a local C# installation. The controller and the adapters communicate with each other via the internal network that docker-compose creates automatically and therefore need no connection to the host machine. Each container is assigned a hostname by setting the attribute container_name. The DNS server of docker-compose resolves these names to the IP addresses of the containers inside the local network. For the controller we have to specify all those hostnames and the ports that the adapters listen on.
Next let us take a look at the volumes.
Each adapter gets a separate output-* volume. Furthermore, there is one volume named output, and finally one volume named datasets.
If we take a look at the volume mappings of the individual containers, we can see that all output-* volumes are mapped inside the output directory of the controller, while each adapter has its respective output-* volume mapped into a directory called output. This means every adapter sees only one output directory, but the controller has access to all of them. Finally, the controller and all adapters have access to the datasets volume. This way they can transfer the datasets for training.
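The relationship between the two views of the same files can be sketched in Python as follows. This is only an illustration under assumptions: the adapter name autokeras, the file path, and the idea that the controller mounts each output-* volume in a subdirectory named after the adapter are hypothetical. The actual mount points are the ones defined in docker-compose.yml.

```python
from pathlib import PurePosixPath


def controller_path(adapter_name, adapter_path, controller_root="output"):
    """Translate a path inside an adapter's output directory to the controller's view."""
    # The adapter only sees its own volume, mounted as "output".
    relative = PurePosixPath(adapter_path).relative_to("output")
    # Assumption: the controller sees that volume below output/<adapter_name>.
    return PurePosixPath(controller_root) / adapter_name / relative


print(controller_path("autokeras", "output/session-1/model.zip"))
```

A path outside the adapter's output directory raises a ValueError, which mirrors the fact that an adapter cannot reach the volumes of other adapters.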
The kubernetes setup is defined by the following files located in the root directories of each module:
- *-deployment.yaml
- *-service.yaml
The setup in kubernetes is essentially equivalent to the one with docker-compose. However, there are two main differences compared to the docker-compose setup:
Firstly, the file transfer, which is achieved using volumes with docker-compose, is achieved using an NFS server with kubernetes. This difference exists only in the setup: the modules themselves do not know about it, because the directories are mapped to the same locations in both cases. The mapped directories are defined in the *-deployment.yaml files of the respective modules.
Secondly, the environment variables <SERVICE_NAME>_SERVICE_HOST and <SERVICE_NAME>_SERVICE_PORT, which are defined manually in the docker-compose.yml file, are set dynamically by kubernetes without any possibility for configuration. To explain: the *-service.yaml files assign a name, which kubernetes uses as the hostname. Kubernetes then sets the environment variables <SERVICE_NAME>_SERVICE_HOST and <SERVICE_NAME>_SERVICE_PORT on a newly created pod for every service that is already running at that time. This means that we cannot choose the names arbitrarily. It also means that all adapters have to be deployed to the cluster before the controller, so that the environment variables for each adapter already exist when the controller is created and are set on it correctly.
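Because both setups expose the adapters through the same pair of environment variables, the lookup on the controller side can be sketched in a few lines of Python. This is a minimal sketch, not the project's actual code: the function name service_address, the service name autokeras, and the host and port values are hypothetical. The upper-casing and the replacement of dashes by underscores follow the way kubernetes derives the variable names from a service name.

```python
import os


def service_address(service_name, env=None):
    """Build "host:port" from <SERVICE_NAME>_SERVICE_HOST and <SERVICE_NAME>_SERVICE_PORT."""
    env = os.environ if env is None else env
    prefix = service_name.upper().replace("-", "_")
    host = env[f"{prefix}_SERVICE_HOST"]
    port = int(env[f"{prefix}_SERVICE_PORT"])  # fails early on a malformed port
    return f"{host}:{port}"


# Hypothetical values, as they could be set in docker-compose.yml or by kubernetes.
example_env = {
    "AUTOKERAS_SERVICE_HOST": "autokeras",
    "AUTOKERAS_SERVICE_PORT": "50052",
}

print(service_address("autokeras", example_env))
```

A missing variable raises a KeyError, which is the symptom to expect when the controller is created before an adapter service exists.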
gRPC is a protocol for remotely calling functions in distributed computer systems over HTTP/2 using protocol buffers. MetaAutoML uses gRPC for the communication between its components (Docker containers), e.g. the frontend, the controller and the adapters. Which functions can be called, and with which parameters, is specified in an interface description language. The corresponding scripts can be found in the controllers and adapters and are named
<Controller/Adapter>BGRPC.py
THESE FILES ARE AUTOMATICALLY GENERATED AND SHOULD NEVER BE CHANGED MANUALLY
To change this interface or add new parameters, the *_init_* files have to be regenerated and their content must replace the content of the corresponding <Controller/Adapter>BGRPC.py.
This is done as follows:
The betterproto compiler is required to compile the gRPC Python files. Create a new venv inside the /utils/Helper/RPC folder, then install the requirements listed in the requirements file for the betterproto gRPC tools.
Firstly, change the .proto files with the desired adjustments. These files can be found in /utils/Helper/RPC. Controller.proto contains the communication specification between the frontend and the controller. Adapter.proto contains the communication specification between the controller and the adapters.
Open a new terminal inside /utils/Helper/RPC and execute the respective command to rebuild the Python gRPC interface:
Controller
cd controller
python -m grpc_tools.protoc -I . --python_betterproto_out=out ControllerService.proto
Adapter
cd adapter
python -m grpc_tools.protoc -I . --python_betterproto_out=out AdapterService.proto
Overwrite the content of <Controller/Adapter>BGRPC.py in the controller and adapter locations with the content of the newly generated files.
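The two compile commands can also be wrapped in a small Python helper, which is convenient when both interfaces change at once. This is a sketch under assumptions: the helper names protoc_command and regenerate are hypothetical, it must be run from /utils/Helper/RPC inside the venv that has the requirements installed, and copying the generated content into <Controller/Adapter>BGRPC.py remains a manual step.

```python
import subprocess
import sys
from pathlib import Path


def protoc_command(proto_file, out_dir="out"):
    """Build the betterproto compile command for one .proto file."""
    return [
        sys.executable, "-m", "grpc_tools.protoc",
        "-I", ".",
        f"--python_betterproto_out={out_dir}",
        proto_file,
    ]


def regenerate(module_dir, proto_file, out_dir="out"):
    """Run the compile command inside module_dir ("controller" or "adapter")."""
    # protoc expects the output directory to exist already.
    Path(module_dir, out_dir).mkdir(parents=True, exist_ok=True)
    subprocess.run(protoc_command(proto_file, out_dir), cwd=module_dir, check=True)


# Only print the commands here; call regenerate(...) to actually run them.
for module_dir, proto_file in [
    ("controller", "ControllerService.proto"),
    ("adapter", "AdapterService.proto"),
]:
    print(f"cd {module_dir} &&", " ".join(protoc_command(proto_file)))
```

Using sys.executable ensures that the compiler runs with the interpreter of the active venv, so the installed betterproto plugin is found.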
The BlazorBoilerplate frontend does not use the generated gRPC files but instead uses its own copy of Controller.proto and all message.proto files. Therefore, to make the frontend use the changed configuration, copy the Controller.proto file to MetaAutoML-Frontend/src/Shared/BlazorBoilerplate.Constants/Protos.
The above graphic shows the architecture of the used ML ontology: