data_preprocessing pull req #1

Merged
merged 11 commits into from
Dec 31, 2021
18 changes: 18 additions & 0 deletions notebooks/README.md
@@ -0,0 +1,18 @@
## Meta Learning for Automated Data Pre-processing for clustering
### Introduction
This repository contains the code for the thesis project titled "Meta Learning for Automated Data Pre-processing for clustering". The objective of the thesis is to provide a data preprocessing pipeline for unsupervised clustering using meta learning. There is a wide range of data preprocessing methods, such as replacing missing values, scaling, and data reduction. The aim of this project is to automate data preprocessing by leveraging Automated Machine Learning (AutoML). While supervised learning has been the core focus of AutoML research, unsupervised learning has remained comparatively less explored. The thesis therefore focuses on suggesting a data preprocessing pipeline for an unsupervised clustering task by exploiting a meta-learning space and meta learners in a domain-agnostic manner. It targets users who do not have in-depth knowledge of machine learning algorithms or of how data preprocessing changes the data and the behavior of the algorithms. The thesis explores the potential of integrating data preprocessing into the cSmartML tool for unsupervised clustering. The proposed method applies meta learning and creates a knowledge space using three clustering algorithms and seven data transformation methods. The knowledge space contains the multi CVI ranking correlation score for the original and the transformed data, each clustered individually with the three algorithms. The clustering algorithms were also tuned over a range of hyperparameters to find the best setting. The initial results show that data preprocessing or transformation improves the clustering result, and that the result is comparatively better than the cSmartML results. This indicates that data preprocessing is important for unsupervised clustering tasks. Additionally, based on the meta-learning space, the project suggests a data preprocessing pipeline to the user for further clustering.


### Resources
There are two notebook files and a meta_data.csv file in this repository.
Notebook 1 contains the code for creating the meta space, which is the meta data or knowledge base on which the experiment was performed. Notebook 2 holds the code for the analysis. Each notebook has comments that describe the code. After a dataset is created for each clustering algorithm, all of the datasets are concatenated into meta_data.csv, which is later used to suggest the pipeline.
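
As a rough illustration of the concatenation step, the sketch below stacks three per-algorithm result tables and writes them out as a single CSV. It uses only the Python standard library. The column names, the row values, and the helpers `concatenate_tables` and `write_meta_data` are hypothetical placeholders and are not taken from the notebooks.

```python
import csv
import io

# Made-up rows, one small table per clustering algorithm.
# The real columns in meta_data.csv may differ.
kmeans_rows = [
    {"dataset": "d1", "algorithm": "KMeans", "pipeline": "none", "multi_cvi_score": 0.41},
]
agglomerative_rows = [
    {"dataset": "d1", "algorithm": "Agglomerative", "pipeline": "none", "multi_cvi_score": 0.38},
]
birch_rows = [
    {"dataset": "d1", "algorithm": "Birch", "pipeline": "none", "multi_cvi_score": 0.35},
]


def concatenate_tables(*tables):
    """Stack row lists that share the same columns into one list."""
    combined = []
    for table in tables:
        combined.extend(table)
    return combined


def write_meta_data(rows, handle):
    """Write the combined rows as CSV, with a header taken from the first row."""
    writer = csv.DictWriter(handle, fieldnames=list(rows[0].keys()))
    writer.writeheader()
    writer.writerows(rows)


meta_rows = concatenate_tables(kmeans_rows, agglomerative_rows, birch_rows)

# An in-memory buffer stands in for open("meta_data.csv", "w", newline="").
buffer = io.StringIO()
write_meta_data(meta_rows, buffer)
```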

### Problem
Data preprocessing is an integral and very time-consuming part of any data analysis project. It requires in-depth knowledge of the machine learning algorithm and a mathematical understanding of how the preprocessing transforms the data. However, most people do not come from a computer science or mathematics background and do not know these nuances, so it is difficult for them to choose which algorithm would perform best and which data preprocessing would best serve the objective of their analysis. The problem is more prominent in unsupervised clustering tasks.
Traditional machine learning models cannot solve this challenge, hence the emergence of AutoML, which provides an automated machine learning pipeline for a specific task to optimize. Although AutoML research has gained traction in recent years, specifically in the domain of supervised learning, research on unsupervised learning has not kept pace. This disproportion is generally due to the lack of the information needed for validation in unsupervised problems, referred to as the "ground truth": the true labels of the clusters. This makes it complicated to derive an optimization function that can be maximized or minimized to evaluate the quality of clusters during the AutoML process. The motivation behind this thesis stems from this problem, which it aims to address.

### Methodology
First, we cluster the original data using three clustering algorithms: KMeans, Agglomerative, and Birch. Each dataset is thus clustered three times with a set of hyperparameters, and the multi CVI score for each clustering is recorded. We then apply the transformations to each dataset and repeat the process, which creates more data. Each run generates a row of data containing the meta features of the specific dataset, the algorithm and hyperparameter setting, and the multi CVI correlation score. Each transformation (data preprocessing technique) is first applied on its own to the dataset, and the multi CVI score is calculated. Then combinations of two transformations are applied to the dataset and the same process is performed again, which gives us more data for the knowledge space. Later, we perform an analysis to check whether preprocessing led to a better clustering result based on the multi CVI score. Given a new dataset, we find its nearest neighbor based on the meta data, and the pipeline that gave the best result based on the multi CVI score is suggested as the best-performing pipeline. The bigger and wider the knowledge space is, the more accurate the suggestion will be. <br><br>
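
The two mechanical parts of this procedure, enumerating the candidate pipelines and looking up a suggestion for a new dataset, can be sketched as follows. This is a minimal illustration under stated assumptions, not the code from the notebooks. The transformation names are placeholders (the README only says there are seven), pairs are treated as unordered, the distance is plain Euclidean distance over unscaled meta features, and a higher score is assumed to be better. The functions `candidate_pipelines` and `suggest_pipeline` and all row values are hypothetical.

```python
import itertools
import math

# Placeholder names for the seven transformations (not named in this README).
TRANSFORMATIONS = [f"transform_{i}" for i in range(1, 8)]


def candidate_pipelines(transformations):
    """Return the empty pipeline, each single transformation, and each pair.

    Assumption: a pair is unordered. Swap in itertools.permutations if the
    order of the two transformations should count as a different pipeline.
    """
    pipelines = [()]
    pipelines.extend((t,) for t in transformations)
    pipelines.extend(itertools.combinations(transformations, 2))
    return pipelines


def suggest_pipeline(knowledge_space, new_meta_features):
    """Pick the best-scoring row of the dataset nearest to the new one.

    Assumption: Euclidean distance over meta features, higher score is better.
    """
    nearest = min(
        knowledge_space,
        key=lambda row: math.dist(row["meta_features"], new_meta_features),
    )
    same_dataset = [
        row for row in knowledge_space if row["dataset"] == nearest["dataset"]
    ]
    return max(same_dataset, key=lambda row: row["score"])


# Made-up knowledge-space rows: meta features, algorithm, pipeline, score.
knowledge_space = [
    {"dataset": "A", "meta_features": (100.0, 4.0), "algorithm": "KMeans",
     "pipeline": (), "score": 0.30},
    {"dataset": "A", "meta_features": (100.0, 4.0), "algorithm": "Birch",
     "pipeline": ("transform_1",), "score": 0.55},
    {"dataset": "B", "meta_features": (5000.0, 20.0), "algorithm": "Agglomerative",
     "pipeline": ("transform_2", "transform_3"), "score": 0.70},
]

pipelines = candidate_pipelines(TRANSFORMATIONS)
best = suggest_pipeline(knowledge_space, (120.0, 5.0))
print(len(pipelines), best["algorithm"], best["pipeline"])
```

With seven transformations this yields 1 + 7 + 21 = 29 candidate pipelines per algorithm and hyperparameter setting. In a real knowledge space the meta features would likely need to be scaled before distances are compared, since features with large ranges would otherwise dominate the nearest-neighbor lookup.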

The results can be found in the paper.