CS 6375 Machine Learning Project
Team members
- Ankita Patil
- Abhilash Gudasi
- Shiva Chawala
Project Source: DrivenData
Challenge Name: Pump It Up: Data Mining the Water Table
Goal: Predict the operating condition of waterpoints in Tanzania, i.e. determine whether each water pump is functional, non-functional, or in need of repair.
- Data Exploration
- Univariate Analysis
- Correlation Graph
- Pre-processing/Feature Engineering
- Algorithm Implementation
- Model training and parameter tuning using GridSearchCV
- With train-test split
- With k-fold cross validation
- Evaluate the accuracy of the trained model on the test data
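The tuning and evaluation steps listed above can be sketched roughly as follows. This is a minimal illustration with scikit-learn, not the project's actual code: the synthetic three-class data, the choice of RandomForestClassifier, and the parameter grid are all assumptions made for the example.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV, cross_val_score, train_test_split

# Synthetic three-class data standing in for the processed pump features (assumption).
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)

# Train-test split: hold out 25% of the rows for the final accuracy check.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

# Illustrative grid only; the grids used in the project are not shown in this README.
param_grid = {"n_estimators": [10, 30], "max_depth": [3, None]}
search = GridSearchCV(RandomForestClassifier(random_state=0),
                      param_grid, cv=3, scoring="accuracy")
search.fit(X_train, y_train)

# Accuracy of the best model on the held-out test data.
test_accuracy = search.score(X_test, y_test)

# Accuracy of the same configuration under 5-fold cross-validation.
cv_scores = cross_val_score(search.best_estimator_, X, y, cv=5)

print("Best parameters:", search.best_params_)
print("Test accuracy:", round(test_accuracy, 3))
print("Mean 5-fold accuracy:", round(cv_scores.mean(), 3))
```

GridSearchCV refits the best parameter combination on the full training split by default, which is why `search.score` can be called directly on the test data.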
Justification for selecting supervised machine learning algorithms
Exploring the datasets shows that a label is provided for every instance. Because the data is labeled, supervised machine learning algorithms can be applied.
The following five algorithms are used to build models for the Pump It Up: Data Mining the Water Table dataset:
- Logistic Regression
- Support Vector Machine
- AdaBoost
- Neural Network
- Random Forest
Folder structure as submitted:
Root (Pumpitup)
|
|---Code
|   |---Final
|       |---BestModel
|       |---GridSearch
|       |---Pre-process
|       |---ROC
|
|---Datasets
    |---Pump_it_Up_Data_Mining_the_Water_Table_-_Training_set_labels.csv
    |---Pump_it_Up_Data_Mining_the_Water_Table_-_Training_set_values.csv
    |---test_values_processed.csv
    |---train_values_processed.csv
    |---heights.csv
- The root folder (Pumpitup) is divided into two folders: one for code and one for datasets.
Code-->Final:
Inside the Final folder:
- BestModel contains the best model obtained in this project for each of the five techniques.
- GridSearch contains our initial exploration, in which different parameters were tried to find the best model.
- Pre-process contains two Python files: one generates the preprocessed data and the other fills in the missing gps_height values (gps_height is one of the attributes in the given dataset). The preprocessing file writes the processed CSV files to the Datasets folder; the other file computes the missing gps_height values and writes heights.csv to the Datasets folder.
- ROC contains the Python file that generates the ROC curves for the output of all the best models.
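Since the target has three classes, ROC curves are usually computed one class at a time (one-vs-rest). The sketch below shows that approach under stated assumptions: the data is synthetic, LogisticRegression is a stand-in for any of the best models, and the class names are placeholders taken from the goal statement. It is not the code in the ROC folder.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import auc, roc_curve
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import label_binarize

# Placeholder class names for the integer labels 0, 1, 2 (assumption).
class_names = ["functional", "needs repair", "non-functional"]

# Synthetic three-class data standing in for the processed pump features (assumption).
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
scores = model.predict_proba(X_test)                   # one probability column per class
y_onehot = label_binarize(y_test, classes=[0, 1, 2])   # one indicator column per class

# One-vs-rest: treat each class in turn as "positive" and the other two as "negative".
roc_auc = {}
for i, name in enumerate(class_names):
    fpr, tpr, _ = roc_curve(y_onehot[:, i], scores[:, i])
    roc_auc[name] = auc(fpr, tpr)
    print(name, "AUC:", round(roc_auc[name], 3))
```

Each `fpr`/`tpr` pair can be passed to a plotting library such as matplotlib to draw the curve for that class.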
Datasets: This folder contains the dataset (values and labels) downloaded from the DrivenData competition website for the Pump It Up problem, along with the preprocessed datasets.
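Because the scripts rely on this layout, it can help to check it before running anything. The helper below is a hypothetical convenience script, not part of the submission; the folder names are copied from the structure described above.

```python
from pathlib import Path

# Sub-folders expected under the root folder, as described in this README.
EXPECTED_DIRS = [
    "Code/Final/BestModel",
    "Code/Final/GridSearch",
    "Code/Final/Pre-process",
    "Code/Final/ROC",
    "Datasets",
]


def missing_dirs(root):
    """Return the expected sub-folders that do not exist under root."""
    root = Path(root)
    return [d for d in EXPECTED_DIRS if not (root / d).is_dir()]


if __name__ == "__main__":
    # Run from the root folder (Pumpitup).
    missing = missing_dirs(".")
    if missing:
        print("Missing folders:", ", ".join(missing))
    else:
        print("Folder structure looks complete.")
```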
To run the code (assuming the folder structure described above is maintained):
- Preprocessing: The preprocessing code displays several plots, so it cannot be run from the command line. Run it in Jupyter using the PumpItUpPreprocessing.ipynb file instead.
- Running the best models: Go into the BestModel folder and run one of the commands below.
Syntax:
python Modelname.py
Examples:
python AdaBoostBest.py
python DeepLearningBest.py
python LogisticRegressionBest.py
python RandomForestBest.py
python SVMBest.py
After running one of the above commands, you will see the following output:
- Confusion matrix
- Classification report
- Accuracy of the model using a train-test split
- Accuracy of the model using k-fold cross-validation
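For readers unfamiliar with these four outputs, the sketch below shows one way they could be produced with scikit-learn. It is an illustration under assumptions (synthetic three-class data, LogisticRegression as a stand-in model, 5 folds), not the code in the BestModel scripts.

```python
from sklearn.datasets import make_classification
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import accuracy_score, classification_report, confusion_matrix
from sklearn.model_selection import StratifiedKFold, cross_val_score, train_test_split

# Synthetic three-class data standing in for the processed pump features (assumption).
X, y = make_classification(n_samples=300, n_features=10, n_informative=5,
                           n_classes=3, random_state=0)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0)

model = LogisticRegression(max_iter=1000).fit(X_train, y_train)
y_pred = model.predict(X_test)

# Confusion matrix: rows are true classes, columns are predicted classes.
cm = confusion_matrix(y_test, y_pred)
print(cm)

# Per-class precision, recall, and F1 score.
print(classification_report(y_test, y_pred))

# Accuracy on the held-out part of the train-test split.
split_accuracy = accuracy_score(y_test, y_pred)
print("Train-test split accuracy:", round(split_accuracy, 3))

# Mean accuracy over 5 stratified folds of the full data.
kfold = StratifiedKFold(n_splits=5, shuffle=True, random_state=0)
kfold_accuracy = cross_val_score(
    LogisticRegression(max_iter=1000), X, y, cv=kfold).mean()
print("5-fold cross-validation accuracy:", round(kfold_accuracy, 3))
```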