This project predicts the occurrence of diabetes using machine learning techniques. The analysis uses the "diabetes_prediction_dataset.csv" file, which contains features describing an individual's health condition.
- The dataset is loaded into a Pandas DataFrame.
- A random sample of 1000 instances is selected from the initial dataset.
- Categorical features like 'gender' and 'smoking_history' are label-encoded.
- The target variable 'diabetes' is separated from the feature variables.
- The Synthetic Minority Over-sampling Technique (SMOTE) is applied to balance the target variable classes.
- A correlation heatmap is generated to visualize the relationships between features.
- The age feature is binned into groups for better visualization.
- Feature scaling is performed using StandardScaler.
- A grid search is performed to find the optimal hyperparameters for Random Forest and AdaBoost classifiers.
- The feature importances of the Random Forest model are visualized.
- The dataset is split into training and testing sets.
- Several machine learning models are trained and evaluated on the test set:
  - Random Forest Classifier
  - Logistic Regression
  - AdaBoost Classifier
  - Decision Tree Classifier
  - Gradient Boosting Classifier
- Performance metrics like accuracy, precision, recall, and F1-score are calculated for each model.
- The Receiver Operating Characteristic (ROC) curve is plotted for the Gradient Boosting Classifier, and its Area Under the Curve (AUC) is computed.
- A simple neural network is trained and evaluated on the dataset.
- A new data point is defined for prediction.
- The new data is scaled using the same scaler as the training data.
- The trained Gradient Boosting Classifier model predicts the probability of diabetes for the new data point.
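The core of the steps above (label encoding, train/test split, scaling, training the Gradient Boosting Classifier, and predicting for a new data point) can be sketched roughly as follows. This is a minimal illustration, not the project's actual script. It uses a small synthetic DataFrame instead of the CSV file, so the 'bmi' column, all values, and the new data point are invented for the example. SMOTE, the grid search, and the other models are left out to keep it short, and the order of the steps here is an assumption.

```python
import numpy as np
import pandas as pd
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import accuracy_score, roc_auc_score
from sklearn.model_selection import train_test_split
from sklearn.preprocessing import LabelEncoder, StandardScaler

# Synthetic stand-in for diabetes_prediction_dataset.csv.
# Assumption: 'bmi' and every value below are invented for illustration.
rng = np.random.default_rng(0)
n = 1000
df = pd.DataFrame({
    "gender": rng.choice(["Female", "Male"], size=n),
    "smoking_history": rng.choice(["never", "former", "current"], size=n),
    "age": rng.integers(18, 80, size=n),
    "bmi": rng.normal(27, 5, size=n),
})
# Make the label depend on age and bmi so the model has something to learn.
risk = 0.04 * (df["age"] - 50) + 0.15 * (df["bmi"] - 27)
df["diabetes"] = (risk + rng.normal(0, 1, size=n) > 0.5).astype(int)

# Label-encode the categorical columns, keeping each encoder for later reuse.
encoders = {}
for col in ["gender", "smoking_history"]:
    encoders[col] = LabelEncoder()
    df[col] = encoders[col].fit_transform(df[col])

# Separate the target from the features, then split into train and test sets.
X = df.drop(columns="diabetes")
y = df["diabetes"]
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=42
)

# Fit the scaler on the training features and reuse it everywhere else.
scaler = StandardScaler()
X_train_scaled = scaler.fit_transform(X_train)
X_test_scaled = scaler.transform(X_test)

model = GradientBoostingClassifier(random_state=42)
model.fit(X_train_scaled, y_train)

# Evaluate on the held-out test set.
test_proba = model.predict_proba(X_test_scaled)[:, 1]
accuracy = accuracy_score(y_test, model.predict(X_test_scaled))
auc = roc_auc_score(y_test, test_proba)
print(f"accuracy={accuracy:.3f}  auc={auc:.3f}")

# A hypothetical new data point, encoded and scaled like the training data.
new_point = pd.DataFrame(
    [{"gender": "Male", "smoking_history": "former", "age": 55, "bmi": 31.0}]
)
for col, encoder in encoders.items():
    new_point[col] = encoder.transform(new_point[col])
new_point = new_point[X.columns]  # same column order as during training
new_proba = model.predict_proba(scaler.transform(new_point))[0, 1]
print(f"predicted probability of diabetes: {new_proba:.3f}")
```

Keeping the fitted encoders and scaler around is what allows the new data point to go through exactly the same transformations as the training data before prediction.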
To run this project, you'll need to have the following dependencies installed:
- Python 3
- NumPy
- Pandas
- Matplotlib
- Seaborn
- Scikit-learn
- Imbalanced-learn
- TensorFlow (for the neural network part)
The 'diabetes_prediction_dataset.csv' file must also be in the same directory as your Python script.
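As a quick sanity check before running the project, a short script along these lines can report which of the listed libraries are not yet installed. It is only a sketch: the `REQUIRED` mapping is a name chosen for this example, and it pairs each package name with the module name used to import it, since the two differ for Scikit-learn and Imbalanced-learn.

```python
import importlib.util

# Package name (as installed) -> module name (as imported).
REQUIRED = {
    "numpy": "numpy",
    "pandas": "pandas",
    "matplotlib": "matplotlib",
    "seaborn": "seaborn",
    "scikit-learn": "sklearn",
    "imbalanced-learn": "imblearn",
    "tensorflow": "tensorflow",
}

# find_spec returns None when a top-level module cannot be found.
missing = [
    package
    for package, module in REQUIRED.items()
    if importlib.util.find_spec(module) is None
]

if missing:
    print("Missing packages:", ", ".join(missing))
else:
    print("All dependencies are installed.")
```

Any packages it reports can typically be installed with pip, for example `pip install scikit-learn imbalanced-learn`.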
Contributions are welcome! If you find any issues or have suggestions for improvements, please open an issue or submit a pull request.