ml-project-1-los_caballeros_de_bogota created by GitHub Classroom

CS-433/ml-project-1-los_caballeros_de_bogota

CS-433: Machine Learning Fall 2022, Project 1

Getting started

Project description

The aim of this machine learning project is to predict whether a decay signature comes from a Higgs boson or from some other particle. The model takes as input a vector of features describing a collision event between two high-speed protons. More details about the project are available in references/project1_description.pdf. Here, a regularized logistic regression is implemented and trained on 8 subsets of the full dataset.
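As a rough illustration of the technique named above, here is a minimal NumPy sketch of L2-regularized logistic regression trained by gradient descent. It is not taken from implementations.py: the function names, the {0, 1} label encoding, the penalty form `lambda_ * ||w||^2` and the toy data are all assumptions made for this example.

```python
import numpy as np


def sigmoid(t):
    """Logistic function, written with tanh so it stays stable for large |t|."""
    return 0.5 * (1.0 + np.tanh(0.5 * t))


def reg_logistic_loss(y, tx, w, lambda_):
    """Mean negative log-likelihood plus an L2 penalty (labels y in {0, 1})."""
    z = tx @ w
    # log(1 + exp(z)) is computed stably with logaddexp
    return np.mean(np.logaddexp(0.0, z) - y * z) + lambda_ * np.sum(w ** 2)


def reg_logistic_gradient(y, tx, w, lambda_):
    """Gradient of reg_logistic_loss with respect to w."""
    return tx.T @ (sigmoid(tx @ w) - y) / len(y) + 2.0 * lambda_ * w


def reg_logistic_regression(y, tx, lambda_, initial_w, max_iters, gamma):
    """Run max_iters gradient descent steps with step size gamma."""
    w = initial_w.copy()
    for _ in range(max_iters):
        w = w - gamma * reg_logistic_gradient(y, tx, w, lambda_)
    return w, reg_logistic_loss(y, tx, w, lambda_)


# Toy, linearly separable data (hypothetical, not the Higgs dataset)
rng = np.random.default_rng(0)
x = rng.normal(size=(200, 2))
y = (x[:, 0] + x[:, 1] > 0).astype(float)
tx = np.c_[np.ones(len(x)), x]  # prepend a bias column

w, loss = reg_logistic_regression(y, tx, 1e-3, np.zeros(3), 500, 0.5)
train_accuracy = np.mean((tx @ w > 0) == (y == 1))
```

In the project, one such model would be fitted per subset, with `lambda_` chosen by cross-validation.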

Data

The dataset comes from a popular machine learning challenge, finding the Higgs boson, which uses original data from CERN. The dataset is available at https://www.aicrowd.com/challenges/epfl-machine-learning-higgs. To reproduce the results, a folder data/ should be added to the repo, as described in Repo Architecture. A detailed description of the dataset is available in references/The_Higgs_boson_ML_challenge.pdf.

Report

All the details about the choices that have been made and the methodology used throughout this project are available in report.pdf. The report explains the assumptions, decisions and results of the project.

Reproduce results

Requirements

  • Python==3.9.13
  • Numpy==1.21.5
  • Matplotlib

Repo Architecture

  
├─── data
    ├─── submission.csv: File generated by run.py. Contains predictions for the samples in test.csv.
    ├─── test.csv: File containing the samples to be predicted.
    ├─── train.csv: File with labeled samples used for training.
├─── figures
    ├─── mass.jpeg: plots of the performance during training for model mass-all_jet
    ├─── mass_jet0.png: plots of the performance during training for model mass-jet0
    ├─── mass_jet1.png: plots of the performance during training for model mass-jet1
    ├─── mass_jet2.png: plots of the performance during training for model mass-jet2
    ├─── mass_jet3.png: plots of the performance during training for model mass-jet3
    ├─── no_mass.jpeg: plots of the performance during training for model no_mass-all_jet
    ├─── no_mass_jet0.png: plots of the performance during training for model no_mass-jet0
    ├─── no_mass_jet1.png: plots of the performance during training for model no_mass-jet1
    ├─── no_mass_jet2.png: plots of the performance during training for model no_mass-jet2
    ├─── no_mass_jet3.png: plots of the performance during training for model no_mass-jet3
├─── notebooks
    ├─── data_analysis.ipynb: Exploratory data analysis notebook. Helps to visualize the distributions of the features.
    ├─── experiments.ipynb: Notebook assessing the performance of very basic models.
├─── references
    ├─── project1_description.pdf: Original description of the project provided by EPFL.
    ├─── The_Higgs_boson_ML_challenge.pdf: Reference used to understand features of the dataset.
├─── src
    ├─── __init__.py: File to define src directory as a python package
    ├─── best_models.pkl: Contains the best pretrained model. Used by plot_performance.py to plot the performance.
    ├─── best_params.pkl: File generated by optimization.py. Contains the best degree and lambda_ for each sub-model. This file is loaded in run.py.
    ├─── data_processing.py: File containing implementations to process the raw data.
    ├─── helpers.py: File provided by EPFL containing methods to load the data and create submissions for AIcrowd.
    ├─── model.py: File containing definition of the class Model
    ├─── utils.py: File containing useful functions for computation and visualization.
├─── implementations.py: File containing the basic ML implementations asked for in the project description.
├─── optimization.py: File used to optimize parameters. Performs cross-validation and saves the best parameters in best_params.pkl.
├─── plot_performance.py: Plots the performance of the pretrained best model.
├─── README.md: README
├─── report.pdf: Report explaining the choices that have been made.
└─── run.py: File that loads the dataset, trains the models with the parameters in best_params.pkl and generates submission.csv.
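The file names under figures/ (mass_jet0 to mass_jet3 and no_mass_jet0 to no_mass_jet3) suggest that the 8 subsets are formed from whether the mass feature is defined and from the number of jets. The sketch below shows one way such a split could be computed. It is an assumption, not the code in src/data_processing.py: the column indices (DER_mass_MMC at 0, PRI_jet_num at 22, with the id and label columns already removed) and the function name `split_indices` are hypothetical. The -999.0 marker for undefined values follows the dataset description.

```python
import numpy as np

MASS_COL = 0        # assumed column index of DER_mass_MMC
JET_COL = 22        # assumed column index of PRI_jet_num (values 0 to 3)
UNDEFINED = -999.0  # marker used in the dataset for undefined values


def split_indices(x):
    """Return {subset name: row indices} for the 8 assumed subsets."""
    has_mass = x[:, MASS_COL] != UNDEFINED
    jet = x[:, JET_COL].astype(int)
    groups = {}
    for mass_flag in (True, False):
        for j in range(4):
            name = f"{'mass' if mass_flag else 'no_mass'}_jet{j}"
            groups[name] = np.flatnonzero((has_mass == mass_flag) & (jet == j))
    return groups


# Toy feature matrix (hypothetical): 16 events, 30 features
x = np.zeros((16, 30))
x[:, JET_COL] = np.arange(16) % 4
x[:8, MASS_COL] = UNDEFINED   # first half: mass undefined
x[8:, MASS_COL] = 125.0       # second half: mass defined

groups = split_indices(x)
```

Each entry of `groups` can then be used to index both the features and the labels, so that one sub-model is trained per subset.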

Instructions to run

Move to the root folder and execute:

python run.py

Make sure all the requirements are installed and the data folder is in the root. Be aware that training the models for 1000 epochs takes around 5 min on an Apple silicon M1 Pro. The best model here was trained for 15000 epochs.

If you want to run the cross-validation, move to the root folder and execute:

python optimization.py

Here the cross-validation took around 1 h for one sub-model (on an Apple silicon M1 Pro), and therefore around 8 hours for the whole model.
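For readers who want an idea of what such a search involves, here is a minimal sketch of k-fold cross-validation over a polynomial degree and a regularization strength lambda_. It is not the code in optimization.py: the helper names, the number of folds, the grid values, the step size and the toy data are assumptions chosen so that the example runs in well under a second.

```python
import numpy as np


def build_poly(x, degree):
    """Expand each feature into powers 1..degree and prepend a bias column."""
    powers = [x ** d for d in range(1, degree + 1)]
    return np.hstack([np.ones((x.shape[0], 1))] + powers)


def k_fold_indices(n, k, seed=0):
    """Shuffle the row indices and cut them into k nearly equal folds."""
    return np.array_split(np.random.default_rng(seed).permutation(n), k)


def fit_logistic(y, tx, lambda_, max_iters=500, gamma=1.0):
    """Gradient descent on the L2-regularized logistic loss (y in {0, 1}).

    gamma=1.0 suits the toy features in [-1, 1]; real data needs tuning.
    """
    w = np.zeros(tx.shape[1])
    for _ in range(max_iters):
        p = 0.5 * (1.0 + np.tanh(0.5 * (tx @ w)))  # stable sigmoid
        w -= gamma * (tx.T @ (p - y) / len(y) + 2.0 * lambda_ * w)
    return w


def cv_accuracy(y, x, degree, lambda_, k=4):
    """Mean validation accuracy over k folds for one (degree, lambda_) pair."""
    folds = k_fold_indices(len(y), k)
    scores = []
    for i, val in enumerate(folds):
        train = np.concatenate([f for j, f in enumerate(folds) if j != i])
        w = fit_logistic(y[train], build_poly(x[train], degree), lambda_)
        pred = build_poly(x[val], degree) @ w > 0
        scores.append(np.mean(pred == (y[val] == 1)))
    return float(np.mean(scores))


def grid_search(y, x, degrees, lambdas, k=4):
    """Return the (degree, lambda_) pair with the best mean accuracy."""
    results = {(d, l): cv_accuracy(y, x, d, l, k) for d in degrees for l in lambdas}
    best = max(results, key=results.get)
    return best, results[best]


# Toy data (hypothetical): the label depends on x squared, so degree 1 cannot fit it
rng = np.random.default_rng(1)
x = rng.uniform(-1.0, 1.0, size=(400, 1))
y = (x[:, 0] ** 2 > 0.25).astype(float)

best, best_acc = grid_search(y, x, degrees=[1, 2, 3], lambdas=[1e-4, 1e-2])
```

The cost grows with the number of folds, grid points and training iterations, which is consistent with the long runtimes reported above for the real dataset.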

If you want to visualize the performance of the model during training, move to the root folder and execute:

python plot_performance.py

Results

The performance of the models is assessed on AIcrowd from data/submission.csv generated by run.py. The model achieves a global accuracy of 0.818 with an F1-score of 0.722.
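The two metrics quoted above can be computed locally on a held-out labeled set. The sketch below shows the standard definitions. The function name `accuracy_and_f1`, the {-1, 1} label encoding and the toy labels are assumptions for illustration, and the platform's exact scoring code is not reproduced here.

```python
import numpy as np


def accuracy_and_f1(y_true, y_pred, positive=1):
    """Return (accuracy, F1) with F1 computed for the `positive` class."""
    y_true = np.asarray(y_true)
    y_pred = np.asarray(y_pred)
    tp = np.sum((y_pred == positive) & (y_true == positive))  # true positives
    fp = np.sum((y_pred == positive) & (y_true != positive))  # false positives
    fn = np.sum((y_pred != positive) & (y_true == positive))  # false negatives
    accuracy = float(np.mean(y_true == y_pred))
    denom = 2 * tp + fp + fn
    f1 = float(2 * tp / denom) if denom > 0 else 0.0
    return accuracy, f1


# Toy labels (hypothetical): 1 = signal, -1 = background
y_true = [1, 1, 1, -1, -1, -1, -1, -1]
y_pred = [1, 1, -1, 1, -1, -1, -1, -1]
acc, f1 = accuracy_and_f1(y_true, y_pred)  # acc = 6/8, f1 = 4/6
```

F1 is the harmonic mean of precision and recall on the positive class, so it can be noticeably lower than accuracy when the classes are imbalanced.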

Here is the performance of each sub-model during training:

[Figures: training performance plots of the sub-models; see the figures/ folder]
