Note: sometimes your answer doesn't match one of the options exactly. That's fine. Select the option that's closest to your solution.
Solution: homework.ipynb
In this homework, we will use the Students Performance in 2024 JAMB dataset from Kaggle.
Here's a wget-able link:
wget https://github.com/alexeygrigorev/datasets/raw/refs/heads/master/jamb_exam_results.csv
The goal of this homework is to create a regression model for predicting the performance of students on a standardized test (column `JAMB_Score`).
First, let's make the column names lowercase:
df.columns = df.columns.str.lower().str.replace(' ', '_')
Preparation:

- Remove the `student_id` column.
- Fill missing values with zeros.
- Do train/validation/test split with 60%/20%/20% distribution.
- Use the `train_test_split` function and set the `random_state` parameter to 1.
- Use `DictVectorizer(sparse=True)` to turn the dataframes into matrices.
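A minimal sketch of these preparation steps, assuming the CSV was downloaded with the `wget` command above (the variable names are illustrative, not prescribed by the homework):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer

# Load the dataset and normalize the column names
df = pd.read_csv('jamb_exam_results.csv')
df.columns = df.columns.str.lower().str.replace(' ', '_')

# Remove the student_id column and fill missing values with zeros
df = df.drop(columns=['student_id'])
df = df.fillna(0)

# 60%/20%/20% train/validation/test split
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)

# Separate the target variable
y_train = df_train.pop('jamb_score').values
y_val = df_val.pop('jamb_score').values
y_test = df_test.pop('jamb_score').values

# Turn the dataframes into sparse feature matrices
dv = DictVectorizer(sparse=True)
X_train = dv.fit_transform(df_train.to_dict(orient='records'))
X_val = dv.transform(df_val.to_dict(orient='records'))
X_test = dv.transform(df_test.to_dict(orient='records'))
```

Note that the second split uses `test_size=0.25`: 25% of the remaining 80% gives the 20% validation share.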
Let's train a decision tree regressor to predict the `jamb_score` variable.

- Train a model with `max_depth=1`.

Which feature is used for splitting the data?

- `study_hours_per_week`
- `attendance_rate`
- `teacher_quality`
- `distance_to_school`
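A sketch of this step, assuming the `X_train`, `y_train`, and `dv` objects from the preparation sketch above (the question only requires `max_depth=1`; everything else here is for inspection):

```python
from sklearn.tree import DecisionTreeRegressor, export_text

# A tree with max_depth=1 is a "stump": it makes a single split
dt = DecisionTreeRegressor(max_depth=1)
dt.fit(X_train, y_train)

# Print the tree to see which feature the root split uses
print(export_text(dt, feature_names=list(dv.get_feature_names_out())))
```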
Train a random forest regressor with these parameters:

- `n_estimators=10`
- `random_state=1`
- `n_jobs=-1` (optional, to make training faster)
What's the RMSE of this model on the validation data?
- 22.13
- 42.13
- 62.13
- 82.12
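A possible way to train and evaluate this model, reusing the matrices from the preparation sketch (RMSE is computed with `numpy` here to stay compatible with older scikit-learn versions):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rf = RandomForestRegressor(n_estimators=10, random_state=1, n_jobs=-1)
rf.fit(X_train, y_train)

# RMSE on the validation set
y_pred = rf.predict(X_val)
rmse = np.sqrt(mean_squared_error(y_val, y_pred))
print(round(rmse, 2))
```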
Now let's experiment with the `n_estimators` parameter:

- Try different values of this parameter from 10 to 200 with step 10.
- Set `random_state` to `1`.
- Evaluate the model on the validation dataset.

After which value of `n_estimators` does RMSE stop improving? Consider 3 decimal places for calculating the answer.
- 10
- 25
- 80
- 200
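A sketch of this sweep, assuming the same training and validation matrices as above:

```python
# Train a forest for each value of n_estimators and record the validation RMSE
scores = []
for n in range(10, 201, 10):
    rf = RandomForestRegressor(n_estimators=n, random_state=1, n_jobs=-1)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_val)
    scores.append((n, round(np.sqrt(mean_squared_error(y_val, y_pred)), 3)))

# Look for the point after which RMSE no longer decreases
for n, rmse in scores:
    print(n, rmse)
```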
Let's select the best `max_depth`:

- Try different values of `max_depth`: `[10, 15, 20, 25]`
- For each of these values:
  - try different values of `n_estimators` from 10 till 200 (with step 10)
  - calculate the mean RMSE
- Fix the random seed: `random_state=1`

What's the best `max_depth`, using the mean RMSE?
- 10
- 15
- 20
- 25
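One way to run this grid, again a sketch on top of the earlier variables:

```python
# For each max_depth, average the validation RMSE over all n_estimators values
mean_rmse_by_depth = {}
for depth in [10, 15, 20, 25]:
    rmses = []
    for n in range(10, 201, 10):
        rf = RandomForestRegressor(n_estimators=n, max_depth=depth,
                                   random_state=1, n_jobs=-1)
        rf.fit(X_train, y_train)
        y_pred = rf.predict(X_val)
        rmses.append(np.sqrt(mean_squared_error(y_val, y_pred)))
    mean_rmse_by_depth[depth] = np.mean(rmses)

# The best max_depth is the one with the lowest mean RMSE
print(mean_rmse_by_depth)
```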
We can extract feature importance information from tree-based models.
At each step of the decision tree learning algorithm, it finds the best split. When doing this, we can calculate the "gain": the reduction in impurity before and after the split. This gain is useful for understanding which features matter most to tree-based models.

In Scikit-Learn, tree-based models expose this information in the `feature_importances_` field.
For this homework question, we'll find the most important feature:
- Train the model with these parameters: `n_estimators=10`, `max_depth=20`, `random_state=1`, `n_jobs=-1` (optional)
- Get the feature importance information from this model

What's the most important feature (among these 4)?

- `study_hours_per_week`
- `attendance_rate`
- `distance_to_school`
- `teacher_quality`
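A sketch of how the importances could be inspected (the sorting is just for readability):

```python
rf = RandomForestRegressor(n_estimators=10, max_depth=20,
                           random_state=1, n_jobs=-1)
rf.fit(X_train, y_train)

# Pair each feature name with its importance and sort in descending order
importances = sorted(
    zip(dv.get_feature_names_out(), rf.feature_importances_),
    key=lambda item: item[1],
    reverse=True,
)
for name, importance in importances:
    print(f'{name}: {importance:.4f}')
```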
Now let's train an XGBoost model! For this question, we'll tune the `eta` parameter:
- Install XGBoost
- Create DMatrix for train and validation
- Create a watchlist
- Train a model with these parameters for 100 rounds:
xgb_params = {
'eta': 0.3,
'max_depth': 6,
'min_child_weight': 1,
'objective': 'reg:squarederror',
'nthread': 8,
'seed': 1,
'verbosity': 1,
}
Now change `eta` from 0.3 to 0.1.

Which `eta` leads to the best RMSE score on the validation dataset?
- 0.3
- 0.1
- Both give equal value
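A sketch of the XGBoost setup, assuming the matrices and `DictVectorizer` from the preparation step (the watchlist prints the validation RMSE during training):

```python
import xgboost as xgb

# Build DMatrix objects for train and validation
features = list(dv.get_feature_names_out())
dtrain = xgb.DMatrix(X_train, label=y_train, feature_names=features)
dval = xgb.DMatrix(X_val, label=y_val, feature_names=features)

watchlist = [(dtrain, 'train'), (dval, 'val')]

xgb_params = {
    'eta': 0.3,  # rerun with 0.1 for the comparison
    'max_depth': 6,
    'min_child_weight': 1,
    'objective': 'reg:squarederror',
    'nthread': 8,
    'seed': 1,
    'verbosity': 1,
}

model = xgb.train(xgb_params, dtrain, num_boost_round=100,
                  evals=watchlist, verbose_eval=10)
# Compare the best validation RMSE printed for eta=0.3 and eta=0.1
```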
- Submit your results here: https://courses.datatalks.club/ml-zoomcamp-2024/homework/hw06
- If your answer doesn't match options exactly, select the closest one