Note: sometimes your answer doesn't match one of the options exactly. That's fine. Select the option that's closest to your solution.
Solution: homework.ipynb
In this homework, we will use the Students Performance in 2024 JAMB dataset from Kaggle.
Here's a wget-able link:
wget https://github.com/alexeygrigorev/datasets/raw/refs/heads/master/jamb_exam_results.csv
The goal of this homework is to create a regression model for predicting the performance of students on a standardized test (column `JAMB_Score`).
First, let's make the column names lowercase:
df.columns = df.columns.str.lower().str.replace(' ', '_')
Preparation:

- Remove the `student_id` column.
- Fill missing values with zeros.
- Do train/validation/test split with 60%/20%/20% distribution.
- Use the `train_test_split` function and set the `random_state` parameter to 1.
- Use `DictVectorizer(sparse=True)` to turn the dataframes into matrices.
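A minimal sketch of these preparation steps, assuming the CSV was downloaded with the `wget` command above (the variable names are illustrative, not prescribed by the homework):

```python
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer

# Load the dataset and normalize the column names
df = pd.read_csv('jamb_exam_results.csv')
df.columns = df.columns.str.lower().str.replace(' ', '_')

# Remove the student_id column and fill missing values with zeros
df = df.drop(columns=['student_id'])
df = df.fillna(0)

# 60%/20%/20% train/validation/test split
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)

# Separate the target variable
y_train = df_train.pop('jamb_score').values
y_val = df_val.pop('jamb_score').values
y_test = df_test.pop('jamb_score').values

# Turn the dataframes into sparse feature matrices
dv = DictVectorizer(sparse=True)
X_train = dv.fit_transform(df_train.to_dict(orient='records'))
X_val = dv.transform(df_val.to_dict(orient='records'))
X_test = dv.transform(df_test.to_dict(orient='records'))
```

Note that the second split uses `test_size=0.25`: 25% of the remaining 80% gives the 20% validation share.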
Let's train a decision tree regressor to predict the `jamb_score` variable.

- Train a model with `max_depth=1`.

Which feature is used for splitting the data?

- `study_hours_per_week`
- `attendance_rate`
- `teacher_quality`
- `distance_to_school`
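A sketch of this step, assuming the `X_train`, `y_train`, and `dv` objects from the preparation sketch above (the question only requires `max_depth=1`; everything else here is for inspection):

```python
from sklearn.tree import DecisionTreeRegressor, export_text

# A tree with max_depth=1 is a "stump": it makes a single split
dt = DecisionTreeRegressor(max_depth=1)
dt.fit(X_train, y_train)

# Print the tree to see which feature the root split uses
print(export_text(dt, feature_names=list(dv.get_feature_names_out())))
```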
Train a random forest regressor with these parameters:

- `n_estimators=10`
- `random_state=1`
- `n_jobs=-1` (optional, to make training faster)
What's the RMSE of this model on the validation data?
- 22.13
- 42.13
- 62.13
- 82.12
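A possible way to train and evaluate this model, reusing the matrices from the preparation sketch (RMSE is computed with `numpy` here to stay compatible with older scikit-learn versions):

```python
import numpy as np
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rf = RandomForestRegressor(n_estimators=10, random_state=1, n_jobs=-1)
rf.fit(X_train, y_train)

# RMSE on the validation set
y_pred = rf.predict(X_val)
rmse = np.sqrt(mean_squared_error(y_val, y_pred))
print(round(rmse, 2))
```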
Now let's experiment with the `n_estimators` parameter:

- Try different values of this parameter from 10 to 200 with step 10.
- Set `random_state` to `1`.
- Evaluate the model on the validation dataset.

After which value of `n_estimators` does RMSE stop improving? Consider 3 decimal places for calculating the answer.
- 10
- 25
- 80
- 200
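A sketch of this sweep, assuming the same training and validation matrices as above:

```python
# Train a forest for each value of n_estimators and record the validation RMSE
scores = []
for n in range(10, 201, 10):
    rf = RandomForestRegressor(n_estimators=n, random_state=1, n_jobs=-1)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_val)
    scores.append((n, round(np.sqrt(mean_squared_error(y_val, y_pred)), 3)))

# Look for the point after which RMSE no longer decreases
for n, rmse in scores:
    print(n, rmse)
```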
Let's select the best `max_depth`:

- Try different values of `max_depth`: `[10, 15, 20, 25]`
- For each of these values:
  - try different values of `n_estimators` from 10 till 200 (with step 10)
  - calculate the mean RMSE
- Fix the random seed: `random_state=1`

What's the best `max_depth`, using the mean RMSE?
- 10
- 15
- 20
- 25
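One way to run this grid, again a sketch on top of the earlier variables:

```python
# For each max_depth, average the validation RMSE over all n_estimators values
mean_rmse_by_depth = {}
for depth in [10, 15, 20, 25]:
    rmses = []
    for n in range(10, 201, 10):
        rf = RandomForestRegressor(n_estimators=n, max_depth=depth,
                                   random_state=1, n_jobs=-1)
        rf.fit(X_train, y_train)
        y_pred = rf.predict(X_val)
        rmses.append(np.sqrt(mean_squared_error(y_val, y_pred)))
    mean_rmse_by_depth[depth] = np.mean(rmses)

# The best max_depth is the one with the lowest mean RMSE
print(mean_rmse_by_depth)
```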
We can extract feature importance information from tree-based models.
At each step of the decision tree learning algorithm, it finds the best split. When doing this, we can calculate the "gain": the reduction in impurity before and after the split. This gain is useful for understanding which features matter most to tree-based models.

In Scikit-Learn, tree-based models expose this information in the `feature_importances_` field.
For this homework question, we'll find the most important feature:
- Train the model with these parameters: `n_estimators=10`, `max_depth=20`, `random_state=1`, `n_jobs=-1` (optional)
- Get the feature importance information from this model

What's the most important feature (among these 4)?

- `study_hours_per_week`
- `attendance_rate`
- `distance_to_school`
- `teacher_quality`
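A sketch of how the importances could be inspected (the sorting is just for readability):

```python
rf = RandomForestRegressor(n_estimators=10, max_depth=20,
                           random_state=1, n_jobs=-1)
rf.fit(X_train, y_train)

# Pair each feature name with its importance and sort in descending order
importances = sorted(
    zip(dv.get_feature_names_out(), rf.feature_importances_),
    key=lambda item: item[1],
    reverse=True,
)
for name, importance in importances:
    print(f'{name}: {importance:.4f}')
```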
Now let's train an XGBoost model! For this question, we'll tune the `eta` parameter:
- Install XGBoost
- Create DMatrix for train and validation
- Create a watchlist
- Train a model with these parameters for 100 rounds:
xgb_params = {
'eta': 0.3,
'max_depth': 6,
'min_child_weight': 1,
'objective': 'reg:squarederror',
'nthread': 8,
'seed': 1,
'verbosity': 1,
}
Now change `eta` from 0.3 to 0.1.

Which `eta` leads to the best RMSE score on the validation dataset?
- 0.3
- 0.1
- Both give equal value
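A sketch of the XGBoost setup, assuming the matrices and `DictVectorizer` from the preparation step (the watchlist prints the validation RMSE during training):

```python
import xgboost as xgb

# Build DMatrix objects for train and validation
features = list(dv.get_feature_names_out())
dtrain = xgb.DMatrix(X_train, label=y_train, feature_names=features)
dval = xgb.DMatrix(X_val, label=y_val, feature_names=features)

watchlist = [(dtrain, 'train'), (dval, 'val')]

xgb_params = {
    'eta': 0.3,  # rerun with 0.1 for the comparison
    'max_depth': 6,
    'min_child_weight': 1,
    'objective': 'reg:squarederror',
    'nthread': 8,
    'seed': 1,
    'verbosity': 1,
}

model = xgb.train(xgb_params, dtrain, num_boost_round=100,
                  evals=watchlist, verbose_eval=10)
# Compare the best validation RMSE printed for eta=0.3 and eta=0.1
```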
- Submit your results here: https://courses.datatalks.club/ml-zoomcamp-2024/homework/hw06
- If your answer doesn't match options exactly, select the closest one