The goal of this homework is to create a tree-based regression model for predicting apartment prices (column `'price'`).
In this homework we'll again use the New York City Airbnb Open Data dataset - the same one we used in homeworks 2 and 3.
You can take it from Kaggle or download it from here if you don't want to sign up to Kaggle.
For this homework, we prepared a starter notebook.
- Use only the following columns: `'neighbourhood_group'`, `'room_type'`, `'latitude'`, `'longitude'`, `'minimum_nights'`, `'number_of_reviews'`, `'reviews_per_month'`, `'calculated_host_listings_count'`, `'availability_365'`, `'price'`
- Fill NAs with 0
- Apply the log transform to `price`
- Do the train/validation/test split with a 60%/20%/20% distribution
- Use the `train_test_split` function and set the `random_state` parameter to 1
- Use `DictVectorizer` to turn the dataframe into matrices
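A minimal sketch of these preparation steps might look like this (the file name `AB_NYC_2019.csv` is an assumption - adjust the path to wherever you saved the dataset):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer

columns = [
    'neighbourhood_group', 'room_type', 'latitude', 'longitude',
    'minimum_nights', 'number_of_reviews', 'reviews_per_month',
    'calculated_host_listings_count', 'availability_365', 'price',
]

df = pd.read_csv('AB_NYC_2019.csv')[columns].fillna(0)
df['price'] = np.log1p(df['price'])  # log transform of the target

# 60%/20%/20%: peel off 20% for test, then 25% of the remainder for validation
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)

# pop() removes the target column from the dataframe and returns it as a Series
y_train = df_train.pop('price').values
y_val = df_val.pop('price').values
y_test = df_test.pop('price').values

dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(df_train.to_dict(orient='records'))
X_val = dv.transform(df_val.to_dict(orient='records'))
X_test = dv.transform(df_test.to_dict(orient='records'))
```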
Let's train a decision tree regressor to predict the `price` variable.
- Train a model with `max_depth=1`

Which feature is used for splitting the data?
- `room_type`
- `neighbourhood_group`
- `number_of_reviews`
- `reviews_per_month`
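One way to check this is to print the tree, reusing the `X_train`, `y_train`, and `dv` objects from the preparation sketch above (`get_feature_names_out` requires scikit-learn 1.0+; older versions use `get_feature_names`):

```python
from sklearn.tree import DecisionTreeRegressor, export_text

dt = DecisionTreeRegressor(max_depth=1)
dt.fit(X_train, y_train)

# with max_depth=1 the tree has a single split; export_text shows
# which feature and threshold it uses
print(export_text(dt, feature_names=list(dv.get_feature_names_out())))
```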
Train a random forest model with these parameters:
- `n_estimators=10`
- `random_state=1`
- `n_jobs=-1` (optional, to make training faster)
What's the RMSE of this model on validation?
- 0.059
- 0.259
- 0.459
- 0.659
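A sketch of this step, reusing the matrices from above:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rf = RandomForestRegressor(n_estimators=10, random_state=1, n_jobs=-1)
rf.fit(X_train, y_train)

# RMSE on the validation set
y_pred = rf.predict(X_val)
rmse = np.sqrt(mean_squared_error(y_val, y_pred))
print(round(rmse, 3))
```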
Now let's experiment with the `n_estimators` parameter:
- Try different values of this parameter from 10 to 200 with step 10
- Set `random_state` to 1
- Evaluate the model on the validation dataset
After which value of `n_estimators` does RMSE stop improving?
- 10
- 50
- 70
- 120
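A straightforward (if not the fastest) way to run this experiment is a loop, assuming the same train/validation matrices as above:

```python
scores = []
for n in range(10, 201, 10):
    rf = RandomForestRegressor(n_estimators=n, random_state=1, n_jobs=-1)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_val)
    scores.append((n, np.sqrt(mean_squared_error(y_val, y_pred))))

for n, rmse in scores:
    print(n, round(rmse, 3))
```

Look for the point after which the printed RMSE no longer decreases.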
Let's select the best `max_depth`:
- Try different values of `max_depth`: `[10, 15, 20, 25]`
- For each of these values, try different values of `n_estimators` from 10 to 200 (with step 10)
- Fix the random seed: `random_state=1`
What's the best `max_depth`?
- 10
- 15
- 20
- 25
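A nested loop works here. The question leaves the comparison criterion open; averaging the RMSE over all `n_estimators` values for each depth is one reasonable choice (comparing the best RMSE per depth is another):

```python
results = {}
for depth in [10, 15, 20, 25]:
    for n in range(10, 201, 10):
        rf = RandomForestRegressor(n_estimators=n, max_depth=depth,
                                   random_state=1, n_jobs=-1)
        rf.fit(X_train, y_train)
        y_pred = rf.predict(X_val)
        results[(depth, n)] = np.sqrt(mean_squared_error(y_val, y_pred))

# mean RMSE across all n_estimators values, per depth
for depth in [10, 15, 20, 25]:
    mean_rmse = np.mean([v for (d, _), v in results.items() if d == depth])
    print(depth, round(mean_rmse, 3))
```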
Bonus question (not graded):
Will the answer be different if we change the seed for the model?
We can extract feature importance information from tree-based models.
At each step of the decision tree learning algorithm, it finds the best split. When doing this, we can calculate the "gain" - the reduction in impurity before and after the split. This gain is quite useful for understanding which features are important for tree-based models.
In Scikit-Learn, tree-based models contain this information in the `feature_importances_` field.
For this homework question, we'll find the most important feature:
- Train the model with these parameters: `n_estimators=10`, `max_depth=20`, `random_state=1`, `n_jobs=-1` (optional)
- Get the feature importance information from this model
What's the most important feature?
- `neighbourhood_group=Manhattan`
- `room_type=Entire home/apt`
- `longitude`
- `latitude`
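A sketch, pairing the importances with the feature names from the `DictVectorizer`:

```python
rf = RandomForestRegressor(n_estimators=10, max_depth=20,
                           random_state=1, n_jobs=-1)
rf.fit(X_train, y_train)

# sort features by importance, highest first
importances = sorted(
    zip(dv.get_feature_names_out(), rf.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, value in importances[:5]:
    print(f'{name}: {value:.3f}')
```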
Now let's train an XGBoost model! For this question, we'll tune the `eta` parameter.
- Install XGBoost
- Create DMatrix for train and validation
- Create a watchlist
- Train a model with these parameters for 100 rounds:
```python
xgb_params = {
    'eta': 0.3,
    'max_depth': 6,
    'min_child_weight': 1,
    'objective': 'reg:squarederror',
    'nthread': 8,
    'seed': 1,
    'verbosity': 1,
}
```
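A sketch of the training setup, assuming the matrices from the preparation step (feature names are omitted from the `DMatrix` for simplicity; they are optional):

```python
import xgboost as xgb

dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_val, label=y_val)

# the watchlist makes xgb.train report train and validation RMSE as it goes
watchlist = [(dtrain, 'train'), (dval, 'val')]

model = xgb.train(xgb_params, dtrain, num_boost_round=100,
                  evals=watchlist, verbose_eval=10)
```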
Now change `eta` first to 0.1 and then to 0.01.
Which `eta` leads to the best RMSE score on the validation dataset?
- 0.3
- 0.1
- 0.01
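One way to compare the three values, reusing the `DMatrix` objects, `watchlist`, and `xgb_params` from above:

```python
for eta in [0.3, 0.1, 0.01]:
    xgb_params['eta'] = eta
    model = xgb.train(xgb_params, dtrain, num_boost_round=100,
                      evals=watchlist, verbose_eval=False)
    y_pred = model.predict(dval)
    print(eta, round(np.sqrt(mean_squared_error(y_val, y_pred)), 3))
```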
Submit your results here: https://forms.gle/wQgFkYE6CtdDed4w8
It's possible that your answers won't match exactly. If it's the case, select the closest one.
The deadline for submitting is 20 October 2021, 17:00 CET (Wednesday). After that, the form will be closed.