The goal of this homework is to create a tree-based regression model for predicting apartment prices (column `'price'`).
In this homework we'll again use the New York City Airbnb Open Data dataset - the same one we used in homeworks 2 and 3.
You can take it from Kaggle or download it from here if you don't want to sign up to Kaggle.
For this homework, we prepared a starter notebook.
- Use only the following columns: `'neighbourhood_group'`, `'room_type'`, `'latitude'`, `'longitude'`, `'minimum_nights'`, `'number_of_reviews'`, `'reviews_per_month'`, `'calculated_host_listings_count'`, `'availability_365'`, `'price'`
- Fill NAs with 0
- Apply the log transform to `price`
- Do the train/validation/test split with a 60%/20%/20% distribution
- Use the `train_test_split` function and set the `random_state` parameter to 1
- Use `DictVectorizer` to turn the dataframe into matrices
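A minimal sketch of these preparation steps might look like this (the file name `AB_NYC_2019.csv` is an assumption - adjust the path to wherever you saved the dataset):

```python
import numpy as np
import pandas as pd
from sklearn.model_selection import train_test_split
from sklearn.feature_extraction import DictVectorizer

columns = [
    'neighbourhood_group', 'room_type', 'latitude', 'longitude',
    'minimum_nights', 'number_of_reviews', 'reviews_per_month',
    'calculated_host_listings_count', 'availability_365', 'price',
]

df = pd.read_csv('AB_NYC_2019.csv')[columns].fillna(0)
df['price'] = np.log1p(df['price'])  # log transform of the target

# 60%/20%/20%: peel off 20% for test, then 25% of the remainder for validation
df_full_train, df_test = train_test_split(df, test_size=0.2, random_state=1)
df_train, df_val = train_test_split(df_full_train, test_size=0.25, random_state=1)

# pop() removes the target column from the dataframe and returns it as a Series
y_train = df_train.pop('price').values
y_val = df_val.pop('price').values
y_test = df_test.pop('price').values

dv = DictVectorizer(sparse=False)
X_train = dv.fit_transform(df_train.to_dict(orient='records'))
X_val = dv.transform(df_val.to_dict(orient='records'))
X_test = dv.transform(df_test.to_dict(orient='records'))
```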
Let's train a decision tree regressor to predict the `price` variable.
- Train a model with `max_depth=1`

Which feature is used for splitting the data?
- `room_type`
- `neighbourhood_group`
- `number_of_reviews`
- `reviews_per_month`
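One way to check this is to print the tree, reusing the `X_train`, `y_train`, and `dv` objects from the preparation sketch above (`get_feature_names_out` requires scikit-learn 1.0+; older versions use `get_feature_names`):

```python
from sklearn.tree import DecisionTreeRegressor, export_text

dt = DecisionTreeRegressor(max_depth=1)
dt.fit(X_train, y_train)

# with max_depth=1 the tree has a single split; export_text shows
# which feature and threshold it uses
print(export_text(dt, feature_names=list(dv.get_feature_names_out())))
```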
Train a random forest model with these parameters:
- `n_estimators=10`
- `random_state=1`
- `n_jobs=-1` (optional, to make training faster)
What's the RMSE of this model on validation?
- 0.059
- 0.259
- 0.459
- 0.659
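A sketch of this step, reusing the matrices from above:

```python
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import mean_squared_error

rf = RandomForestRegressor(n_estimators=10, random_state=1, n_jobs=-1)
rf.fit(X_train, y_train)

# RMSE on the validation set
y_pred = rf.predict(X_val)
rmse = np.sqrt(mean_squared_error(y_val, y_pred))
print(round(rmse, 3))
```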
Now let's experiment with the `n_estimators` parameter:
- Try different values of this parameter from 10 to 200 with step 10
- Set `random_state` to 1
- Evaluate the model on the validation dataset
After which value of `n_estimators` does RMSE stop improving?
- 10
- 50
- 70
- 120
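A straightforward (if not the fastest) way to run this experiment is a loop, assuming the same train/validation matrices as above:

```python
scores = []
for n in range(10, 201, 10):
    rf = RandomForestRegressor(n_estimators=n, random_state=1, n_jobs=-1)
    rf.fit(X_train, y_train)
    y_pred = rf.predict(X_val)
    scores.append((n, np.sqrt(mean_squared_error(y_val, y_pred))))

for n, rmse in scores:
    print(n, round(rmse, 3))
```

Look for the point after which the printed RMSE no longer decreases.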
Let's select the best `max_depth`:
- Try different values of `max_depth`: `[10, 15, 20, 25]`
- For each of these values, try different values of `n_estimators` from 10 to 200 (with step 10)
- Fix the random seed: `random_state=1`
What's the best `max_depth`?
- 10
- 15
- 20
- 25
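A nested loop works here. The question leaves the comparison criterion open; averaging the RMSE over all `n_estimators` values for each depth is one reasonable choice (comparing the best RMSE per depth is another):

```python
results = {}
for depth in [10, 15, 20, 25]:
    for n in range(10, 201, 10):
        rf = RandomForestRegressor(n_estimators=n, max_depth=depth,
                                   random_state=1, n_jobs=-1)
        rf.fit(X_train, y_train)
        y_pred = rf.predict(X_val)
        results[(depth, n)] = np.sqrt(mean_squared_error(y_val, y_pred))

# mean RMSE across all n_estimators values, per depth
for depth in [10, 15, 20, 25]:
    mean_rmse = np.mean([v for (d, _), v in results.items() if d == depth])
    print(depth, round(mean_rmse, 3))
```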
Bonus question (not graded):
Will the answer be different if we change the seed for the model?
We can extract feature importance information from tree-based models.
At each step of the decision tree learning algorithm, it finds the best split. When doing this, we can calculate the "gain" - the reduction in impurity before and after the split. This gain is quite useful for understanding which features are important for tree-based models.
In Scikit-Learn, tree-based models contain this information in the `feature_importances_` field.
For this homework question, we'll find the most important feature:
- Train the model with these parameters: `n_estimators=10`, `max_depth=20`, `random_state=1`, `n_jobs=-1` (optional)
- Get the feature importance information from this model
What's the most important feature?
- `neighbourhood_group=Manhattan`
- `room_type=Entire home/apt`
- `longitude`
- `latitude`
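A sketch, pairing the importances with the feature names from the `DictVectorizer`:

```python
rf = RandomForestRegressor(n_estimators=10, max_depth=20,
                           random_state=1, n_jobs=-1)
rf.fit(X_train, y_train)

# sort features by importance, highest first
importances = sorted(
    zip(dv.get_feature_names_out(), rf.feature_importances_),
    key=lambda pair: pair[1],
    reverse=True,
)
for name, value in importances[:5]:
    print(f'{name}: {value:.3f}')
```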
Now let's train an XGBoost model! For this question, we'll tune the `eta` parameter.
- Install XGBoost
- Create DMatrix for train and validation
- Create a watchlist
- Train a model with these parameters for 100 rounds:
```python
xgb_params = {
    'eta': 0.3,
    'max_depth': 6,
    'min_child_weight': 1,
    'objective': 'reg:squarederror',
    'nthread': 8,
    'seed': 1,
    'verbosity': 1,
}
```
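A sketch of the training setup, assuming the matrices from the preparation step (feature names are omitted from the `DMatrix` for simplicity; they are optional):

```python
import xgboost as xgb

dtrain = xgb.DMatrix(X_train, label=y_train)
dval = xgb.DMatrix(X_val, label=y_val)

# the watchlist makes xgb.train report train and validation RMSE as it goes
watchlist = [(dtrain, 'train'), (dval, 'val')]

model = xgb.train(xgb_params, dtrain, num_boost_round=100,
                  evals=watchlist, verbose_eval=10)
```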
Now change `eta` first to 0.1 and then to 0.01.
Which `eta` leads to the best RMSE score on the validation dataset?
- 0.3
- 0.1
- 0.01
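One way to compare the three values, reusing the `DMatrix` objects, `watchlist`, and `xgb_params` from above:

```python
for eta in [0.3, 0.1, 0.01]:
    xgb_params['eta'] = eta
    model = xgb.train(xgb_params, dtrain, num_boost_round=100,
                      evals=watchlist, verbose_eval=False)
    y_pred = model.predict(dval)
    print(eta, round(np.sqrt(mean_squared_error(y_val, y_pred)), 3))
```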
Submit your results here: https://forms.gle/wQgFkYE6CtdDed4w8
It's possible that your answers won't match exactly. If it's the case, select the closest one.
The deadline for submitting is 20 October 2021, 17:00 CET (Wednesday). After that, the form will be closed.