---
title: "Tutorial: Random Forest Regression in R"
output:
  html_document:
    df_print: paged
    toc: true
    toc_float: true
    fig_width: 7
    fig_height: 7
---
By Ángela Jiang-Wang
R version used: 4.1.1 (2021-08-10) – “Kick Things”
Last updated: January 2022
*You can reach out to me on [twitter](https://twitter.com/angyjiwa) if you have any doubts or encounter any problems with this tutorial (general feedback is also welcome). Hope you find it helpful!*
# Introduction and dataset description
For this tutorial, I will be using a public dataset downloaded from UCI Machine Learning Repository called ["Student Performance - Mathematics"](https://archive.ics.uci.edu/ml/datasets/Student+Performance). You don't need to download the dataset to replicate the steps, as you will see below.
This dataset contains 33 attributes collected during the 2005-2006 school year through school reports and questionnaires in secondary education at two Portuguese schools.
The original authors provide a detailed description of the variables on the webpage, which I will copy here:
1. school - student's school (binary: 'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira)
2. sex - student's sex (binary: 'F' - female or 'M' - male)
3. age - student's age (numeric: from 15 to 22)
4. address - student's home address type (binary: 'U' - urban or 'R' - rural)
5. famsize - family size (binary: 'LE3' - less or equal to 3 or 'GT3' - greater than 3)
6. Pstatus - parent's cohabitation status (binary: 'T' - living together or 'A' - apart)
7. Medu - mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
8. Fedu - father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
9. Mjob - mother's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
10. Fjob - father's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
11. reason - reason to choose this school (nominal: close to 'home', school 'reputation', 'course' preference or 'other')
12. guardian - student's guardian (nominal: 'mother', 'father' or 'other')
13. traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
14. studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
15. failures - number of past class failures (numeric: n if 1<=n<3, else 4)
16. schoolsup - extra educational support (binary: yes or no)
17. famsup - family educational support (binary: yes or no)
18. paid - extra paid classes within the course subject (binary: yes or no)
19. activities - extra-curricular activities (binary: yes or no)
20. nursery - attended nursery school (binary: yes or no)
21. higher - wants to take higher education (binary: yes or no)
22. internet - Internet access at home (binary: yes or no)
23. romantic - with a romantic relationship (binary: yes or no)
24. famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
25. freetime - free time after school (numeric: from 1 - very low to 5 - very high)
26. goout - going out with friends (numeric: from 1 - very low to 5 - very high)
27. Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
28. Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
29. health - current health status (numeric: from 1 - very bad to 5 - very good)
30. absences - number of school absences (numeric: from 0 to 93)
31. G1 - first period grade in Math (numeric: from 0 to 20)
32. G2 - second period grade in Math (numeric: from 0 to 20)
33. G3 - final grade in Math (numeric: from 0 to 20, output target)
**My goal will be to create a model with Random Forest that predicts the final grade of a student in Mathematics ("G3").**
# Examining and preparing the data
For this exercise, I will be using the packages tidyverse, caret and randomForest
```{r message=FALSE, warning=FALSE, paged.print=FALSE, results= 'hide'}
library(randomForest) #to implement the random forest model
library(tidyverse) #for data manipulation and the plot at the end of this exercise
library(caret) #to convert the categorical variables into dummies
```
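If any of these packages are not installed yet, a one-time install from CRAN should be all that's needed (this chunk is not evaluated here):
```{r eval=FALSE}
#one-time installation from CRAN, if needed
install.packages(c("randomForest", "tidyverse", "caret"))
```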
I will open the dataset and examine the variables. You don't actually need to download the CSV file to open this dataset in R, as R will automatically import the dataset from the URL I provide below (from my public GitHub repository). If you prefer, you can also download the dataset and the original RMarkdown file [here](https://github.com/angelajw/RandomForestRegression).
```{r}
d1<-read.csv("https://raw.githubusercontent.com/angelajw/RandomForestRegression/main/student-mat.csv", sep=";")
head(d1)
```
The dataset has 395 observations and 33 variables.
```{r}
dim(d1)
```
The mean age is 16.70, and 53% of the students are female
```{r}
nrow(d1[which(d1$sex == "F"),])/nrow(d1)*100
mean(d1$age)
```
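Since tidyverse is already loaded, the same figures can also be computed with a small dplyr sketch (just an equivalent alternative, not a separate step of the analysis):
```{r}
#share of students and mean age by sex, via dplyr
d1 %>%
  group_by(sex) %>%
  summarise(n = n(), mean_age = mean(age)) %>%
  mutate(pct = 100 * n / sum(n))
```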
I check if there are any missing values (there aren't)
```{r}
colSums(is.na(d1))
```
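As a quick cross-check, counting incomplete rows with complete.cases() should return 0:
```{r}
#number of rows with at least one missing value (expected: 0)
sum(!complete.cases(d1))
```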
# Splitting the data
Next, I will split the data into training and testing sets in an 80:20 ratio. Although 75:25 is a common default, I choose 80:20 because this dataset is not very large and I want a larger training set.
```{r}
set.seed(100) #for reproducibility
sample <- sample(nrow(d1), 0.80*nrow(d1), replace = FALSE)
train <- d1[sample,]
test <- d1[-sample,]
dim(train)
dim(test)
```
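As an aside, since caret is already loaded, its createDataPartition() function offers an alternative split that stratifies on the outcome, which helps keep the distribution of G3 similar in both sets. This is only a sketch of that alternative; the rest of the tutorial keeps the simple random split above:
```{r eval=FALSE}
#alternative: stratified 80:20 split on G3 with caret (not used below)
set.seed(100)
idx <- createDataPartition(d1$G3, p = 0.80, list = FALSE)
trainstrat <- d1[idx,]
teststrat <- d1[-idx,]
```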
# Implementing a Random Forest model
I will run a model with G3 as the outcome variable using the default parameters (number of trees to grow: *ntree* = **500** and number of variables randomly sampled as candidates at each split: *mtry* = **1/3 of the predictor variables**).
According to the randomForest package documentation, the description of the parameter *importance* is: "Should importance of predictors be assessed?". I will set *importance* as **TRUE**.
According to the randomForest package documentation, the argument *y* stands for "A response vector. If a factor, classification is assumed, otherwise regression is assumed. If omitted, randomForest will run in unsupervised mode." As y in this case (G3) is an integer, the random forest model will be a **regression** instead of a classification or an unsupervised model.
```{r}
set.seed(100)
model<-randomForest(G3 ~ . , data=train, importance = TRUE)
model
sqrt(2.974508) #rmse: square root of the mean of squared residuals printed above
```
Indeed, the type of random forest implemented is regression. This model explains 85.49% of the variance, and the root mean squared error is 1.72 grade points.
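Rather than hard-coding the printed mean of squared residuals, the OOB MSE can also be read directly from the fitted object: for a regression forest, model$mse contains one OOB estimate per tree, so the last entry corresponds to the full forest:
```{r}
#OOB RMSE taken from the fitted object instead of the printed output
sqrt(tail(model$mse, 1))
```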
I will plot the error vs. the number of trees:
```{r}
plot(model)
```
The error stabilizes at around 100 trees.
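If training time were a concern, this suggests a forest with fewer trees would perform almost as well. A quick sketch of that idea (not used in the rest of the tutorial):
```{r eval=FALSE}
#sketch: a 100-tree forest should give a similar OOB error (not used below)
set.seed(100)
modelsmall <- randomForest(G3 ~ . , data=train, ntree = 100, importance = TRUE)
modelsmall
```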
# Tuning the model p.1: setting the number of candidates considered at each split
To tune the number of candidates considered at each split, I will use the tuneRF function and search for the mtry value at which the OOB error stops improving by at least 1% (a threshold I set myself). I also set mtry to increase by a factor of 1.3 at each iteration.
```{r}
features <- setdiff(names(train), "G3")
set.seed(101)
model2 <- tuneRF(
x = train[features],
y = train$G3,
ntreeTry = 500,
mtryStart = 10, #starting value of mtry
stepFactor = 1.3, #by how much the mtry is inflated at each iteration
improve = 0.01, #the (relative) improvement in OOB error must be by this much for the search to continue
trace = FALSE, #whether to print the progress of the search
plot = TRUE,
doBest = TRUE #whether to run a forest using the optimal mtry found
)
model2
sqrt(2.377096) #rmse
```
In the graph, I can see that the OOB error stops improving at mtry = 26. With this parameter tuned, **my model now explains 88.4% of the variance in the data and the root mean squared error is 1.54 grade points.**
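Because doBest = TRUE makes tuneRF return a fitted forest, the mtry it actually used can be confirmed directly from the object:
```{r}
#mtry used by the forest returned by tuneRF
model2$mtry
```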
# Tuning the model p.2: converting categorical into dummies
The randomForest package is designed to handle categorical predictors directly. However, I will check whether converting them into dummies could reduce bias and improve the model's performance.
For this, I will convert all categorical variables into dummies with the package caret. The argument fullRank ensures that for k levels of a given variable only k-1 dummies are created, to avoid linear dependency.
```{r}
dmy <- dummyVars(" ~ .", data = d1, fullRank=T)
d2 <- data.frame(predict(dmy, newdata = d1))
head(d2)
```
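A quick sanity check: the dummy-expanded data should have the same number of rows as the original but more columns (one extra column for each additional factor level):
```{r}
dim(d1)
dim(d2)
```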
I will now repeat all the previous steps with the categorical predictors converted into dummies
```{r}
#splitting the data
set.seed(100)
sampledummy <- sample(nrow(d2), 0.80*nrow(d2), replace = FALSE)
traindummy <- d2[sampledummy,]
testdummy <- d2[-sampledummy,]
#modelling
set.seed(100)
modeldummy<-randomForest(G3 ~ . , data=traindummy, importance = TRUE)
modeldummy
#tuning the mtry parameter
features <- setdiff(names(traindummy), "G3")
set.seed(101)
modeldummy2 <- tuneRF(
x = traindummy[features],
y = traindummy$G3,
ntreeTry = 500,
mtryStart = 10, #starting value of mtry
stepFactor = 1.3, #by how much the mtry is inflated at each iteration
improve = 0.01, #the (relative) improvement in OOB error must be by this much for the search to continue
trace = FALSE, #whether to print the progress of the search
plot = TRUE,
doBest = TRUE #whether to run a forest using the optimal mtry found
)
modeldummy2
sqrt(2.342395) #rmse
```
**The new model explains 88.57% of the variance, and the root mean squared error is 1.53 grade points.** As this is a slight improvement in performance, I will use this new dataset and model from here on.
# Checking variable importance
I will now evaluate the importance of each variable in the model
```{r}
importance(modeldummy2)
varImpPlot(modeldummy2)
```
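To read the ranking numerically rather than off the plot, the importance matrix can also be sorted. A small convenience sketch (IncNodePurity is the measure available for this fit, since permutation importance was not requested when tuneRF refit the forest):
```{r}
#top predictors ranked by node-purity importance
imp <- importance(modeldummy2)
head(imp[order(imp[, "IncNodePurity"], decreasing = TRUE), , drop = FALSE], 5)
```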
**As expected, G2 is the most important predictor for G3. G1 and absences are the next most important predictors.**
# Using the Random Forest model to generate predictions
I will now use the model previously created with the training data to generate predictions on the test data
```{r}
pred <- predict(modeldummy2, newdata=testdummy)
head(pred)
head(testdummy$G3)
```
I will now compute the root mean squared error of my model prediction vs. the real grades in the test dataset
```{r}
sqrt(sum((pred - testdummy$G3)^2) / nrow(testdummy)) #rmse
```
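Since caret is loaded, the same figure can be cross-checked with its RMSE() helper:
```{r}
#cross-check with caret's RMSE helper
RMSE(pred, testdummy$G3)
```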
**The root mean squared error between the real grades in the test dataset and the predicted grades is 1.97 grade points.**
I will now plot the real vs. predicted scores for the test dataset with the package tidyverse. Points close to the diagonal correspond to accurate predictions.
```{r message=FALSE, warning=FALSE, paged.print=FALSE, results='hide'}
test_results <- testdummy %>%
  select(G3) %>%
  mutate(pred = pred) #add the predictions directly as a named column
```
```{r warning=FALSE}
ggplot(test_results, aes(x = G3, y = pred)) +
  geom_point(alpha = 0.5, color = 'blue') +
  geom_abline(linetype = 2) + #y = x reference line
  labs(x = 'Actual Grades', y = 'Predicted Grades')
```
And that's it!