---
title: "Tutorial: Random Forest Regression in R"
output:
  html_document:
    df_print: paged
    toc: true
    toc_float: true
    fig_width: 7
    fig_height: 7
---
By Ángela Jiang-Wang
R version used: 4.1.1 (2021-08-10) – “Kick Things”
Last updated: January 2022
*You can reach out to me on [twitter](https://twitter.com/angyjiwa) if you have any doubts or encounter any problems with this tutorial (general feedback is also welcome). Hope you find it helpful!*
# Introduction and dataset description
For this tutorial, I will be using a public dataset downloaded from UCI Machine Learning Repository called ["Student Performance - Mathematics"](https://archive.ics.uci.edu/ml/datasets/Student+Performance). You don't need to download the dataset to replicate the steps, as you will see below.
This dataset contains 33 attributes collected during the 2005-2006 school year through school reports and questionnaires in secondary education at two Portuguese schools.
The original authors provide a detailed description of the variables on the webpage, which I will copy here:
1. school - student's school (binary: 'GP' - Gabriel Pereira or 'MS' - Mousinho da Silveira)
2. sex - student's sex (binary: 'F' - female or 'M' - male)
3. age - student's age (numeric: from 15 to 22)
4. address - student's home address type (binary: 'U' - urban or 'R' - rural)
5. famsize - family size (binary: 'LE3' - less or equal to 3 or 'GT3' - greater than 3)
6. Pstatus - parent's cohabitation status (binary: 'T' - living together or 'A' - apart)
7. Medu - mother's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
8. Fedu - father's education (numeric: 0 - none, 1 - primary education (4th grade), 2 - 5th to 9th grade, 3 - secondary education or 4 - higher education)
9. Mjob - mother's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
10. Fjob - father's job (nominal: 'teacher', 'health' care related, civil 'services' (e.g. administrative or police), 'at_home' or 'other')
11. reason - reason to choose this school (nominal: close to 'home', school 'reputation', 'course' preference or 'other')
12. guardian - student's guardian (nominal: 'mother', 'father' or 'other')
13. traveltime - home to school travel time (numeric: 1 - <15 min., 2 - 15 to 30 min., 3 - 30 min. to 1 hour, or 4 - >1 hour)
14. studytime - weekly study time (numeric: 1 - <2 hours, 2 - 2 to 5 hours, 3 - 5 to 10 hours, or 4 - >10 hours)
15. failures - number of past class failures (numeric: n if 1<=n<3, else 4)
16. schoolsup - extra educational support (binary: yes or no)
17. famsup - family educational support (binary: yes or no)
18. paid - extra paid classes within the course subject (binary: yes or no)
19. activities - extra-curricular activities (binary: yes or no)
20. nursery - attended nursery school (binary: yes or no)
21. higher - wants to take higher education (binary: yes or no)
22. internet - Internet access at home (binary: yes or no)
23. romantic - with a romantic relationship (binary: yes or no)
24. famrel - quality of family relationships (numeric: from 1 - very bad to 5 - excellent)
25. freetime - free time after school (numeric: from 1 - very low to 5 - very high)
26. goout - going out with friends (numeric: from 1 - very low to 5 - very high)
27. Dalc - workday alcohol consumption (numeric: from 1 - very low to 5 - very high)
28. Walc - weekend alcohol consumption (numeric: from 1 - very low to 5 - very high)
29. health - current health status (numeric: from 1 - very bad to 5 - very good)
30. absences - number of school absences (numeric: from 0 to 93)
31. G1 - first period grade in Math (numeric: from 0 to 20)
32. G2 - second period grade in Math (numeric: from 0 to 20)
33. G3 - final grade in Math (numeric: from 0 to 20, output target)
**My goal will be to create a model with Random Forest that predicts the final grade of a student in Mathematics ("G3").**
# Examining and preparing the data
For this exercise, I will be using the packages tidyverse, caret and randomForest
```{r message=FALSE, warning=FALSE, paged.print=FALSE, results= 'hide'}
library(randomForest) #to implement the random forest model
library(tidyverse) #for data manipulation and the plot at the end of this exercise
library(caret) #to convert the categorical variables into dummies
```
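If any of these packages are not installed yet, a one-time install from CRAN should be all that's needed (this chunk is not evaluated here):
```{r eval=FALSE}
#one-time installation from CRAN, if needed
install.packages(c("randomForest", "tidyverse", "caret"))
```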
I will open the dataset and examine the variables. You don't actually need to download the CSV file to open this dataset in R, as R will automatically import the dataset from the URL I provide below (from my public GitHub repository). If you prefer, you can also download the dataset and the original RMarkdown file [here](https://github.com/angelajw/RandomForestRegression).
```{r}
d1<-read.csv("https://raw.githubusercontent.com/angelajw/RandomForestRegression/main/student-mat.csv", sep=";")
head(d1)
```
The dataset has 395 observations and 33 variables.
```{r}
dim(d1)
```
The mean age is 16.70, and 53% of the students are female
```{r}
nrow(d1[which(d1$sex == "F"),])/nrow(d1)*100
mean(d1$age)
```
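Since tidyverse is already loaded, the same figures can also be computed with a small dplyr sketch (just an equivalent alternative, not a separate step of the analysis):
```{r}
#share of students and mean age by sex, via dplyr
d1 %>%
  group_by(sex) %>%
  summarise(n = n(), mean_age = mean(age)) %>%
  mutate(pct = 100 * n / sum(n))
```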
I check if there are any missing values (there aren't)
```{r}
colSums(is.na(d1))
```
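As a quick cross-check, counting incomplete rows with complete.cases() should return 0:
```{r}
#number of rows with at least one missing value (expected: 0)
sum(!complete.cases(d1))
```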
# Splitting the data
Next, I will split the data into training and testing sets in an 80:20 ratio. Although 75:25 is a common default, I choose 80:20 because this dataset is not very large and I want a larger training set.
```{r}
set.seed(100) #for reproducibility
sample <- sample(nrow(d1), 0.80*nrow(d1), replace = FALSE)
train <- d1[sample,]
test <- d1[-sample,]
dim(train)
dim(test)
```
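As an aside, since caret is already loaded, its createDataPartition() function offers an alternative split that stratifies on the outcome, which helps keep the distribution of G3 similar in both sets. This is only a sketch of that alternative; the rest of the tutorial keeps the simple random split above:
```{r eval=FALSE}
#alternative: stratified 80:20 split on G3 with caret (not used below)
set.seed(100)
idx <- createDataPartition(d1$G3, p = 0.80, list = FALSE)
trainstrat <- d1[idx,]
teststrat <- d1[-idx,]
```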
# Implementing a Random Forest model
I will run a model with G3 as the outcome variable using the default parameters (number of trees to grow: *ntree* = **500** and number of variables randomly sampled as candidates at each split: *mtry* = **1/3 of the predictor variables**).
According to the randomForest package documentation, the description of the parameter *importance* is: "Should importance of predictors be assessed?". I will set *importance* as **TRUE**.
According to the randomForest package documentation, the argument *y* stands for "A response vector. If a factor, classification is assumed, otherwise regression is assumed. If omitted, randomForest will run in unsupervised mode." As y in this case (G3) is an integer, the random forest model will be a **regression** instead of a classification or an unsupervised model.
```{r}
set.seed(100)
model<-randomForest(G3 ~ . , data=train, importance = TRUE)
model
sqrt(2.974508) #rmse: square root of the mean of squared residuals printed above
```
Indeed, the type of random forest implemented is regression. This model explains 85.49% of the variance, and the root mean squared error is 1.72 grade points.
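Rather than hard-coding the printed mean of squared residuals, the OOB MSE can also be read directly from the fitted object: for a regression forest, model$mse contains one OOB estimate per tree, so the last entry corresponds to the full forest:
```{r}
#OOB RMSE taken from the fitted object instead of the printed output
sqrt(tail(model$mse, 1))
```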
I will plot the error vs. the number of trees:
```{r}
plot(model)
```
The error stabilizes at around 100 trees.
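If training time were a concern, this suggests a forest with fewer trees would perform almost as well. A quick sketch of that idea (not used in the rest of the tutorial):
```{r eval=FALSE}
#sketch: a 100-tree forest should give a similar OOB error (not used below)
set.seed(100)
modelsmall <- randomForest(G3 ~ . , data=train, ntree = 100, importance = TRUE)
modelsmall
```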
# Tuning the model p.1: setting the number of candidates considered at each split
To tune the number of candidates considered at each split, I will use the tuneRF function and search for the mtry value at which the OOB error stops improving by at least 1% (a threshold I set myself). I also set mtry to increase by a factor of 1.3 at each iteration.
```{r}
features <- setdiff(names(train), "G3")
set.seed(101)
model2 <- tuneRF(
x = train[features],
y = train$G3,
ntreeTry = 500,
mtryStart = 10, #starting value of mtry
stepFactor = 1.3, #by how much the mtry is inflated at each iteration
improve = 0.01, #the (relative) improvement in OOB error must be by this much for the search to continue
trace = FALSE, #whether to print the progress of the search
plot = TRUE,
doBest = TRUE #whether to run a forest using the optimal mtry found
)
model2
sqrt(2.377096) #rmse
```
In the graph, I can see that the OOB error stops improving at mtry = 26. With this parameter tuned, **my model now explains 88.4% of the variance in the data and the root mean squared error is 1.54 grade points.**
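Because doBest = TRUE makes tuneRF return a fitted forest, the mtry it actually used can be confirmed directly from the object:
```{r}
#mtry used by the forest returned by tuneRF
model2$mtry
```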
# Tuning the model p.2: converting categorical into dummies
The randomForest package is designed to handle categorical predictors directly. However, I will check whether converting them into dummies could reduce bias and improve the model's performance.
For this, I will convert all categorical variables into dummies with the package caret. The argument fullRank ensures that for k levels of a given variable only k-1 dummies are created, to avoid linear dependency.
```{r}
dmy <- dummyVars(" ~ .", data = d1, fullRank=T)
d2 <- data.frame(predict(dmy, newdata = d1))
head(d2)
```
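A quick sanity check: the dummy-expanded data should have the same number of rows as the original but more columns (one extra column for each additional factor level):
```{r}
dim(d1)
dim(d2)
```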
I will now repeat all the previous steps with the categorical predictors converted into dummies
```{r}
#splitting the data
set.seed(100)
sampledummy <- sample(nrow(d2), 0.80*nrow(d2), replace = FALSE)
traindummy <- d2[sampledummy,]
testdummy <- d2[-sampledummy,]
#modelling
set.seed(100)
modeldummy<-randomForest(G3 ~ . , data=traindummy, importance = TRUE)
modeldummy
#tuning the mtry parameter
features <- setdiff(names(traindummy), "G3")
set.seed(101)
modeldummy2 <- tuneRF(
x = traindummy[features],
y = traindummy$G3,
ntreeTry = 500,
mtryStart = 10, #starting value of mtry
stepFactor = 1.3, #by how much the mtry is inflated at each iteration
improve = 0.01, #the (relative) improvement in OOB error must be by this much for the search to continue
trace = FALSE, #whether to print the progress of the search
plot = TRUE,
doBest = TRUE #whether to run a forest using the optimal mtry found
)
modeldummy2
sqrt(2.342395) #rmse
```
**The new model explains 88.57% of the variance, and the root mean squared error is 1.53 grade points.** As this is a slight improvement in performance, I will use this new dataset and model from here on.
# Checking variable importance
I will now evaluate the importance of each variable in the model
```{r}
importance(modeldummy2)
varImpPlot(modeldummy2)
```
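To read the ranking numerically rather than off the plot, the importance matrix can also be sorted. A small convenience sketch (IncNodePurity is the measure available for this fit, since permutation importance was not requested when tuneRF refit the forest):
```{r}
#top predictors ranked by node-purity importance
imp <- importance(modeldummy2)
head(imp[order(imp[, "IncNodePurity"], decreasing = TRUE), , drop = FALSE], 5)
```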
**As expected, G2 is the most important predictor for G3. G1 and absences are the next most important predictors.**
# Using the Random Forest model to generate predictions
I will now use the model previously created with the training data to generate predictions on the test data
```{r}
pred <- predict(modeldummy2, newdata=testdummy)
head(pred)
head(testdummy$G3)
```
I will now compute the root mean squared error of my model prediction vs. the real grades in the test dataset
```{r}
sqrt(sum((pred - testdummy$G3)^2) / nrow(testdummy)) #rmse
```
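Since caret is loaded, the same figure can be cross-checked with its RMSE() helper:
```{r}
#cross-check with caret's RMSE helper
RMSE(pred, testdummy$G3)
```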
**The root mean squared error between the real grades in the test dataset and the predicted grades is 1.97 grade points.**
I will now plot the real vs. predicted scores for the test dataset with the package tidyverse. Points close to the diagonal correspond to accurate predictions.
```{r message=FALSE, warning=FALSE, paged.print=FALSE, results='hide'}
test_results <- testdummy %>%
  select(G3) %>%
  mutate(pred = pred) #add the predictions directly as a named column
```
```{r warning=FALSE}
ggplot(test_results, aes(x = G3, y = pred)) +
  geom_point(alpha = 0.5, color = 'blue') +
  geom_abline(linetype = 2) + #y = x reference line
  labs(x = 'Actual Grades', y = 'Predicted Grades')
```
And that's it!