Part_3.html

<!DOCTYPE html>
<html xmlns="http://www.w3.org/1999/xhtml" lang="" xml:lang="">
  <head>
    <title>Applied Machine Learning</title>
    <meta charset="utf-8" />
    <meta name="author" content="Max Kuhn and Davis Vaughan (RStudio)" />
    <meta name="date" content="2020-01-26" />
    <link href="libs/remark-css-0.0.1/default.css" rel="stylesheet" />
    <script src="libs/kePrint-0.0.1/kePrint.js"></script>
    <link href="libs/countdown-0.3.3/countdown.css" rel="stylesheet" />
    <script src="libs/countdown-0.3.3/countdown.js"></script>
    <link rel="stylesheet" href="https://use.fontawesome.com/releases/v5.8.2/css/all.css" integrity="sha384-oS3vJWv+0UjzBfQzYUhtDYW+Pj2yciDJxpsK1OYPAYjqT085Qq/1cq5FLXAZQ7Ay" crossorigin="anonymous">
    <link rel="stylesheet" href="assets/css/aml-theme.css" type="text/css" />
    <link rel="stylesheet" href="assets/css/aml-fonts.css" type="text/css" />
  </head>
  <body>
    <textarea id="source">


class: title-slide, center

&lt;span class="fa-stack fa-4x"&gt;
  &lt;i class="fa fa-circle fa-stack-2x" style="color: #ffffff;"&gt;&lt;/i&gt;
  &lt;strong class="fa-stack-1x" style="color:#E7553C;"&gt;3&lt;/strong&gt;
&lt;/span&gt; 

# Applied Machine Learning

## Feature Engineering


---
# Loading


```r
library(tidymodels)
```

```
## ── Attaching packages ─────────────────────────────────────────────────────────── tidymodels 0.0.4 ──
```

```
## ✓ broom     0.5.3     ✓ recipes   0.1.9
## ✓ dials     0.0.4     ✓ rsample   0.0.5
## ✓ dplyr     0.8.3     ✓ tibble    2.1.3
## ✓ ggplot2   3.2.1     ✓ tune      0.0.1
## ✓ infer     0.5.1     ✓ workflows 0.1.0
## ✓ parsnip   0.0.5     ✓ yardstick 0.0.5
## ✓ purrr     0.3.3
```

```
## ── Conflicts ────────────────────────────────────────────────────────────── tidymodels_conflicts() ──
## x purrr::discard()    masks scales::discard()
## x dplyr::filter()     masks stats::filter()
## x dplyr::lag()        masks stats::lag()
## x ggplot2::margin()   masks dials::margin()
## x recipes::step()     masks stats::step()
## x recipes::yj_trans() masks scales::yj_trans()
```


---
# Previously


```r
library(AmesHousing)

ames &lt;- make_ames() %&gt;% 
  dplyr::select(-matches("Qu"))

set.seed(4595)

data_split &lt;- initial_split(ames, strata = "Sale_Price")

ames_train &lt;- training(data_split)
ames_test  &lt;- testing(data_split)

lm_mod &lt;- linear_reg() %&gt;% 
  set_engine("lm")

perf_metrics &lt;- metric_set(rmse, rsq, ccc)
```

---

layout: false
class: inverse, middle, center

#  Feature Engineering

---
# Preprocessing and Feature Engineering

This part mostly concerns what we can _do_ to our variables to make the models more effective. 

This is mostly related to the predictors. Operations that we might use are:

* transformations of individual predictors or groups of variables

* alternate encodings of a variable

* elimination of predictors (unsupervised)

In statistics, this is generally called _preprocessing_ the data. As usual, the computer science side of modeling has a much flashier name: _feature engineering_. 


---

# Reasons for Modifying the Data

* Some models (_K_-NN, SVMs, PLS, neural networks) require that the predictor variables have the same units. **Centering** and **scaling** the predictors can be used for this purpose. 

* Other models are very sensitive to correlations between the predictors and **filters** or **PCA signal extraction** can improve the model. 

* As we'll see in an example, changing the scale of the predictors using a **transformation** can lead to a big improvement. 

* In other cases, the data can be **encoded** in a way that maximizes its effect on the model. Representing the date as the day of the week can be very effective for modeling public transportation data. 

---

# Reasons for Modifying the Data

* Many models cannot cope with missing data so **imputation** strategies might be necessary.  

* Development of new _features_ that represent something important to the outcome (e.g. compute distances to public transportation, university buildings, public schools, etc.) 

---
layout: false
class: inverse, middle, center

# Preprocessing Categorical Predictors


---

# Dummy Variables


One common procedure for modeling is to create numeric representations of categorical data. This is usually done via _dummy variables_: a set of binary 0/1 variables for different levels of an R factor. 

For example, the Ames housing data contains a predictor called `Alley` with levels: 'Gravel', 'No_Alley_Access', 'Paved'. 

Most dummy variable procedures would make _two_ numeric variables from this predictor that are 1 when the observation has that level, and 0 otherwise.

&lt;table class="table" style="width: auto !important; margin-left: auto; margin-right: auto;"&gt;
 &lt;thead&gt;
&lt;tr&gt;
&lt;th style="border-bottom:hidden; padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; " colspan="1"&gt;&lt;div style="border-bottom: 1px solid #ddd; padding-bottom: 5px; "&gt;Data&lt;/div&gt;&lt;/th&gt;
&lt;th style="border-bottom:hidden; padding-bottom:0; padding-left:3px;padding-right:3px;text-align: center; " colspan="2"&gt;&lt;div style="border-bottom: 1px solid #ddd; padding-bottom: 5px; "&gt;Dummy Variables&lt;/div&gt;&lt;/th&gt;
&lt;/tr&gt;
  &lt;tr&gt;
   &lt;th style="text-align:left;"&gt;   &lt;/th&gt;
   &lt;th style="text-align:right;"&gt; No_Alley_Access &lt;/th&gt;
   &lt;th style="text-align:right;"&gt; Paved &lt;/th&gt;
  &lt;/tr&gt;
 &lt;/thead&gt;
&lt;tbody&gt;
  &lt;tr&gt;
   &lt;td style="text-align:left;"&gt; Gravel &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 0 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 0 &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td style="text-align:left;"&gt; No_Alley_Access &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 1 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 0 &lt;/td&gt;
  &lt;/tr&gt;
  &lt;tr&gt;
   &lt;td style="text-align:left;"&gt; Paved &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 0 &lt;/td&gt;
   &lt;td style="text-align:right;"&gt; 1 &lt;/td&gt;
  &lt;/tr&gt;
&lt;/tbody&gt;
&lt;/table&gt;


---

# Dummy Variables

If there are _C_ levels of the factor, only _C_-1 dummy variables are created since the last can be inferred from the others. There are different contrast schemes for creating the new variables. 

How do you create them in R? 

The formula method does this for you&lt;sup&gt;1&lt;/sup&gt;. Otherwise, the traditional method is to use `model.matrix()` to create a matrix. However, there are some caveats to this that can make things difficult. 

We'll show another method for making them shortly. 

.footnote[[1] _Almost always_ at least. Tree- and rule-based model functions do not. Examples are `randomforest`, `ranger`, `rpart`, `C50`, `Cubist`, `klaR::NaiveBayes` and others.]

???
Caveats include new (unseen) levels of a predictor value.

---

# Infrequent Levels in Categorical Factors

.pull-left[
One issue is: what happens when there are very few values of a level? 

Consider the Ames training set and the `Neighborhood` variable.

If these data are resampled, what would happen to Landmark and similar locations when dummy variables are created?
]
.pull-right[
&lt;img src="images/part-3-ames-hood-1.svg" width="100%" style="display: block; margin: auto;" /&gt;
]

???
Bring up the idea that these issues are model-dependent and something like trees wouldn't care. 

Mention the alley variable and how almost all properties have no alley access. 

Talk about near-zero-variance predictors. 

---

# Infrequent Levels in Categorical Factors

A _zero-variance_ predictor that has only a single value (zero) would be the result. 

Many models (e.g. linear/logistic regression, etc.) would find this numerically problematic and issue a warning and `NA` values for that coefficient. Trees and similar models would not notice. 

There are two main approaches to dealing with this: 

 * Run a filter on the training set predictors prior to running the model and remove the zero-variance predictors.
 
 * Recode the factor so that infrequently occurring predictors (and possibly new values) are pooled into an "other" category. 
 
However, `model.matrix()` and the formula method are incapable of helping you. 


---

# Recipes &lt;img src="images/recipes.png" class="title-hex"&gt;

Recipes are an alternative method for creating the data frame of predictors for a model. They allow for a sequence of _steps_ that define how data should be handled. 

Recall the previous part where we used the formula `log10(Sale_Price) ~ Longitude + Latitude`? These steps are:

.pull-left-a-little[

* Assign `Sale_Price` to be the outcome

* Assign `Longitude` and `Latitude` as predictors

* Log transform the outcome

]

.pull-right-a-lot[

To start using a recipe, these steps can be done using 


```r
# recipes loaded by tidymodels
mod_rec &lt;- recipe(Sale_Price ~ Longitude + Latitude, ames_train) %&gt;%
  step_log(Sale_Price, base = 10)
```

This creates the recipe for data processing (but does not execute it yet)

]

---

# Recipes and Categorical Predictors &lt;img src="images/recipes.png" class="title-hex"&gt;

To deal with the dummy variable issue, we can expand the recipe with more steps:

.pull-left[


```r
mod_rec &lt;- recipe(
    Sale_Price ~ Longitude + Latitude + Neighborhood, 
    data = ames_train
  ) %&gt;%
  
  step_log(Sale_Price, base = 10) %&gt;%
  
  # Lump factor levels that occur in 
  # &lt;= 5% of data as "other"
  step_other(Neighborhood, threshold = 0.05) %&gt;%
  
  # Create dummy variables for _any_ factor variables
  step_dummy(all_nominal())
```

]

.pull-right[


```r
mod_rec
```

```
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          3
## 
## Operations:
## 
## Log transformation on Sale_Price
## Collapsing factor levels for Neighborhood
## Dummy variables from all_nominal
```

]

Note that we can use standard `dplyr` selectors as well as some new ones based on the data type (`all_nominal()`) or by their role in the analysis (`all_predictors()`).

---
# Using Recipes &lt;img src="images/recipes.png" class="title-hex"&gt;

&lt;br&gt;

&lt;br&gt;


&lt;img src="images/recipes-process.svg" width="70%" style="display: block; margin: auto;" /&gt;


---

# Preparing the Recipe &lt;img src="images/recipes.png" class="title-hex"&gt;

Now that we have a preprocessing _specification_, let's run it on the training set to _prepare_ the recipe:


```r
mod_rec_trained &lt;- prep(mod_rec, training = ames_train, verbose = TRUE)
```

```
## oper 1 step log [training] 
## oper 2 step other [training] 
## oper 3 step dummy [training] 
## The retained training set is ~ 0.19 Mb  in memory.
```

Here, the "training" is to determine which levels to lump together and to enumerate the factor levels of the `Neighborhood` variable.

---

# Preparing the Recipe &lt;img src="images/recipes.png" class="title-hex"&gt;


```r
mod_rec_trained
```

```
## Data Recipe
## 
## Inputs:
## 
##       role #variables
##    outcome          1
##  predictor          3
## 
## Training data contained 2199 data points and no missing data.
## 
## Operations:
## 
## Log transformation on Sale_Price [trained]
## Collapsing factor levels for Neighborhood [trained]
## Dummy variables from Neighborhood [trained]
```

---

# Getting the Values - Training &lt;img src="images/recipes.png" class="title-hex"&gt;

Now that the recipe has been prepared, we can extract the processed training set from it, with all of the steps applied. To do that, we use `juice()`.


```r
# Extracts processed version of `ames_train`
juice(mod_rec_trained)
```

```
## # A tibble: 2,199 x 11
##    Longitude Latitude Sale_Price Neighborhood_Co… Neighborhood_Ol… Neighborhood_Ed… Neighborhood_So…
##        &lt;dbl&gt;    &lt;dbl&gt;      &lt;dbl&gt;            &lt;dbl&gt;            &lt;dbl&gt;            &lt;dbl&gt;            &lt;dbl&gt;
##  1     -93.6     42.1       5.24                0                0                0                0
##  2     -93.6     42.1       5.39                0                0                0                0
##  3     -93.6     42.1       5.28                0                0                0                0
##  4     -93.6     42.1       5.29                0                0                0                0
##  5     -93.6     42.1       5.33                0                0                0                0
##  6     -93.6     42.1       5.28                0                0                0                0
##  7     -93.6     42.1       5.37                0                0                0                0
##  8     -93.6     42.1       5.28                0                0                0                0
##  9     -93.6     42.1       5.25                0                0                0                0
## 10     -93.6     42.1       5.26                0                0                0                0
## # … with 2,189 more rows, and 4 more variables: Neighborhood_Northridge_Heights &lt;dbl&gt;,
## #   Neighborhood_Gilbert &lt;dbl&gt;, Neighborhood_Sawyer &lt;dbl&gt;, Neighborhood_other &lt;dbl&gt;
```

This is what you'd pass on to `fit()` your model.

---

# Getting the Values - Testing &lt;img src="images/recipes.png" class="title-hex"&gt;

After model fitting, you'll eventually want to make predictions on _new data_. But first, you have to reapply all of the pre-processing steps on it. To do that, use `bake()`.


```r
bake(mod_rec_trained, new_data = ames_test)
```

```
## # A tibble: 731 x 11
##    Longitude Latitude Sale_Price Neighborhood_Co… Neighborhood_Ol… Neighborhood_Ed… Neighborhood_So…
##        &lt;dbl&gt;    &lt;dbl&gt;      &lt;dbl&gt;            &lt;dbl&gt;            &lt;dbl&gt;            &lt;dbl&gt;            &lt;dbl&gt;
##  1     -93.6     42.1       5.33                0                0                0                0
##  2     -93.6     42.1       5.02                0                0                0                0
##  3     -93.6     42.1       5.27                0                0                0                0
##  4     -93.6     42.1       5.60                0                0                0                0
##  5     -93.6     42.1       5.28                0                0                0                0
##  6     -93.6     42.1       5.17                0                0                0                0
##  7     -93.6     42.1       5.02                0                0                0                0
##  8     -93.7     42.1       5.46                0                0                0                0
##  9     -93.7     42.1       5.44                0                0                0                0
## 10     -93.7     42.1       5.33                0                0                0                0
## # … with 721 more rows, and 4 more variables: Neighborhood_Northridge_Heights &lt;dbl&gt;,
## #   Neighborhood_Gilbert &lt;dbl&gt;, Neighborhood_Sawyer &lt;dbl&gt;, Neighborhood_other &lt;dbl&gt;
```

This is what you'd pass on to `predict()`.

---
class: middle, center

# `juice()` is used to get the pre-processed training set &lt;br&gt; (basically for free)

# `bake()` is used to pre-process a _new_ data set

---

# How Data Are Used &lt;img src="images/recipes.png" class="title-hex"&gt;

Note that we have:


```r
recipe(..., data = data_set)
prep(...,   training = data_set)
bake(...,   new_data = data_set)
```

* `recipe()` - `data` is used _only_ to determine column names and types. A 0-row data frame could even be used.

* `prep()` - `training` is the entire training set, used to estimate parameters in each step (like means or standard deviations).

* `bake()` - `new_data` is data to apply the pre-processing to, using the _same estimated parameters_ from when `prep()` was called on the training set.

---

# Hands-On: Zero-Variance Filter

Instead of using `step_other()`, take 10 minutes and research how to eliminate any zero-variance predictors using the [`recipe` reference site](https://tidymodels.github.io/recipes/reference/index.html).

Re-run the recipe with this step. 

What were the results? 

Do you prefer either of these approaches to the other? 

<div class="countdown" id="timer_5e2e3155" style="bottom:0;left:1;" data-warnwhen="0">
<code class="countdown-time"><span class="countdown-digits minutes">10</span><span class="countdown-digits colon">:</span><span class="countdown-digits seconds">00</span></code>
</div>


---
layout: false
class: inverse, middle, center

#  Interaction Effects


---

# Interactions &lt;img src="images/ggplot2.png" class="title-hex"&gt;

An **interaction** between two predictors indicates that the relationship between the predictors and the outcome cannot be describe using only one of the variables. 

For example, let's look at the relationship between the price of a house and the year in which it was built. The relationship appears to be slightly nonlinear, possibly quadratic:

.pull-left[

```r
price_breaks &lt;- (1:6)*(10^5)

ames_train %&gt;%
  ggplot(aes(x = Year_Built, y = Sale_Price)) + 
  geom_point(alpha = 0.4) +
  scale_y_log10() + 
  geom_smooth(method = "loess")
```
]

.pull-right[
&lt;img src="images/part-3-year-built-plot-1.svg" width="80%" style="display: block; margin: auto;" /&gt;
]


---

# Interactions &lt;img src="images/ggplot2.png" class="title-hex"&gt;

However... what if we separate this trend based on whether the property has air conditioning or not.

.pull-left[


```r
ames_train %&gt;%
  group_by(Central_Air) %&gt;%
  summarise(n = n()) %&gt;%
  mutate(percent = n / sum(n) * 100)
```

```
## # A tibble: 2 x 3
##   Central_Air     n percent
##   &lt;fct&gt;       &lt;int&gt;   &lt;dbl&gt;
## 1 N             141    6.41
## 2 Y            2058   93.6
```


```r
# to get robust linear regression model
library(MASS)

ames_train %&gt;%
  ggplot(aes(x = Year_Built, y = Sale_Price)) + 
  geom_point(alpha = 0.4) +
  scale_y_log10() + 
  facet_wrap(~ Central_Air, nrow = 2) +
  geom_smooth(method = "rlm") 
```

]

.pull-right[

&lt;img src="images/part-3-year-built-ac-plot-1.svg" width="100%" style="display: block; margin: auto;" /&gt;

]

---

# Interactions

It appears as though the relationship between the year built and the sale price is somewhat _different_ for the two groups. 

* When there is no AC, the trend is perhaps flat or slightly decreasing.
* With AC, there is a linear increasing trend or is perhaps slightly quadratic with some outliers at the low end. 


```r
mod1 &lt;- lm(log10(Sale_Price) ~ Year_Built + Central_Air,                          data = ames_train)
mod2 &lt;- lm(log10(Sale_Price) ~ Year_Built + Central_Air + Year_Built:Central_Air, data = ames_train)
anova(mod1, mod2)
```

```
## Analysis of Variance Table
## 
## Model 1: log10(Sale_Price) ~ Year_Built + Central_Air
## Model 2: log10(Sale_Price) ~ Year_Built + Central_Air + Year_Built:Central_Air
##   Res.Df    RSS Df Sum of Sq      F   Pr(&gt;F)    
## 1   2196 42.741                                 
## 2   2195 41.733  1    1.0075 52.993 4.64e-13 ***
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
```


---

# Interactions in Recipes &lt;img src="images/recipes.png" class="title-hex"&gt;

We first create the dummy variables for the qualitative predictor (`Central_Air`) then use a formula to create the interaction using the `:` operator in an additional step:


```r
interact_rec &lt;- recipe(Sale_Price ~ Year_Built + Central_Air, data = ames_train) %&gt;%
  step_log(Sale_Price) %&gt;%
  step_dummy(Central_Air) %&gt;%
  step_interact(~ starts_with("Central_Air"):Year_Built)
  
interact_rec %&gt;%
  prep(training = ames_train) %&gt;%
  juice() %&gt;%
  # select a few rows with different values
  slice(153:157)
```

```
## # A tibble: 5 x 4
##   Year_Built Sale_Price Central_Air_Y Central_Air_Y_x_Year_Built
##        &lt;int&gt;      &lt;dbl&gt;         &lt;dbl&gt;                      &lt;dbl&gt;
## 1       1915       11.9             1                       1915
## 2       1912       12.0             1                       1912
## 3       1920       11.7             1                       1920
## 4       1963       11.6             0                          0
## 5       1930       10.9             0                          0
```


---
layout: false
class: inverse, middle, center

# Principal Component Analysis

---

# A Bivariate Example

.pull-left[
.font120[

The plot on the right shows two **predictors** from a real _test_ set where the objective is to predict the two classes. 

The predictors are strongly correlated and each has a right-skewed distribution. 

There appears to be some class separation but only in the bivariate plot; the individual predictors show poor discrimination of the classes. 

Some models might be sensitive to highly correlated and/or skewed predictors. 

Is there something that we can do to make the predictors _easier for the model to use_?

***Any ideas***?

]
]

.pull-right[

&lt;img src="images/part-3-bivariate-plot-natural-1.svg" width="90%" style="display: block; margin: auto;" /&gt;

]

???
Skewed, positive data =&gt; ratios (to me)

Mention the use of logistic regression and ROC curve results.

These data are in the github repo


---

# A Bivariate Example


.pull-left[
.font110[
We might start by estimating transformations of the predictors to resolve the skewness. 

The Box-Cox transformation is a family of transformations originally designed for the outcomes of models. We can use it here for the predictors. 

It uses the data to estimate a wide variety of transformations including the inverse, log, sqrt, and polynomial functions. 

Using each factor in isolation, both predictors were determined to need inverse transformations (approximately). 

The figure on the right shows the data after these transformations have been applied. 

A logistic regression model shows a substantial improvement in classifying using the altered data. 
]
]
.pull-right[
&lt;img src="images/part-3-bivariate-plot-inverse-1.svg" width="90%" style="display: block; margin: auto;" /&gt;
]

???
The reason to show the _test data_ (or some other data) is to emphasize that we can visually overfit by reapplying the model to the data (as before).

Ordinarily, we would not use the test set in this way. When there is enough data, use a random sample (like a validation set) to evaluate the changes. 


---

# More Recipe Steps &lt;img src="images/recipes.png" class="title-hex"&gt;

The package has a [rich set](https://tidymodels.github.io/recipes/reference/index.html) of steps that can be used including transformations, filters, variable creation and removal, dimension reduction procedures, imputation, and others. 

There are also packages like [`embed`](	https://tidymodels.github.io/embed), [`textrecipes`](	https://tidymodels.github.io/textrecipes), and [`themis`](	https://tidymodels.github.io/themis) that extend recipes with new steps.

For example, in the previous bivariate data problem, the Box-Cox transformation was conducted using:


```r
bivariate_rec &lt;- recipe(Class ~ ., data = bivariate_data_train) %&gt;%
  step_BoxCox(all_predictors())

bivariate_rec &lt;- prep(bivariate_rec, training = bivariate_data_train, verbose = FALSE)

inverse_test_data &lt;- bake(bivariate_rec, new_data = bivariate_data_test)
```

???
Show `tidy` method to get the lambda values


---

# Correlated Predictors

In the Ames data, there are potential clusters of _highly correlated variables_: 

 * proxies for size: `Lot_Area`, `Gr_Liv_Area`, `First_Flr_SF`, `Bsmt_Unf_SF`, `Full_Bath` etc. 
 
 * quality fields: `Overall_Qual`, `Garage_Qual`, `Kitchen_Qual`, `Exter_Qual`, etc. 
 
It would be nice if we could combine/amalgamate the variables in these clusters into a single variable that represents them. 

Another way of putting this is that we would like to create artificial features of the data that account for a certain amount of _variation_ in the data. 

There are a few different methods that can accomplish this; we will focus on principal component analysis (PCA). Another, regularization, will be discussed later. 


---

# PCA Signal Extraction 

Principal component analysis (PCA) is a multivariate statistical technique that can be used to create artificial new variables from an existing set. 

Conceptually, PCA determines which variables account for the most correlation in the data and creates a new variable that is a linear combination of all the predictors. 

* This is called the _first principal component_ (aka `PC1`).

* This linear combination emphasizes the variables that are the most correlated. 

The variables constructing `PC1` are then _removed from the data_. 

The second PCA component is the linear combination that accounts for the most left-over correlation in the data (and so on). 

---

# PCA Signal Extraction 

The main takeaways:

* The components account for as much as the variation in the original data as possible.

* Each component is uncorrelated with the others. 

* The new variables are _linear combinations_ of all of the input variables and are effectively unitless (It is generally a good idea to center and scale your predictors because of this).


For our purposes, we would use PCA on the _predictors_ to:

* Reduce the number of variables exposed to the model (but this is not feature selection).

* Combat excessive correlations between the predictors (aka multicollinearity).

In this way, the procedure is often called _signal extraction_ but this is poorly named since there is no guarantee that the new variables will have an association with the outcome. 

???
 No feature selection due to linear combinations


---

# Back to the Bivariate Example - Transformed Data

&lt;img src="images/part-3-bivariate-rec-orig-1.svg" width="40%" style="display: block; margin: auto;" /&gt;

---

# Back to the Bivariate Example - Recipes &lt;img src="images/recipes.png" class="title-hex"&gt;&lt;img src="images/ggplot2.png" class="title-hex"&gt;

We can build on our transformed data recipe and add normalization: 


```r
bivariate_pca &lt;- 
  recipe(Class ~ PredictorA + PredictorB, data = bivariate_data_train) %&gt;%
  step_BoxCox(all_predictors()) %&gt;%
  step_normalize(all_predictors()) %&gt;% # center and scale
  step_pca(all_predictors()) %&gt;%
  prep(training = bivariate_data_train)

pca_test &lt;- bake(bivariate_pca, new_data = bivariate_data_test)

# Put components axes on the same range
pca_rng &lt;- extendrange(c(pca_test$PC1, pca_test$PC2))

pca_test %&gt;%
  ggplot(aes(x = PC1, y = PC2, color = Class)) +
  geom_point(alpha = .2, cex = 1.5) + 
  theme(legend.position = "top") +
  scale_colour_calc() +
  xlim(pca_rng) + ylim(pca_rng) + 
  xlab("Principal Component 1") + 
  ylab("Principal Component 2")
```


???
Order matters; Box-Cox before centering; 

YJ transformation


---

# Back to the Bivariate Example

.pull-left[
.font120[

Recall that even after the Box-Cox transformation was applied to our previous example, there was still a high degree of correlation between the predictors. 

After the transformation, the predictors were centered and scaled, then PCA was conducted. The plot on the right shows the results. 

Since these two predictors are highly correlated, the first component captures 91.7% of the variation in the original data. However...

...recall that PCA does not guarantee that the components are associated with the outcome. In this example, the _least important_ component has the association with the outcome.  

]
]
.pull-right[

&lt;img src="images/part-3-bivariate-plot-pca-1.svg" width="90%" style="display: block; margin: auto;" /&gt;

]


---
class: middle, center


# PCA does a _rotation_ of the data so that the _variation_ in one dimension is maximized. 

# The rotation also makes the new variables _uncorrelated_.

---
class: middle, center


&lt;img src="images/rotate.gif" width="40%" /&gt;

---
layout: false
class: inverse, middle, center

# Recipe and Models


---

# Longitude &lt;img src="images/ggplot2.png" class="title-hex"&gt;

.pull-left[


```r
ggplot(ames_train, 
       aes(x = Longitude, y = Sale_Price)) + 
  geom_point(alpha = .5) + 
  geom_smooth(
    method = "lm", 
    formula = y ~ splines::bs(x, 5), 
    se = FALSE
  ) + 
  scale_y_log10()
```

Splines add nonlinear versions of the predictor to a linear model to create smooth and flexible relationships between the predictor and outcome.

This "basis expansion" technique will be seen again in the regression section of the workshop. 

]

.pull-right[

&lt;img src="images/part-3-longitude-1.svg" width="100%" style="display: block; margin: auto;" /&gt;

]


---

# Latitude &lt;img src="images/ggplot2.png" class="title-hex"&gt;

.pull-left[


```r
ggplot(ames_train, 
       aes(x = Latitude, y = Sale_Price)) + 
  geom_point(alpha = .5) + 
  geom_smooth(
    method = "lm", 
    formula = y ~ splines::ns(x, df = 5), 
    se = FALSE
  ) + 
  scale_y_log10()
```

]

.pull-right[

&lt;img src="images/part-3-latitude-1.svg" width="100%" style="display: block; margin: auto;" /&gt;

]

---

# Linear Models Again &lt;img src="images/recipes.png" class="title-hex"&gt;

.pull-left-a-little[

* We'll add neighborhood in as well and a few other house features. 

* Our plots suggests that the coordinates can be helpful but probably require a nonlinear representation. We can add these using _B-splines_ with 5 degrees of freedom.

]

.pull-right-a-lot[

* Two numeric predictors are very skewed and could use a transformation (`Lot_Area` and `Gr_Liv_Area`).


```r
ames_rec &lt;- recipe(
    Sale_Price ~ Bldg_Type + Neighborhood + Year_Built + 
                 Gr_Liv_Area + Full_Bath + Year_Sold + Lot_Area +
                 Central_Air + Longitude + Latitude,
    data = ames_train
  ) %&gt;%
  step_log(Sale_Price, base = 10) %&gt;%
  step_BoxCox(Lot_Area, Gr_Liv_Area) %&gt;%
  step_other(Neighborhood, threshold = 0.05)  %&gt;%
  step_dummy(all_nominal()) %&gt;%
  step_interact(~ starts_with("Central_Air"):Year_Built) %&gt;%
  step_ns(Longitude, Latitude, deg_free = 5)
```

]


---

# Combining the Recipe with a Model &lt;img src="images/recipes.png" class="title-hex"&gt;&lt;img src="images/parsnip.png" class="title-hex"&gt;&lt;img src="images/broom.png" class="title-hex"&gt;

- `prep()` - `juice()` - `fit()`


```r
ames_rec &lt;- prep(ames_rec)

lm_fit &lt;- 
  lm_mod %&gt;% 
  fit(Sale_Price ~ ., data = juice(ames_rec))   # The recipe puts Sale_Price on the log scale

glance(lm_fit$fit)
```

```
## # A tibble: 1 x 11
##   r.squared adj.r.squared  sigma statistic p.value    df logLik    AIC    BIC deviance df.residual
##       &lt;dbl&gt;         &lt;dbl&gt;  &lt;dbl&gt;     &lt;dbl&gt;   &lt;dbl&gt; &lt;int&gt;  &lt;dbl&gt;  &lt;dbl&gt;  &lt;dbl&gt;    &lt;dbl&gt;       &lt;int&gt;
## 1     0.802         0.799 0.0800      303.       0    30  2448. -4834. -4657.     13.9        2169
```

- `bake()` - `predict()`


```r
ames_test_processed &lt;- bake(ames_rec, ames_test, all_predictors())

# but let's not do this
# predict(lm_fit, new_data = ames_test_processed)
```

---

# That Modeling Process Again


.pull-left[

We are 

 * estimating transformations on some variables, 
 
 * making nonlinear features for the geocoding variables, and 
 
 * estimating new encodings for some factor variables. 

These should definitely be included in our overall modeling process. 

]
.pull-right[

&lt;img src="images/diagram-packages.svg" width="120%" style="display: block; margin: auto;" /&gt;

]

---

# Workflows

The `workflows` package enables a handy type of object that can bundle pre-processing and models together. 

- You don’t have to keep track of separate objects in your workspace.

- The recipe prepping and model fitting can be executed using a single call to `fit()` instead of `prep()`-`juice()`-`fit()`.

- The recipe baking and model predictions are handled with a single call to `predict()` instead of `bake()`-`predict()`.

- Workflows _will_ be able to add post-processing operations in upcoming versions. An example of post-processing would be to modify (and tune) the probability cutoff for two-class models. 

---

# Workflows &lt;img src="images/recipes.png" class="title-hex"&gt;&lt;img src="images/parsnip.png" class="title-hex"&gt;


```r
ames_wfl &lt;- workflow() %&gt;% 
  add_recipe(ames_rec) %&gt;% 
  add_model(lm_mod)

ames_wfl
```

```
## ══ Workflow ═════════════════════════════════════════════════════════════════════════════════════════
## Preprocessor: Recipe
## Model: linear_reg()
## 
## ── Preprocessor ─────────────────────────────────────────────────────────────────────────────────────
## 6 Recipe Steps
## 
## ● step_log()
## ● step_BoxCox()
## ● step_other()
## ● step_dummy()
## ● step_interact()
## ● step_ns()
## 
## ── Model ────────────────────────────────────────────────────────────────────────────────────────────
## Linear Regression Model Specification (regression)
## 
## Computational engine: lm
```

---

# 1-Step fitting and predicting &lt;img src="images/parsnip.png" class="title-hex"&gt;


```r
# preps() `ames_train`, juice()s it, then fit()s model
ames_wfl_fit &lt;- fit(ames_wfl, ames_train)

# bake()s `ames_test`, then predict()s
predict(ames_wfl_fit, ames_test) %&gt;% slice(1:5)
```

```
## # A tibble: 5 x 1
##   .pred
##   &lt;dbl&gt;
## 1  5.32
## 2  5.10
## 3  5.14
## 4  5.42
## 5  5.32
```

 * `add_formula()` can be used as a simple alternative to `add_recipe()`. 
 
 * A _secondary formula_ can be given to `add_model()` that will be directly passed to the model. This is helpful when recipes are used with  GAMs or mixed-models.  
 
 * There are `remove_*()` and `update_*()` functions too. 

---

# A Quick Knowledge Check - Match tasks to packages

.pull-left[

* Fit a K-NN model
 
* Extract holidays from dates

* Make a training/test split
 
* Bundle a recipe and model

* Is high in vitamin A
 
* Compute R&lt;sup&gt;2&lt;/sup&gt;
 
* Bin a predictor (but seriously,...don't)

]

.pull-right[

* `workflows`
 
* `yardstick`
 
* `carrot`
 
* `recipes`
 
* `parsnip`
 
* `rsample`
 
* `ggvis`

]
    </textarea>
<style data-target="print-only">@media screen {.remark-slide-container{display:block;}.remark-slide-scaler{box-shadow:none;}}</style>
<script src="https://remarkjs.com/downloads/remark-latest.min.js"></script>
<script src="https://platform.twitter.com/widgets.js"></script>
<script>var slideshow = remark.create({
"highlightStyle": "solarized-light",
"highlightLanguage": "R",
"highlightLines": true,
"countIncrementalSlides": false,
"ratio": "16:9"
});
if (window.HTMLWidgets) slideshow.on('afterShowSlide', function (slide) {
  window.dispatchEvent(new Event('resize'));
});
(function(d) {
  var s = d.createElement("style"), r = d.querySelector(".remark-slide-scaler");
  if (!r) return;
  s.type = "text/css"; s.innerHTML = "@page {size: " + r.style.width + " " + r.style.height +"; }";
  d.head.appendChild(s);
})(document);

(function(d) {
  var el = d.getElementsByClassName("remark-slides-area");
  if (!el) return;
  var slide, slides = slideshow.getSlides(), els = el[0].children;
  for (var i = 1; i < slides.length; i++) {
    slide = slides[i];
    if (slide.properties.continued === "true" || slide.properties.count === "false") {
      els[i - 1].className += ' has-continuation';
    }
  }
  var s = d.createElement("style");
  s.type = "text/css"; s.innerHTML = "@media print { .has-continuation { display: none; } }";
  d.head.appendChild(s);
})(document);
// delete the temporary CSS (for displaying all slides initially) when the user
// starts to view slides
(function() {
  var deleted = false;
  slideshow.on('beforeShowSlide', function(slide) {
    if (deleted) return;
    var sheets = document.styleSheets, node;
    for (var i = 0; i < sheets.length; i++) {
      node = sheets[i].ownerNode;
      if (node.dataset["target"] !== "print-only") continue;
      node.parentNode.removeChild(node);
    }
    deleted = true;
  });
})();
// adds .remark-code-has-line-highlighted class to <pre> parent elements
// of code chunks containing highlighted lines with class .remark-code-line-highlighted
(function(d) {
  const hlines = d.querySelectorAll('.remark-code-line-highlighted');
  const preParents = [];
  const findPreParent = function(line, p = 0) {
    if (p > 1) return null; // traverse up no further than grandparent
    const el = line.parentElement;
    return el.tagName === "PRE" ? el : findPreParent(el, ++p);
  };

  for (let line of hlines) {
    let pre = findPreParent(line);
    if (pre && !preParents.includes(pre)) preParents.push(pre);
  }
  preParents.forEach(p => p.classList.add("remark-code-has-line-highlighted"));
})(document);</script>

<script>
(function() {
  var links = document.getElementsByTagName('a');
  for (var i = 0; i < links.length; i++) {
    if (/^(https?:)?\/\//.test(links[i].getAttribute('href'))) {
      links[i].target = '_blank';
    }
  }
})();
</script>

<script>
slideshow._releaseMath = function(el) {
  var i, text, code, codes = el.getElementsByTagName('code');
  for (i = 0; i < codes.length;) {
    code = codes[i];
    if (code.parentNode.tagName !== 'PRE' && code.childElementCount === 0) {
      text = code.textContent;
      if (/^\\\((.|\s)+\\\)$/.test(text) || /^\\\[(.|\s)+\\\]$/.test(text) ||
          /^\$\$(.|\s)+\$\$$/.test(text) ||
          /^\\begin\{([^}]+)\}(.|\s)+\\end\{[^}]+\}$/.test(text)) {
        code.outerHTML = code.innerHTML;  // remove <code></code>
        continue;
      }
    }
    i++;
  }
};
slideshow._releaseMath(document);
</script>
<!-- dynamically load mathjax for compatibility with self-contained -->
<script>
(function () {
  var script = document.createElement('script');
  script.type = 'text/javascript';
  script.src  = 'https://mathjax.rstudio.com/latest/MathJax.js?config=TeX-MML-AM_CHTML';
  if (location.protocol !== 'file:' && /^https?:/.test(script.src))
    script.src  = script.src.replace(/^https?:/, '');
  document.getElementsByTagName('head')[0].appendChild(script);
})();
</script>
  </body>
</html>