-
Notifications
You must be signed in to change notification settings - Fork 36
/
Copy pathintroduction.Rmd
275 lines (186 loc) · 11.7 KB
/
introduction.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
# (PART\*) Part I: Concepts {-}
```{r setupIntro, include=FALSE}
# knitr::opts_chunk$set(cache.rebuild=F, cache = T)
```
# Introduction
## Beyond the General Linear Model I
### General Linear Model
Let's get started by considering the *standard linear regression model* (SLiM) estimated via ordinary least squares (OLS). We have some response or target variable we wish to study, and believe it to be some function of other variables. In terms of the underlying data generating process, $y$ is the variable of interest, assumed to be normally distributed with mean $\mu$ and variance $\sigma^2$, and the Xs are the features/covariates in this scenario. The features are multiplied by the weights/coefficients ($\beta$) and summed, giving us the linear predictor, which in this case also directly provides us the estimated fitted values. The following is a variant of the way this type of model is presented in introductory texts.
$$y\sim \mathcal{N}(\mu,\sigma^2)$$
$$\mu = b_0+b_1\cdot x_1+b_2\cdot x_2\;...\;+ b_p\cdot x_p$$
Here is an example of how the R code would look like for such a model.
```{r examplelmcode, eval=F}
mymod = lm(y ~ x1 + x2, data = mydata)
```
One of the issues with this model is that, in its basic form it can be very limiting in its assumptions about the data generating process for the variable we want to study. It also very typically will not capture what is going on in a great many data situations.
### Generalized Linear Model
In that light, we may consider the *generalized linear model*. Generalized linear models incorporate other types of distributions[^expofam], and include a link function $g(.)$ relating the mean $\mu$, or stated differently, the expected values $E(y)$, to the linear predictor $X\beta$, often denoted $\eta$. So the general form is:
$$g(\mu) = \eta = X\beta$$
$$E(y) = \mu = g^{-1}(\eta)$$
Consider again the typical linear regression. In that situation, we assume a Gaussian (i.e. normal) distribution for the response, we assume equal variance for all observations, and that there is a direct link of the linear predictor and the expected value $\mu$, i.e. $\mu = X\beta$. As such, the typical linear regression model is a generalized linear model with a Gaussian distribution and 'identity' link function.
To further illustrate the generalization, we consider a distribution other than the Gaussian. Here we'll examine a Poisson distribution for some vector of count data. There is only one parameter to be considered, $\mu$, since for the Poisson the mean and variance are equal. For the Poisson, the (canonical) link function $g(.)$, is the natural log, and so relates the log of $\mu$ to the linear predictor. As such we could also write it in terms of exponentiating the right-hand side.
$$y \sim \mathcal{P}(\mu)$$
$$\ln(\mu) = b_0+b_1\cdot x_1+b_2\cdot x_2\;...\;+ b_p\cdot x_p$$
$$\mu = e^{b_0+b_1\cdot x_1+b_2\cdot x_2\;...\;+b_p\cdot x_p}$$
While there is a great deal to further explore with regard to generalized linear models, the point here is to simply to note them as a generalization of the typical linear model that we would all be familiar after an introductory course in modeling/statistics. As we eventually move to generalized additive models, we can see them as a subsequent step in the generalization.
### Generalized Additive Model
Now let us make another generalization to incorporate nonlinear forms of the features, via a *generalized additive model*. This form gives the new setup relating our new, now nonlinear predictor to the expected value, with whatever link function may be appropriate.
$$
y \sim ExpoFam(\mu, etc.) \\
E(y) = \mu \\
g(\mu) = b_0 + f(x_1) + f(x_2) \;...\;+f(x_p)
$$
So what's the difference? In short, we are using smooth functions of our feature variables, which can take on a great many forms, with more detail on what that means in the following section. For now, it is enough to note the observed values $y$ are assumed to be of some exponential family distribution, and $\mu$ is still related to the model predictors via a link function. The key difference is that the linear predictor now incorporates smooth functions of at least some (possibly all) features, represented as $f(x)$, and this will allow for nonlinear relationships between the features and the target variable $y$.
## Beyond the General Linear Model II
### Fitting the Standard Linear Model
In many data situations, the relationship between some covariate(s) $X$ and response $y$ is not so straightforward. Consider the following plot. Attempting to fit a standard OLS regression results in the blue line, which doesn't capture the relationship as well as we would like.
```{r fitPlotsDataSetup, echo=F}
set.seed(666)
x = rgamma(1000, 2)
y = x + x ^ 2 + rnorm(100)
dxy = data.frame(x, y) %>% arrange(x)
fitlm = lm(y ~ x, data = dxy)$fitted
fitx2 = lm(y ~ poly(x, 2), data = dxy)$fitted
#loess
marker = which(round(sort(x), 2) == 3)
neighborhood = (marker - 100):(marker + 100)
neighbormod = lm(y ~ x, dxy[neighborhood, ])
neighborfit = fitted(neighbormod)
# neighborfit[names(neighborfit)==paste(marker)]
fitat3 = predict(neighbormod, newdat = data.frame(x = 3, y = NA))
```
```{r linearFitPlot, echo=F}
dxy %>%
ggplot(aes(x,y)) +
geom_point() +
geom_line(aes(y = fitlm), size = 1)
```
### Polynomial Regression
One common approach we could undertake is to add a transformation of the predictor $X$, and in this case we might consider a quadratic term such that our model looks something like the following.
$$
y \sim \mathcal{N}(\mu,\sigma^2)\\
\mu = b_0 + b_1\cdot x_1+b_2\cdot x_1^2
$$
And here is how the model would fit the data.
```{r quadraticFitPlot, echo=F}
dxy %>%
ggplot(aes(x,y)) +
geom_point() +
geom_line(aes(y = fitx2), size = 1)
```
<br>
We haven't really moved from the standard linear model in this case, but we have a much better fit to the data as evidenced by the graph.
### Scatterplot Smoothing
There are still other possible routes we could take. Many are probably familiar with *loess* (or lowess) regression, if only in the sense that often statistical packages may, either by default or with relative ease, add a nonparametric fit line to a scatterplot. By default, this is typically *lowess*, or *locally weighted scatterplot smoothing*.
Take a look at the following figure. For every (sorted) $x_0$ value, we choose a neighborhood around it and, for example, fit a simple regression. As an example, let's look at $x_0$ = 3.0, and choose a neighborhood of 100 X values below and 100 values above.
```{r loessgraph1, echo=FALSE}
# because plotly can't do calculations on the fly for some things
minx = min(sort(x)[neighborhood])
maxx = max(sort(x)[neighborhood])
maxy = max(y) + 2
dxy %>%
ggplot(aes(x, y)) +
geom_point(color = 'gray', alpha = .25) +
geom_vline(aes(xintercept = x), data = . %>% slice(neighborhood) %>% slice(1)) +
geom_vline(aes(xintercept = x), data = . %>% slice(neighborhood) %>% slice_tail(n =1)) +
geom_point(color = '#B2001D', data = . %>% slice(neighborhood))
```
<br>
Now, for just that range we fit a linear regression model and note the fitted value where $x_0=3.0$.
```{r loessgraph2, echo=FALSE}
dxy %>%
slice(neighborhood) %>%
ggplot(aes(x, y)) +
geom_point(color = '#B2001D', alpha = .25) +
geom_smooth(method = 'lm', se = FALSE, formula = y ~ x)
```
<br>
If we now do this for every $x_0$ value, we'll have fitted values based on a rolling neighborhood that is relative to each value being considered. Other options we could throw into this process, and usually do, would be to fit a polynomial regression for each neighborhood, and weight observations the closer they are to the value, with less weight given to more distant observations.
```{r loessgraph3, echo=FALSE, cache=FALSE}
dxy %>%
ggplot(aes(x, y)) +
geom_point(alpha = .25) +
geom_smooth(method = 'lm', se = FALSE, formula = y ~ x) +
annotate(x = 10, y = 50, geom = 'text', label = 'lm', color = okabe_ito[2]) +
geom_smooth(method = 'lm', se = FALSE, formula = y ~ x) +
geom_smooth(method = 'loess', color = okabe_ito[3], se = FALSE, formula = y ~ x) +
annotate(x = 8, y = 90, geom = 'text', label = 'loess', color = okabe_ito[3])
```
<br>
The above plot shows the result from such a fitting process. For comparison, the regular regression fit is also provided. Even without using a lowess approach, we could have fit a model assuming a quadratic relationship, $y = x + x^2$, and it would result in a far better fit than the simpler model[^yquad]. While polynomial regression might suffice in some cases, the issue is that nonlinear relationships are generally not specified so easily, as we'll see next[^popcrash].
```{r mcycleDataSetup, echo=FALSE}
library(splines)
data(mcycle, package = 'MASS')
attach(mcycle)
polyfit = fitted(lm(accel ~ poly(times, 3)))
nsfit = fitted(lm(accel ~ ns(times, df = 5)))
ssfit = fitted(smooth.spline(times, accel))
lowfit = lowess(times, accel, .2)$y
ksfit = ksmooth(times,
accel,
'normal',
bandwidth = 5,
x.points = times)$y #locpoly(times, accel, kernel='normal', bandwidth=10)
gamfit = fitted(mgcv::gam(accel ~ s(times, bs = "cs"))) #original drops duplicates
mcycle2 = data.frame(mcycle, polyfit, nsfit, ssfit, lowfit, ksfit, gamfit)
detach('mcycle')
mcycle3 =
mcycle2 %>%
pivot_longer(-c(times, accel), names_to = 'variable') %>%
mutate(
variable = factor(
variable,
labels = c(
'GAM',
'Kernel Smoother',
'Lowess',
'Natural Splines',
'Polynomial Regression (cubic)',
'Smoothing Splines'
)
),
variable = fct_relevel(
variable,
'Polynomial Regression (cubic)',
'Natural Splines',
'Kernel Smoother',
'Lowess',
'Smoothing Splines'
)
)
```
### Generalized Additive Models
The next figure regards a data set giving a series of measurements of head acceleration in a simulated motorcycle accident[^mcycle]. Time is in milliseconds, acceleration in g. Here we have data that are probably not going to be captured with simple transformations of the predictors. We can compare various approaches, and the first is a straightforward cubic polynomial regression, which unfortunately doesn't help us much. We could try higher orders, which would help, but in the end we will need something more flexible, and generalized additive models can help us in such cases.
```{r mcycleFigure, fig.asp = .75, echo=FALSE}
mcycleFigure = ggplot(aes(x = times, y = accel), data = mcycle3) +
geom_point(color = '#D55E00',
alpha = .25,
size = 1) +
geom_line(aes(y = value), color = '#56B4E9', size = .75) +
facet_wrap( ~ variable) +
ylab('Acceleration') +
xlab('Time') +
theme(plot.background = element_blank())
mcycleFigure
# ggplotly(mcycleFigure) %>% # unlike plot_ly, ggplotly can't handle % width, but also just randomly disappears
# theme_plotly()
```
## Summary
Let's summarize our efforts up to this point.
- Standard Linear Model
$$y\sim \mathcal{N}(\mu, \sigma^2)$$
$$\mu = b_0+b_1\cdot x_1$$
- Polynomial Regression
$$y\sim \mathcal{N}(\mu, \sigma^2)$$
$$\mu = b_0+b_1\cdot x_1+b_2\cdot x^2$$
- GLM formulation
$$y\sim \mathcal{N}(\mu, \sigma^2)$$
$$g(\mu) = b_0+b_1\cdot x_1+b_2\cdot x_2$$
- GAM formulation
$$y\sim \mathcal{N}(\mu, \sigma^{2})$$
$$g(\mu) = f(X)$$
Now that some of the mystery will hopefully have been removed, let's take a look at GAMs in action.
[^expofam]: Of the [exponential family](https://en.wikipedia.org/wiki/Exponential_family).
[^yquad]: $y$ was in fact constructed as $x + x^2 +$ noise.
[^popcrash]: Bolker 2008 notes an example of fitting a polynomial to 1970s U.S. Census data that would have predicted a population crash to zero by 2015.
[^mcycle]: This is a version of that found in @venables_modern_2002.