# The case for GAMs
```{r setupCase, include=FALSE}
# knitr::opts_chunk$set(cache.rebuild=F, cache = T)
```
## Why not just use standard methods?
We have seen that GAMs generalize our typical approaches, but can they really help? The standard linear model is ubiquitous in statistical training and application, and for good reason: it is simple to fit and easy to understand. Let's start by fitting one to some simulated data.
```{r demolmdat, echo=F}
set.seed(123)
library(mgcv)
# simulate data with known nonlinear effects (mgcv's Gu & Wahba 4-term example)
dat = gamSim(
1,
n = 400,
dist = "normal",
scale = 1,
verbose = FALSE
)
b = gam(y ~ s(x0) + s(x1) + s(x2) + s(x3), data = dat)
```
```{r demolm, eval=FALSE}
mod = lm(y ~ x1 + x2, data = dat)
summary(mod)
```
```{r lmsummary_clean, echo=FALSE, fig.align='center'}
mod = lm(y ~ x1 + x2, data = dat)
# tidy(mod) %>%
# mutate_if(is.numeric, round, digits=2) %>%
# pander(justify='lcccc')
# glance(mod) %>%
# select(r.squared) %>%
# round(2) %>%
# pander(justify='c')
pander::pander(summary(mod),
justify = 'lrrrr',
caption = '',
round = 1)
```
<br>
Everything is nice and tidy. We have straightforward information: a positive effect of `x1`, a negative effect of `x2`, and familiar output. Depending on your context, the R^2^ may or may not be something exciting. Let's look at some diagnostics[^basergraphics].
```{r lmdiag, echo=FALSE, fig.asp=.75}
# total defeat of base R graphics
par(mfrow = c(1, 2))
plot(
mod,
which = 1,
bty = 'n',
pch = 19,
col = scales::alpha('#D55E00', .1),
col.smooth = '#56B4E9',
lwd = 2,
cex = 1.25,
col.lab = 'gray25',
col.axis = 'gray50',
col.sub = 'gray50',
cex.caption = 1,
cex.oma.main = 1.25,
yaxt = 'n'
)
axis(2, col = 'gray75', col.axis = 'gray33')
axis(1, col = 'gray75', col.axis = 'gray33')
title(col.caption = .2)
plot(
mod,
which = 2,
bty = 'n',
pch = 19,
col = scales::alpha('#D55E00', .1),
col.smooth = '#56B4E9',
lwd = 2,
cex = 1.25,
col.lab = 'gray25',
col.axis = 'gray75',
col.sub = 'gray50',
cex.caption = 1,
cex.oma.main = 1.25,
yaxt = 'n'
)
axis(2, col = 'gray75', col.axis = 'gray33')
axis(1, col = 'gray75', col.axis = 'gray33')
# axis(2, col='red')
graphics::layout(1)
```
Some issues might be present: we appear to get a little more variance at some, especially higher, fitted values, and we're a little loose in the tails of the residual distribution. Let's also compare our predictions to the data. With a strong model, we would see a cigar-shaped cloud converging to a line with slope 1 as the fit gets better. We seem to be having some issues here, as the previous residual plot already indicated.
```{r plotlmfit, echo=FALSE, fig.asp=.75}
augment(mod) %>%
ggplot(aes(x = .fitted, y = y)) +
geom_point(alpha = .25, color = '#D55E00') +
geom_smooth(se = FALSE, color = '#56B4E9') +
labs(x = 'Fitted value', title = 'Fitted vs. Observed') +
hrbrthemes::theme_ipsum_rc() +
theme(title = element_text(colour = '#585858'))
```
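If you prefer a numeric complement to eyeballing these plots, standard tests can corroborate the visual impression. A minimal sketch, assuming the lmtest package is available (it isn't otherwise used here):

```{r lmdiagtests, eval=FALSE}
# Sketch: formal checks to accompany the visual diagnostics
lmtest::bptest(mod)           # Breusch-Pagan test for non-constant variance
shapiro.test(residuals(mod))  # Shapiro-Wilk test of residual normality
```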
Now let's go back and visualize the data. The following plots show each feature against the target variable.
```{r covariatePlot, echo=FALSE, fig.asp=.75}
dat %>%
select(x1, x2, y) %>%
pivot_longer(-y, names_to = 'variable', values_to = 'Feature') %>%
ggplot(aes(x = Feature, y = y)) +
geom_point(alpha = .25, color = '#D55E00') +
geom_smooth(aes(), color = '#56B4E9', se = F) +
facet_grid( ~ variable) +
labs(title = 'Features vs. Y') +
hrbrthemes::theme_ipsum_rc() +
theme(
legend.title = element_blank(),
legend.background = element_blank(),
legend.key = element_blank(),
title = element_text(colour = '#585858')
)
```
Yikes. We certainly have a positive effect of `x1`, but it looks rather curvy. The relationship for `x2` can't be classified so easily: it is sharply positive for low values, but negative thereafter.
## Heteroscedasticity, non-normality etc.
In many cases like the above, people have standby methods for dealing with the problem. For example, they might see the qq-plot for the residuals, deem some of those cases 'outliers', and perhaps even drop them from the analysis. Others might try a transformation of the target variable; for example, in cases of heteroscedasticity (not because of non-normality!) some might take the log.
```{r loglm, eval=FALSE}
modlog = lm(log(y) ~ x1 + x2, dat)
summary(modlog)
```
```{r loglmsummary_clean, echo=FALSE}
modlog = lm(log(y) ~ x1 + x2, dat)
summary(modlog) %>%
pander::pander(justify = 'lrrrr',
caption = '',
round = 2)
```
<br>
Well, our fit in terms of R^2^ has actually gone down. Let's check the diagnostics.
```{r loglmdiag, echo=FALSE, fig.asp=.75}
par(mfrow = c(1, 2))
plot(
modlog,
which = 1,
bty = 'n',
pch = 19,
col = scales::alpha('#D55E00', .1),
col.smooth = '#56B4E9',
lwd = 2,
cex = 1.25,
col.lab = 'gray25',
col.axis = 'gray50',
col.sub = 'gray50',
cex.caption = 1,
cex.oma.main = 1.25,
yaxt = 'n'
)
axis(2, col = 'gray75', col.axis = 'gray33')
axis(1, col = 'gray75', col.axis = 'gray33')
title(col.caption = .2)
plot(
modlog,
which = 2,
bty = 'n',
pch = 19,
col = scales::alpha('#D55E00', .1),
col.smooth = '#56B4E9',
lwd = 2,
cex = 1.25,
col.lab = 'gray25',
col.axis = 'gray75',
col.sub = 'gray50',
cex.caption = 1,
cex.oma.main = 1.25,
yaxt = 'n'
)
axis(2, col = 'gray75', col.axis = 'gray33')
axis(1, col = 'gray75', col.axis = 'gray33')
graphics::layout(1)
```
The transformation may have helped in some ways, but made other things worse.
```{r plotloglmfit, echo=FALSE, fig.asp=.75}
augment(modlog) %>%
ggplot(aes(x = .fitted, y = `log(y)`)) +
geom_point(alpha = .25, color = '#D55E00') +
geom_smooth(se = F, color = '#56B4E9') +
labs(x = 'Fitted value', y = 'log(y)', title = 'Fitted vs. Observed') +
hrbrthemes::theme_ipsum_rc() +
theme(title = element_text(colour = '#585858'))
```
We continue to see some poorly fitting cases, and now our fit is flattening even more than it was.
This is a fairly typical result. Transformations often exacerbate data issues or fail to help. What's more, some of them lead to more difficult interpretation, or aren't even applicable (e.g. categorical, ordinal targets).
Outliers, if there were actually a standard for deeming something as such, are just indications that your model doesn't capture the data generating process in some fashion. Cutting data out of the modeling process for that reason hasn't been acceptable for a long time (if it ever was).
Data abounds where a standard linear model performs poorly or fails to capture the nuances of the data: there may be nonlinear relationships as above, dependency in the observations, known non-Gaussian targets, and so on. One should be prepared to use models better suited to the situation, rather than torturing the data to fit a simplified modeling scheme.
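To make the non-Gaussian point concrete: a count target, for instance, is better served by a model with an appropriate likelihood than by a transformed response. A minimal sketch with made-up data (the `count_dat` object is purely illustrative and plays no further role here):

```{r countSketch, eval=FALSE}
# Illustration only: simulate a count outcome and model it directly with a
# Poisson likelihood rather than log-transforming the response
set.seed(123)
count_dat = data.frame(x = runif(200))
count_dat$y = rpois(200, lambda = exp(1 + count_dat$x))

glm(y ~ x, family = poisson, data = count_dat)
```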
## Polynomial Regression
A common approach for dealing with nonlinear relationships in a regression setting is *polynomial regression*. For the feature in question, $x$, we add terms, e.g. quadratic ($x^2$), cubic ($x^3$), and so on, to get a better fit. In an R formula such terms are typically written with `poly()` or `I()`, as sketched below.
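This is only an illustration of the syntax, using a made-up `toy` data frame that plays no further role here; `poly()` uses an orthogonal polynomial basis by default, while `I(x^2)` adds the raw squared term, and both give the same fitted values.

```{r polySyntaxSketch, eval=FALSE}
# Two equivalent ways to include a quadratic term (illustration only)
toy = data.frame(x = rnorm(100))
toy$y = 1 + toy$x - toy$x^2 + rnorm(100)

lm(y ~ poly(x, 2), data = toy)   # orthogonal polynomial basis
lm(y ~ x + I(x^2), data = toy)   # raw terms; identical fitted values
```

With that in mind, consider the following data situation.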
```{r datasetup, echo=FALSE}
set.seed(123)
x = rnorm(1000)
y = x - x ^ 2 + .25 * x ^ 3 + rnorm(1000)
poly_dat = data.frame(x, y)
poly_dat %>%
plot_ly() %>%
add_markers( ~ x, ~ y, marker = list(color = '#D55E00', opacity = .25)) %>%
config(displayModeBar = F) %>%
theme_plotly()
```
Let's fit a quadratic term and note the result.
```{r polymod, echo=1}
mod_poly  = lm(y ~ poly(x, 2), data = poly_dat)
mod_poly3 = lm(y ~ poly(x, 3), data = poly_dat)
poly_dat %>%
add_predictions(mod_poly) %>%
ggplot(aes(x, y)) +
geom_point() +
geom_line(aes(y = pred), size = 1, alpha = 1) +
labs()
```
The R^2^ for this is `r broom::glance(mod_poly)$r.squared %>% round(2)`, that's great! But look closely and you might notice that we aren't capturing the lower tail of this target at all, and aren't doing so great for observations that are very high on both variables. Here's what the data and fit would look like if we extend the data based on the underlying true function, and things only get worse.
```{r poly_ack, echo=FALSE}
x2 = c(runif(250, -5, -2), runif(250, 2, 5))
y2 = x2 - x2 ^ 2 + .25 * x2 ^ 3 + rnorm(500)
mod_gam = gam(y ~ s(x), data = poly_dat)
newdat = rbind(poly_dat, data.frame(x = x2, y = y2))
newdat %>%
add_predictions(mod_poly) %>%
ggplot(aes(x, y)) +
geom_point() +
geom_line(aes(y = pred), size = 1, alpha = 1) +
labs()
```
<br>
Part of the reason is that, outside of deterministic relationships due to known physical or other causes, you are unlikely to discover a truly quadratic relationship between variables in found data. Even when it appears to fit, without a lot of data the model is almost certainly *overfit* for this very reason. Fitting a polynomial is more akin to enforcing our vision of how the data should be, rather than letting the data speak for itself. Sure, it might be a good approximation some of the time, just as assuming a linear relationship is, but often it's just wishful thinking.
Compare the previous result to the following fit from a generalized additive model. GAMs are susceptible to extrapolation, as is every statistical model ever created. However, the original fit (in red) is much better. Notice how it was better able to follow the straightened-out data points at the high end, rather than continuing the bend that the quadratic approach enforced.
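For reference, the GAM in this comparison is fit behind the scenes as nothing more than a single smooth of `x`:

```{r gamFitShown, eval=FALSE}
# The GAM used for the comparison: a single penalized smooth of x
mod_gam = gam(y ~ s(x), data = poly_dat)
```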
```{r polyplusgam, echo=FALSE}
# newdat %>%
# add_predictions(mod_poly) %>%
# plot_ly() %>%
# add_markers( ~ x,
# ~ y,
# marker = list(color = '#D55E00', opacity = .1),
# showlegend = F) %>%
# add_lines( ~ x, ~ pred, name = 'poly', line = list(color = '#56B4E9')) %>%
# add_lines(
# ~ x,
# ~ pred,
# name = 'gam',
# line = list(color = '#a703ff'),
# data = add_predictions(newdat, mod_gam) %>% arrange(x)
# ) %>%
# add_lines(
# ~ x,
# ~ pred,
# name = 'gam',
# line = list(color = '#ff1803'),
# data = add_predictions(poly_dat, mod_gam) %>% arrange(x)
# ) %>%
# config(displayModeBar = F) %>%
# theme_plotly()
newdat %>%
add_predictions(mod_poly) %>%
ggplot(aes(x, y)) +
geom_point(alpha = .05) +
geom_line(
aes(y = pred),
size = 1,
alpha = 1,
data = . %>% add_predictions(mod_poly)
) +
annotate(x = -4, y = -10, geom = 'text', label = 'Polynomial', color = okabe_ito[2]) +
geom_line(
aes(y = pred),
color = okabe_ito[3],
size = 1,
alpha = 1,
data = . %>% add_predictions(mod_gam)
) +
geom_line(
aes(y = pred),
color = okabe_ito[6],
size = 1,
alpha = 1,
data = poly_dat %>% add_predictions(mod_gam)
) +
annotate(x = 0, y = -10, geom = 'text', label = 'GAM', color = okabe_ito[6]) +
labs()
```
### A more complex relationship
Perhaps you would have been satisfied with the initial quadratic fit above, or perhaps with a cubic fit[^cubic]. But we may come across a situation where the target of interest $y$ is a function of some covariate $x$ whose effect is not straightforward at all. Consider the following functional form for $x$:
$$y = f(x) + \epsilon = \sin(2(4x - 2)) + 2e^{-(16^2)(x - .5)^2} + \epsilon$$
$$\epsilon \sim N(0, .3^2)$$
Let's generate some data and take a look at it visually.
```{r simData}
set.seed(123)
x = runif(500)
mu = sin(2 * (4 * x - 2)) + 2 * exp(-(16 ^ 2) * ((x - .5) ^ 2))
y = rnorm(500, mu, .3)
d = data.frame(x, y)
```
```{r simDataPlot, echo=F}
# plot_ly(data = d) %>%
# add_markers( ~ x, ~ y, marker = list(color = '#D55E00', opacity = .5)) %>%
# theme_plotly() %>%
# config(displayModeBar = F) %>%
# layout()
d %>%
ggplot(aes(x, y)) +
geom_point()
```
#### Polynomial regression is problematic
A standard linear regression is definitely not going to capture this relationship. As above, we could try polynomial regression here, e.g. fitting a quadratic or cubic function within the standard regression framework. However, this is unrealistic at best and, at worst, isn't useful for complex relationships. In the following, even with a polynomial of degree 15 the fit is fairly poor in many areas, and 'wiggles' in some places where there doesn't appear to be a need to. You can (double) click the legend to isolate specific fits.
```{r polyreg, echo=FALSE}
fits = map_df(
seq(3, 15, 3),
function(p)
tibble(x = x, y = y, polynomial = p, fits = fitted(lm(y ~ poly(x, p))))
) %>%
mutate(polynomial = factor(polynomial, labels = seq(3, 15, 3)))
# keep plotly here
plot_ly(data = d) %>%
add_markers( ~ x,
~ y,
marker = list(color = '#D55E00', opacity = .2),
showlegend = F) %>%
add_lines( ~ x, ~ fits, color = ~ polynomial, data = fits) %>%
config(displayModeBar = F) %>%
theme_blank()
```
The same would hold true for other approaches that require the functional form to be specified (e.g. so-called logistic growth curve models). It's probably also obvious that a target transformation (e.g. log) isn't going to help us in this case either. In general, we'll need tools better suited to more complex relationships, or simply ones that don't require us to overfit or oversimplify the relationship we see, or to guess at the functional form until we stumble on a decent fit.
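As a preview of such a tool, the GAM alternative for these data requires no decision about polynomial degree at all; a one-line sketch (results left for later):

```{r gamPreviewSketch, eval=FALSE}
# Sketch: the same data, fit with a penalized smooth instead of a fixed-degree polynomial
gam(y ~ s(x), data = d)
```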
[^cubic]: The example was actually generated with a cubic polynomial.
[^basergraphics]: In case you are wondering, yes, these diagnostic plots are in fact base R graphics, and they still can look good with some work.