-
Notifications
You must be signed in to change notification settings - Fork 13
/
Copy pathlinear_model_extensions.qmd
1756 lines (1252 loc) · 85.3 KB
/
linear_model_extensions.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
# Extending the Linear Model {#sec-lm-extend}
![](img/chapter_gp_plots/gp_plot_5.svg){width=75% fig-align='center'}
```{r}
#| include: false
#| label: r-data-import
df_reviews = read_csv('data/movie_reviews.csv')
df_happiness = read_csv('data/world_happiness_2018.csv')
df_happiness_all = read_csv('data/world_happiness_all_years.csv')
model_baseline = lm(rating ~ children_in_home + genre, data = df_reviews)
model_interaction = lm(rating ~ children_in_home * genre, data = df_reviews)
```
```{python}
#| include: false
#| label: py-data-import
import pandas as pd
import numpy as np
import statsmodels.formula.api as smf
np.set_printoptions(formatter={'float': lambda x: '{0:0.4f}'.format(x)})
pd.set_option('display.float_format', lambda x: '%.3f' % x)
df_reviews = pd.read_csv('data/movie_reviews.csv')
df_happiness = pd.read_csv('data/world_happiness_2018.csv')
```
With just linear and generalized linear models, we have a very solid foundation for modeling, and we've seen how there is a notable amount we can do with a conceptually simple approach. We've also seen how we can extend the linear model to handle different types of target distributions to help us understand and make some inferences about the relationships between our features and target.
In this chapter, we will extend our linear models with additional common and valuable modeling tools. These methods provide good examples of how we can think about our data and modeling approach in different ways, and they can serve as a foundation for exploring more advanced techniques in the future. A thread that binds these techniques together is the ability to use a linear model to investigate explicitly nonlinear relationships!
## Key Ideas {#sec-lm-extend-key-ideas}
- The standard and generalized linear models are great and powerful starting points for modeling, but there's even more we can do!
- Linear models can be used to model nonlinear feature-target relationships!
- While these seem like different approaches, we can still use our linear model concepts and approach at the core, take similar estimation steps, and even have similar interpretation. However, we'll have even more results to explore and interpret.
### Why this matters {#sec-lm-extend-why}
The linear model is a great starting point for modeling. It is a simple approach that can be used to model a wide variety of relationships between features and targets, and it's also a great way to get a feel for how to think about modeling. But linear and generalized models are just the beginning, and the models depicted here are common extensions used in a variety of disciplines and industries. More generally, the following techniques allow for nonlinear relationships while still employing a linear model approach. This is a very powerful tool to have in your toolkit, and it's a great way to start thinking about how to model more complex relationships in your data.
### Helpful context {#sec-lm-extend-good}
While these models are extensions of the linear model, they are not significantly more complicated in terms of how they are implemented or how they are interpreted. However, like anything new, it can take a bit more effort to understand. You likely want to be comfortable with standard linear models at least before you start to explore these extensions.
## Interactions {#sec-lm-extend-interactions}
Things can be quite complex in a typical model with multiple features, but just adding features may not be enough to capture the complexity of the relationships between features and target. Sometimes, we need to consider how features interact with each other to better understand how they correlate with the target. A common way to add complexity in linear models is through **interactions**. This is where we allow the effect of a feature to vary depending on the values of another feature, or even itself!
As a conceptual example, we can think about a movie's rating being different for movies from different genres. For example, maybe by default ratings are higher for kids movies, and lower for horror movies. But, genre and season might work together in some way to affect rating, e.g., action movies get higher ratings in summer. Or maybe having kids in the home might also interact with genre ratings by naturally resulting in higher ratings for kids movies. As a different example, we might also consider that the length of a movie might positively relate to rating, but plateau or even have a negative effect on rating after a certain point. In other words, it would have a **curvilinear** effect where really long movies aren't as highly rated as those of shorter length.
All of these are types of interactions we can explore. Interactions allow us to incorporate nonlinear relationships into the model, and so greatly extend the linear model's capabilities - we basically get to use a linear model in a nonlinear way!
With that in mind, let's explore how we can add interactions to our models. Going with one of our examples, let's see how having kids impacts the relationship between genre and rating. We'll start with a standard linear model, and then add an interaction term. Using a formula approach makes it very straightforward to add an interaction term. We just need to add a `:` between the two features we want to interact, or a `*` to denote both main effects and the interaction. As elsewhere, we present simplified results in the next table.
:::{.panel-tabset}
##### R
```{r}
#| label: model-interaction-r
#| output: false
df_reviews = read_csv('https://tinyurl.com/moviereviewsdata')
model_baseline = lm(rating ~ children_in_home + genre, data = df_reviews)
model_interaction = lm(rating ~ children_in_home * genre, data = df_reviews)
summary(model_interaction)
```
##### Python
```{python}
#| echo: true
#| label: model-interaction-py
#| eval: false
import pandas as pd
import statsmodels.formula.api as smf
df_reviews = pd.read_csv('https://tinyurl.com/moviereviewsdata')
model_baseline = smf.ols(
formula = 'rating ~ children_in_home + genre',
data = df_reviews
).fit()
model_interaction = smf.ols(
formula = 'rating ~ children_in_home * genre',
data = df_reviews
).fit()
model_interaction.summary()
```
:::
```{r}
#| echo: false
#| label: model-interaction-get-reference-group
ref = model_baseline$xlevels$genre[1]
```
Here is a quick look at the model output for the interaction vs. no interaction interaction model. Starting with the baseline model, the coefficients look like what we've seen before, but we have several coefficients for genre. The reason is that genre is composed of several categories, and converted to a set of dummy variables (refer to @sec-lm-categorical-features and @sec-data-cat). In the baseline model, the intercept tells us what the mean is for the reference group, in this case `r ref`, and the genre coefficients tell us the difference between the mean for that genre and the reference genre. For example, the mean rating for `r ref` is `r round(coef(model_baseline)['(Intercept)'], 2)`, and the difference between that genre rating for the drama genre is `r round(coef(model_baseline)['genreDrama'], 2)`. Adding the two gives us the mean for drama movies `r round(coef(model_baseline)['(Intercept)'], 2)` + `r round(coef(model_baseline)['genreDrama'], 2)` = `r round(coef(model_baseline)['(Intercept)'] + coef(model_baseline)['genreDrama'], 2)`. We also have the coefficient for the number of children in the home, and this does not vary by genre in the baseline model.
```{r}
#| echo: false
#| label: tbl-model-interaction-output
#| tbl-cap: Model Coefficients with Interaction
library(broom)
tidy_base = tidy(model_baseline) |>
rename(
feature = term,
coef = estimate,
se = std.error,
t = statistic,
p = p.value
)
tidy_interaction = tidy(model_interaction) |>
rename(
feature = term,
coef = estimate,
se = std.error,
t = statistic,
p = p.value
)
nchildren_coef = round(coef(model_interaction)['children_in_home'], 2)
sign_nchildren_coef = sign(nchildren_coef)
genre_kids_coef = round(coef(model_interaction)['genreKids'], 2)
nkids_comedy_coef = round(coef(model_interaction)['children_in_home:genreComedy'], 2)
nkids_kids_coef = round(coef(model_interaction)['children_in_home:genreKids'], 2)
sign_nkids_kids_coef = sign(nkids_kids_coef)
# mean(df_reviews$children_in_home == 0)
# mean(df_reviews$genre == ref)
# Create a gt table from the tidy coefficients
# COLOR CODING WON'T WORK IN LATEX BECAUSE IT SUCKS
tidy_base |>
full_join(tidy_interaction, by = 'feature', suffix = c('_base', '_inter')) |>
select(feature, starts_with('coef')) |>
gt(decimals = 3) |>
sub_missing(
missing_text = "",
) |>
# gt_highlight_rows(
# columns = vars(coef_base, coef_inter),
# rows = str_detect(feature, 'children_in_home$|children.*Kids'),
# fill = okabe_ito[5],
# alpha = .1
# ) |>
tab_caption('Model coefficients with and without an interaction')
```
But in our other model we have an interaction between two features: 'children in the home' and 'genre'. So let's start with the coefficient for children. It is `r nchildren_coef`, which means that for every additional child, the rating for any movie `r ifelse(sign_nchildren_coef, 'increases', 'decreases')` by that amount. But because of the interaction, we now interpret that as the *effect of children when genre is the reference group `r ref`*.
Now let's look at the interaction effect for children and the kids genre. It is `r nkids_kids_coef`, which means that for the kids genre, the *effect of having children in the home `r ifelse(sign_nkids_kids_coef, 'increases', 'decreases')`* by that amount. So our actual effect for an additional child for the kids genre is `r nchildren_coef` + `r nkids_kids_coef` = `r round(nchildren_coef + nkids_kids_coef, 2)` `r ifelse(sign_nchildren_coef + sign_nkids_kids_coef, 'increase', 'decrease')` in the expected review rating. In other words, the effect of children in the home is stronger for kids movies than for other genres.
We can also interpret this interaction in the reverse fashion. It is also correct to say that the difference in rating between the kids genre and the reference group `r ref` when there are no kids in the home is `r genre_kids_coef`. This means kids movies are generally rated worse if there are no kids in the home. But, with an increase in children, the difference in rating between the kids genre and `r ref` `r ifelse(sign_nkids_kids_coef, 'increases', 'decreases')` by `r abs(nkids_kids_coef)`. In other words, it is a **difference in differences**[^diffindiff].
[^diffindiff]: Some models that employ an interaction that investigates categorical group differences like this actually call their model a difference-in-difference model.
When we talk about differences in coefficients across values of features, it can get a little bit hard to follow. To combat this, we believe that in every case that you employ an interaction, you should look at the interaction visually for interpretation. Here is a plot of the predictions from the interaction model. We can see that the effect of children in the home is stronger for kids movies than for other genres, which makes a whole lot of sense! In other genres, the effect of having children seems to produce little difference, and in others it still has a positive effect, but not as strong as for kids movies.
::: {.content-visible when-format='html'}
```{r }
#| echo: false
#| label: fig-model-interaction-plot
#| fig-cap: Interaction Plot
library(ggeffects)
p_init = plot(
ggpredict(model_interaction, terms = c('children_in_home', 'genre')),
alpha = .05,
# line_size = I(rep(c(rep(.10, 7), 2), 4)) # values of children in home = 0:3, kids is last genre, but it's ignored anyway
line_size = 1
)$data |>
as_tibble()
p_init |>
ggplot(aes(x = x, y = predicted, color = group, fill = group)) +
geom_line(linewidth = 1.5) +
# geom_ribbon(aes(ymin = conf.low, ymax = conf.high), alpha = .1, color = NA) +
geom_borderline(linewidth = 2.5, data = p_init |> filter(group == 'Kids') ) +
scale_color_manual(values = unname(okabe_ito), aesthetics = c('color', 'fill')) +
# scale_color_brewer(palette = 'Set1', aesthetics = c('color', 'fill')) +
scale_y_continuous(breaks = seq(0, 5, .25)) +
labs(
x = 'Children in home',
y = 'Predicted\nrating',
caption = 'Bold line is the Kids genre'
) +
theme_clean() +
theme(
panel.grid.major = element_line(color = alpha('black', .05)),
)
p_print = p_init |>
filter(group != 'Kids') |>
ggplot(aes(x = x, y = predicted)) +
geom_line(aes(linetype = group), linewidth = 1) +
geom_line(
aes(linetype=group),
linewidth = 3,
linetype = 'solid',
data = p_init |> filter(group == 'Kids') |> droplevels()
) +
scale_y_continuous(breaks = seq(0, 5, .25)) +
labs(
x = 'Children in home',
y = 'Predicted\nrating',
caption = 'Bold line is the Kids genre'
) +
theme(
legend.key.size = unit(1.5, 'cm'),
panel.grid.major = element_line(color = alpha('black', .05)),
)
# plot(modelbased::estimate_expectation(model_interaction), si_alpha = .01)
ggsave('img/lm-extend-interaction_plot.svg', p_print, width = 8, height = 6)
```
:::
::: {.content-visible when-format='pdf'}
![Interaction Plot](img/lm-extend-interaction_plot.svg){#fig-model-interaction-plot}
:::
So we can see that interactions can allow a linear effect to vary depending on the values of another feature. But the real take home message from this is that *the general effect is actually not just a straight line*! The linear effect *changes* depending on the setting. Furthermore, the effect for children when the interaction is present only applies when 'genre' is at its default group, or when other features are at their default or zero.
So, when we have interactions, we can't talk about a feature's relationship with a target without considering the other features it interacts with. Some might see this as a downside, but it's actually how most feature relationships work - they don't exist in isolation. Interactions let us model these complex relationships, and they're used a lot in real-world situations.
### Summarizing Interactions
So what is *the* effect of children in the home? Or a particular genre, for that matter? This is a problematic question, because the effect of one feature depends on the setting of the other feature. We can summarize interactions, and we show two ways in which to do so. But we need to be very careful about summarizing a single feature's effects when we know it interacts with another feature.
### Average effects {#sec-lm-extend-avgeffect}
One thing we can do is get an average effect for a feature. In other words, we can say what the effect of a feature is *on average* across the settings of the other features. This is called the **average marginal effect**, something we've talked about in @sec-avg-marginal-effects[^ME]. Here is the average effect of children in the home across all genres.
[^ME]: These results are provided by the [marginaleffects]{.pack} R package, which is a great tool.
\small
```{r}
#| echo: false
#| label: tbl-model-interaction-avg-effect
#| tbl-cap: Average Marginal Effects of Children in the Home
library(marginaleffects)
me_children = marginaleffects::avg_slopes(model_interaction, variables = c("children_in_home"))
me_children |>
as_tibble() |>
select(-s.value) |>
gt() |>
fmt_number(
columns = c(estimate),
decimals = 3
)
```
\normalsize
So-called **marginal effects**, and related approaches such as SHAP values (see @sec-knowing-shap-values), attempt to boil down the effect of a feature to a single number. Here we see the average coefficient for children in the home is `r round(me_children$estimate, 2)`, with a range from `r round(me_children$conf.low, 2)` to `r round(me_children$conf.high, 2)`.
But this is difficult even in the simpler GLM settings, and can be downright misleading in interaction models. We saw from @tbl-model-interaction-output that this average is slightly larger than what we would estimate in the baseline model, and we saw in @fig-model-interaction-plot it's actually near zero (flat) for some genres in the interaction model. So what is the average effect really telling us? Consider a more serious case of drug effects across demographic groups, where the effect of the drug is much stronger for some groups than others. Would you want your doctor to prescribe you a drug based on the average effect across all groups, or the specific group to which you belong?
### ANOVA {#sec-lm-extend-anova}
A common method for summarizing categorical effects in linear models is through **analysis of variance** or ANOVA, something we briefly mentioned in our chapter introducing the linear model @sec-lm-categorical-anova. ANOVA breaks down the variance in a target attributable to different features or their related effects such as interactions. It's a bit beyond the scope here to get into all the details, but we demonstrate it here, as it's also used to summarize interactions. It also summarizes the random effects and spline terms we'll see in the coming sections on mixed and generalized additive models.
:::{.panel-tabset}
##### R
```{r}
#| label: anova-r
#| eval: false
anova(model_interaction)
```
##### Python
```{python}
#| label: anova-py
#| eval: false
sm.stats.anova_lm(model_interaction)
```
:::
```{r}
#| echo: false
#| label: tbl-anova-r
#| tbl-cap: ANOVA Table for an Interaction Model
anova(model_interaction) |>
tidy() |>
rename(
feature = term,
df = df,
sum_sq = sumsq,
mean_sq = meansq,
f = statistic,
p = p.value
) |>
gt() |>
sub_missing(
missing_text = "",
)
```
In this case, it doesn't appear that the general interaction effect is statistically significant if we use the typical .05 cut-off. We know the effect of children in the home varies across genres, but this result suggests maybe it's not as much as we might think. However, we also saw that the estimate for the children effect more than doubled for the kids genre, so maybe we don't want to ignore it. That's for you to decide.
The ANOVA approach can be generalized to provide a statistical test to compare models. For example, we can compare the baseline model to the interaction model to see if the interaction model is a better fit. However, it's entirely consistent with just looking at the interaction's statistical result in the ANOVA for the interaction model alone, so it doesn't provide any additional information in this case. Note that the only models that can be compared with ANOVA must be *nested*, i.e., one model is an explicit subset of the other.
:::{.callout-note title='ANOVA is not a model' collapse='true'}
It's worth noting that ANOVA is often confused with being a model itself. When people use it this way, it is just a linear regression with only categorical features, something that is usually only seen within strict experimental designs. It's pretty difficult to think of a linear regression setting where no continuous features would be of theoretical interest, but back when people were doing this stuff by hand, they just categorized everything to enable doing an ANOVA, which was tedious arithmetic but more manageable. It's a bit of a historical artifact, but might be useful for exploratory purposes of feature relationships. For model comparison however, other approaches are more general than ANOVA and generally preferred. An example is using AIC, or the other methods employed that we discuss the in machine learning chapters, like cross-validation error.
:::
### Interactions in Practice
When dealing with interactions in a model, it's best to consider how the feature-target relationship changes based on the values of other features it interacts with. Visualizing these effects is important in helping us understand how the relationships change. It's also helpful to consider what the predicted outcome is at important feature values, and how this changes with different feature values. This is the approach we've used with interactions, and it's a good strategy overall.
## Mixed Models {#sec-lm-extend-mixed-models}
### Knowing your data {#sec-lm-extend-mixed-knowing}
As much fun as modeling is, knowing your data is far more important. You can throw any model you want at your data, from simple to fancy, but you can count on disappointment if you don't fundamentally know the structure that lies within your data. Let's take a look at the following visualizations. In @fig-length-release-rating, we see a positive relationship between the length of the movie and ratings.
:::{.content-visible when-format='html'}
```{r}
#| echo: false
#| label: fig-length-release-rating
#| fig-cap: Linear Relationship Between Length of Movie and Rating
ggplot(df_reviews, aes(length_minutes, rating)) +
geom_point() +
geom_smooth(method="lm", se = FALSE) +
labs(
x = 'Length of movie (minutes)',
y = 'Rating',
# caption = 'Data from movie reviews'
)
ggsave("img/lm-extend-length_release_rating.svg", width = 8, height = 6)
```
:::
:::{.content-visible when-format='pdf'}
![Linear Relationship Between Length of Movie and Rating](img/lm-extend-length_release_rating.svg){#fig-length-release-rating}
:::
We could probably just stop there, but given what we just saw with interaction, we might think to ourselves to be ignoring something substantial within our data: genre. We might want to ask a question, "Does this relationship work the same way across the different genres?"
```{r}
#| echo: false
#| eval: false
#| label: fig-length-genre-rating
#| fig-cap: Genre Effects on Length and Rating
#| out-width: 100%
# p_mod = lm(rating ~ length_minutes*genre,
# df_reviews |> mutate(genre = recode(genre, "Action/Adventure" = "Act/Adv")))
# plot(modelbased::estimate_expectation(p_mod, trend = 'length_minutes', at = 'genre')) +
# scale_x_discrete(guide = guide_axis(n.dodge=2)) +
# labs(title = '', y = '') +
# theme_clean()
mods = df_reviews |>
nest_by(genre) |>
mutate(model = list(lm(rating ~ length_minutes, data = data))) |>
reframe(broom::augment(model))
p_slopes = ggplot(mods, aes(length_minutes, .fitted, color = genre)) +
geom_line(aes(linetype = genre), lwd = .5) +
scale_y_continuous(limits = c(1, 5)) +
labs(title = '', y = '', x = '')
p_mod = lm(rating ~ genre, df_reviews |> mutate(genre = recode(genre, "Action/Adventure" = "Act/Adv")))
p_dat = plot(modelbased::estimate_means(p_mod))$data
p_group_means = p_dat |>
ggplot(aes(x = genre, y = Mean)) +
geom_dotplot(
aes(x = genre, y= rating),
binaxis='y',
stackdir='center',
dotsize=.1,
color = 'black',
data = df_reviews |>
mutate(genre = recode(genre, "Action/Adventure" = "Act/Adv")),
position = position_jitter(width = .0, height = .15),
alpha = .25
) +
geom_pointrange(
aes(ymin = CI_low, ymax = CI_high, color = genre),
fatten = 5,
alpha = 1,
show.legend = FALSE
) +
scale_x_discrete(guide = guide_axis(n.dodge=2)) +
labs(title = '', y = '', x = '') +
scale_y_continuous(limits = c(1, 5))
see::plots(
p_group_means,
p_slopes
) +
plot_layout(guides = 'collect') &
# scale_color_brewer(palette = 'Set1') &
plot_annotation(
title = 'Genre Effects on Length and Rating',
caption = 'Left Plot: Group Means by Genre\nRight Plot: Length Slopes by Genre'
) &
theme(
panel.grid.major.y = element_line(color = scales::alpha('black', .1)),
legend.key.size = unit(1.25, "cm"),
)
ggsave("img/lm-extend-length_genre_rating.svg", width = 8, height = 6)
```
![Genre Effects on Length and Rating](img/lm-extend-length_genre_rating.svg){#fig-length-genre-rating}
A very quick examination of @fig-length-genre-rating might suggest that the rating varies by genre, and that the relationship between length and rating varies significantly over the different genres. The group means in the right panel show variability across genres. In addition, on the left panel, some genres show a strong positive relationship, some show less of a positive relationship, a couple even show a negative relationship, and one even looks flat. We can also see that they would have different intercepts. This is a very important thing to know about your data! If we had just run a model with length as a feature and nothing else, we would have missed this important information.
Now consider something a bit more complicated. Here is a plot of the relationship between the length of a movie and the rating, but across release year. Again we might think there is notable variability in the effect across years, as some slopes are positive, some very strongly positive, and some are even negative. How can we account for this when there are so many group effects to consider?
:::{.content-visible when-format='html'}
```{r}
#| echo: false
#| label: fig-length-release-rating-year
#| fig-cap: Release Year Effects on Length and Rating
# glimpse(df_reviews)
mods_release = df_reviews |>
nest_by(release_year) |>
mutate(model = list(lm(rating ~ length_minutes, data = data))) |>
reframe(broom::augment(model))
p_slopes_release = ggplot(mods_release, aes(length_minutes, .fitted, color = release_year)) +
geom_line(aes(group= release_year), lwd = 1) +
scale_color_scico(palette = 'oslo', end = .85, breaks = seq(1985, 2020, 10)) +
scale_y_continuous(limits = c(1, 5)) +
labs(title = '', y = '', x = '')
p_slopes_release
ggsave("img/lm-extend-length_release_rating_year.svg", width = 8, height = 6)
```
:::
:::{.content-visible when-format='pdf'}
![Release Year Effects on Length and Rating](img/lm-extend-length_release_rating_year.svg){#fig-length-release-rating-year}
:::
### Overview of mixed models {#sec-mixed-models-overview}
What we've just seen might initially bring to mind an interaction effect, and that's the right way to think about it! A **mixed model** can be used to incorporate that type of relationship into our model, which we can think of as a group interaction, without much hassle and additional explainability. But it's actually a quite flexible class that can also allow for more complicated but related types.
Before going too much further, the term *mixed model* is as vanilla as we can possibly make it, but you might have heard of different names such as *hierarchical linear models*, or *multilevel models*, or maybe *mixed-effects models*. Maybe you've even been exposed to ideas like *random effects* or *random slopes*. These are in fact all instances of what we're calling a *mixed model*.
What makes a model a *mixed* model? The mixed model is characterized by the idea that a model can have **fixed effects** and **random effects**. Fortunately, you've already encountered *fixed* effects -- those are the features that we have been using in all of our models so far! We are assuming a single true parameter (e.g., coefficient/weight) to estimate for each of those features, and that parameter is *fixed*.
In mixed models, a *random effect* instead comes from a specific distribution, and this is almost always a normal distribution. This random effect adds a unique source of variance in the target variable. This distribution of effects can be based on a grouping variable (such as genre), where we let those parameters, i.e., coefficients (or weights), vary across the groups, creating a distribution of values.
Let's take our initial example with movie length and genre. Formally, we might specify something like this:
$$
\text{Rating} = b_{\text{int[genre]}} + b_\text{length}*\text{length} + \epsilon
$$
In this formula, we are saying that genre has its own unique effect for this model in the form of specific intercepts for each genre[^mixedeps]. This means that whenever an observation belongs to a specific genre, it will have an intercept that reflects that genre, and that means that two observations with the same movie length but from different genres would have different predictions.
[^mixedeps]: The error term $\epsilon$ is still assumed normal with mean zero and variance $\sigma^2$.
We also posit, and this a key point, that those group effects come from a *random* distribution. We can specify that as:
$$b_{\text{int[genre]}} \sim \text{N}(b_\text{intercept}, \sigma_\text{int\_genre})$$
This means that the random intercepts will be normally distributed and the overall intercept is just the mean of those random intercepts, and with its own variance. It also means our model will have to estimate that variance along with our residual variance. Another very common depiction is as follows, that makes explicit the addition of a random effect to the intercept:
$$\text{re}_{[\text{int\_genre}]} \sim \text{N}(0, \sigma_\text{int\_genre})$$
$$b_{\text{int[genre]}} = b_\text{intercept} +\text{re}_{[\text{int\_genre}]}$$
The same approach would apply with a random slope, where we would have a random slope for each group, and that random slope would be normally distributed with its own variance. Compared to our first formulation it would look like this:
$$b_{\text{length[genre]}} \sim \text{N}(b_\text{length}, \sigma_\text{length\_genre})$$
A simple random intercept and slope is just the start. As an example, we can let the intercepts and slopes correlate with one another, and we could have multiple grouping factors, as well as allowing multiple features and even interactions themselves to vary by group! So mixed models can get quite complex, but the basic idea is still the same: we are allowing parameters to vary across groups, and we are estimating the variance of those parameters.
### Using a mixed model {#sec-mixed-models-using}
One of the key advantages of a mixed is that we can use it when the observations within a group are not independent. This is a very common situation in many fields, and it's a good idea to consider this when you have grouped data. As an example we'll use the happiness data for all available years, and we'll consider the country as a grouping variable. In this case, observations within a country are likely to be more similar to each other than to observations from other countries. Such **longitudinal data** is a classic example of when to use a mixed model. This is also a case where we wouldn't just throw in 'country' as a feature like any other factor, since there are `r n_distinct(df_happiness_all$country)` countries in the data. We need an easier way to handle so many groups!
In general, to use mixed models we have to specify a random effect pertaining to the categorical feature of focus, but that's the primary difference from our previous approaches used for linear or generalized linear models. For our example, we'll look at a model with a random intercept for the country feature, and one that adds a random coefficient for the yearly trend across countries. This means that we are allowing the intercepts and slopes to vary across countries, and the intercepts and slopes can correlate with one another. Furthermore, by recoding year to start at zero, the intercept will represent the happiness score at the start of the data. In addition, to see a more reasonable effect, we also divide the yearly trend by 10, so the coefficient provides the change in happiness score per decade.
:::{.panel-tabset}
##### R
We'll use the [lme4]{.pack} package in R which is the most widely used package for mixed models.
```{r}
#| eval: false
#| label: r_mixed_model
library(lme4)
df_happiness_all = read_csv("https://tinyurl.com/worldhappinessallyears")
df_happiness_all = df_happiness_all |>
mutate(decade_0 = (year - min(year))/10)
# random intercepts are specified by a 1
model_ran_int = lmer(
happiness_score ~ decade_0 + (1| country),
df_happiness_all
)
model_ran_slope = lmer(
happiness_score ~ decade_0 + (1 + decade_0 | country),
df_happiness_all
)
# not shown
summary(model_ran_int)
summary(model_ran_slope)
```
##### Python
As with our recommendation with GAMs later, R is the better tool for mixed models, as the functionality is overwhelmingly better there for modeling and post-processing. However, you can use [statsmodels]{.pack} in Python to fit them as well[^mixedpy].
[^mixedpy]: One of your authors worked for several years with the key developer of the mixed models functionality in [statsmodels]{.pack}. As such, we can say there is zero doubt about the expertise going into its development, as there are few in the world with such knowledge. Even so, you probably won't find the functionality is not as mature or as expansive as what you get in R.
```{python}
#| eval: true
#| label: python_mixed_model
#| output: false
import statsmodels.api as sm
df_happiness_all = pd.read_csv("https://tinyurl.com/worldhappinessallyears")
df_happiness_all = (
df_happiness_all
.assign(decade_0 = lambda x: (x['year']- x['year'].min())/10)
)
model_ran_int = sm.MixedLM.from_formula(
"happiness_score ~ decade_0",
df_happiness_all,
re_formula='1',
groups=df_happiness_all["country"]
).fit()
model_ran_slope = sm.MixedLM.from_formula(
"happiness_score ~ decade_0",
df_happiness_all,
re_formula='1 + decade_0',
groups=df_happiness_all["country"]
).fit()
# not shown
# model_ran_int.summary()
# model_ran_slope.summary()
```
:::
@tbl-mixed-model shows some typical output from a mixed model, focusing on the random slope model. The fixed effect part (Fixed) is your basic GLM result and interpreted the same way. Nothing new there, and we can see a slight positive decade trend in happiness, though maybe not a strong one. But the random effects (Random) are where the action is! We can see the standard deviation of the random effects, i.e., the intercepts and slopes. We can also see the correlation between the random intercepts and random slopes. And finally we also have the residual (observation level) standard deviation, which is interpreted the same as with standard linear regression. Depending on your modeling tool, the default result may be in terms of variances and covariances rather than standard deviations and correlations, but generally similar.
\small
```{r}
#| echo: false
#| label: tbl-mixed-model
#| tbl-cap: Mixed Model Results
#| out.width: 90%
library(lme4)
df_happiness_all = read_csv("data/world_happiness_all_years.csv")
df_happiness_all_show = df_happiness_all |>
mutate(decade_0 = (year - min(year))/10)
model_ran_int = lmer(
happiness_score ~ decade_0 + (1| country),
df_happiness_all_show
)
model_ran_slope = lmer(
happiness_score ~ decade_0 + (1 + decade_0 | country),
df_happiness_all_show
)
vc = model_ran_slope |> mixedup::extract_vc(ci_level = 0)
vcor = model_ran_slope |> mixedup::extract_vc(ci_level = 0, show_cor = TRUE)
vcor = vcor$Cor$country
cor_sign = ifelse(sign(vcor[2]) > 0, 'positive', 'negative')
# gtsummary::tbl_regression(model_ran_slope, tidy_fun = broom.mixed::tidy) # doesn't include intercept and default is 'include everything'.
# broom.mixed::tidy(model_ran_slope) |>
# mutate(
# effect = ifelse(str_detect(effect, 'ran_pars'), 'Random', 'Fixed'),
# term = ifelse(str_detect(term, 'Observation'), '', term),
# term = str_remove_all(term, '\\(|\\)'),
# ) |>
# group_by(effect) |>
# gt() |>
# sub_missing(
# missing_text = "",
# )
parameters::parameters(model_ran_slope) |>
as_tibble() |>
select(-CI) |>
mutate(
across(where(is.numeric), \(x) as.character(round(x, 2))),
Effects = str_to_title(Effects)
) |>
select(-c(t:p)) |>
group_by(Effects) |>
gt() |>
sub_missing(
missing_text = "",
)
```
\normalsize
```{r}
#| echo: false
#| label: get-happiness-range
happy_range = round(range(df_happiness_all$happiness_score))
```
In this case, we can see notable variability attributable to the random effects. How do we know this? Well, if our happiness score is on a `r happy_range[1]` - `r happy_range[2]` scale, and we have a standard deviation of happiness of `r round(sd(df_happiness_all$happiness_score), 2)` before accounting for anything else. So, we might surmise that having an effect of that size (roughly `r round(vc$sd[1], 2)`) is a relatively notable amount, as it suggests we would move around about a standard deviation of happiness just going from country to country (on average).
The trend variability is also notable. Although the overall effect is small, the coefficient standard deviation shows that it ranges from a notable negative trend to a notable positive trend ($\pm$ `r round(vc$sd[2], 2)`) depending on which country we're looking at.
In addition, we can also see that the correlation between the random intercepts and random slopes is `r cor_sign`, which means that the groups with higher starting points have `r ifelse(cor_sign == 'positive', 'more positive', 'more negative')` slopes.
Now let's look at the estimates for the random effects for the model with both intercepts and slopes[^mixedup].
[^mixedup]: MC provides a package for mixed models in R called [mixedup]{.pack}. It provides a nice way to extract random effects and summarize such models ([link](https://github.com/m-clark/mixedup)).
:::{.panel-tabset}
##### R
```{r}
#| eval: true
#| label: r_mixed_model_random_effects
estimated_RE = ranef(model_ran_slope)
# mixedup::extract_random_effects(model_ran_slope) # prettier version
```
##### Python
```{python}
#| eval: true
#| label: python_mixed_model_random_effects
estimated_RE = pd.DataFrame(model_ran_slope.random_effects)
```
:::
```{r}
#| echo: false
#| eval: false
#| label: fig-random-effects
#| fig-cap: Estimated Random Effects
#| out-width: 100%
#| fig-height: 8
model_ran_int_code = lmer(
happiness_score ~ decade_0 + (1| cntry_code),
df_happiness_all_show
)
model_ran_slope_code = lmer(
happiness_score ~ decade_0 + (1 + decade_0 | cntry_code),
df_happiness_all_show
)
x = visibly::plot_coefficients(model_ran_slope_code, ranef = TRUE, which_ranef = 'cntry_code')
all_re = mixedup::extract_random_effects(model_ran_slope_code)
re_usa = all_re |>
filter(group == 'USA')
ran_coef_usa = mixedup::extract_random_coefficients(model_ran_slope_code) |>
filter(group == 'USA')
ran_ef_extreme_pos = all_re |>
arrange(desc(value)) |>
group_by(effect) |>
slice(1:3)
ran_ef_extreme_neg = all_re |>
arrange(value) |>
group_by(effect) |>
slice(1:3)
p = (
(x[[1]] + labs(x = '', y = 'Intercept')) /
(x[[2]] + labs(x = 'Country', y = 'Decade\nTrend')) + scale_linewidth_manual(values = rep(10, 1000))
) &
scale_x_discrete(guide = guide_axis(n.dodge=2)) &
plot_layout(guides = 'collect') &
plot_annotation(
subtitle = 'Estimated random effects for country',
caption = 'Each point represents the random effect for a country.\nDark intervals are significantly different from 0.'
) &
theme(
axis.ticks.x = element_line(linewidth = .5),
axis.text.x = element_text(size = 4, angle = 90, hjust = 1, vjust = .5),
panel.grid.major = element_blank()
# axis.text.x = element_blank(),
)
p
ggsave("img/lm-extend-random_effects.svg", width = 8, height = 8)
```
![Estimated Random Effects](img/lm-extend-random_effects.svg){#fig-random-effects width=100%}
How do we interpret these *deviations*? For starters, they are deviations from the fixed effect for the intercept and decade trend coefficient. So here that means anything negative is an intercept or slope below the corresponding fixed effect value, and anything positive is above that value. If we want the specific effect for a country, we just add its random effect to the fixed effect, and we can refer to those as **random coefficients**.
For example, if we wanted to know the effects for the US, we would add its random effects to the population level fixed effects. This is exactly what we did in the previous section in our interpretation with the interaction model. However, you can typically get these from the package functionality directly. The result shows that the US starts at a very high happiness score, but actually has a negative trend over time[^lifeexpusa].
[^lifeexpusa]: Life expectancy in the US has actually [declined recently](https://www.hsph.harvard.edu/news/hsph-in-the-news/whats-behind-shocking-u-s-life-expectancy-decline-and-what-to-do-about-it/).
:::{.panel-tabset}
##### R
```{r}
#| eval: true
#| label: r_mixed_model_random_effects_usa
# coef(model_ran_slope) stored here
ranef_usa = estimated_RE$country
ranef_usa = ranef_usa |>
rownames_to_column('country') |>
filter(country == 'United States')
ranef_usa[1, c('(Intercept)', 'decade_0')] + fixef(model_ran_slope)
```
##### Python
```{python}
#| eval: true
#| label: python_mixed_model_random_effects_usa
ranef_usa = estimated_RE['United States'].rename({'Group': 'Intercept'})
ranef_usa + model_ran_slope.fe_params
```
:::
:::{.callout-note title='Averages in Mixed Models' collapse='true'}
In the linear mixed effect model setting with a random intercept, the fixed effects can be seen as the (population) average effects, but this is not exactly what you are getting from the mixed model. To make the distinction clear, consider family groups and a gender effect for males vs. females. The linear regression and some other types of models (e.g., estimated via generalized estimating equations) would give you the average effect male-female difference across all families. The mixed model actually tells you the male-female difference as if they were in the same family (e.g., siblings). Again, in the simplest mixed model setting these are the same. Beyond that that, when we start dealing with random slopes and non-gaussian distributions, they are not.
In general, if we set the random effect to 0 to get a prediction, that tells us what the prediction would be for a typical group, in this case, a typical country. Often we want to get something more like the average slope or prediction across countries that we would have with linear regression. This gets us back to the idea of the **marginal effect** we discussed earlier. While the mechanics are not straightforward for mixed models, the tool use generally takes no additional effort.
:::
Let's plot those random coefficients together to see how they relate to each other.
```{r}
#| echo: false
#| eval: false
#| label: fig-random-effects-cor-plot
#| fig-cap: Random Effects for Mixed Model
#| out-width: 100%
all_re_wide = mixedup::extract_random_coefficients(model_ran_slope_code) |>
select(group, effect, value) |>
pivot_wider(names_from = effect, values_from = value)
p_re_cor = all_re_wide |>
ggplot(aes(x = Intercept, y = `decade_0`)) +
geom_hline(yintercept = 0, color = 'gray75', linewidth = 2) +
geom_point(aes(),
data = all_re_wide |> filter(group %in% c(ran_ef_extreme_neg$group, ran_ef_extreme_pos$group)),
color = okabe_ito[2],
size =3
) +
geom_point(aes()) +
geom_text_repel(
aes(label = group, x = Intercept, y = `decade_0`),
segment.color = 'gray75',
size = 2,
max.overlaps = 20
) +
labs(
x = 'Intercept',
y = 'Decade\nTrend',
subtitle = 'Estimated random coefficients for country',
caption = 'Some more extreme country points highlighted.'
)
p_re_cor
ggsave("img/lm-extend-random_effects_cor_plot.svg", width = 8, height = 6)
```
![Random Effects for Mixed Model](img/lm-extend-random_effects_cor_plot.svg){#fig-random-effects-cor-plot width=100%}
From this plot, we can sense why the estimated random effect correlation was negative. For individual country results, we can see that recently war-torn regions like Syria and Afghanistan have declined over time even while they started poorly as well. Some like Guinea and Togo started poorly but have improved remarkably over time. Many western countries started high and mostly stayed that way, though generally with a slight decline. Perhaps there's only one direction to go when you're already starting off well!
:::{.callout-note title='Always scale features for mixed models' collapse='true'}
Your authors have run a lot of these models. Save yourself some trouble and standardize or otherwise scale your features before fitting the model. Just trust us, or at least, don't be surprised when your model doesn't converge.
:::
### Mixed model summary {#sec-mixed-models-summary}
For a model with just one feature, we certainly had a lot to talk about! And this is just a glimpse of what mixed models have to offer, and the approach can be even richer than what we've just seen. But you might be asking- Why don't I just put genre country the model like other categorical features? In the case of genre for movie reviews where there are few groups, that's okay. But doing that with features with a lot of levels would typically result in estimation issues due to having so many parameters to estimate. In general mixed models provide several advantages for the data scientist:
- Any coefficient can be allowed to vary by groups, including other random effects. It actually is just an interaction in the end as far as the linear predictor and conceptual model is concerned.
- The group-specific effects are *penalized*, which shrinks them toward the overall mean, and makes this a different approach from just adding a 'mere interaction'. This helps avoid overfitting, and that penalty is related to the variance estimate of the random effect. In other words, you can think of it as running a penalized linear model where the penalty is applied to the group-specific effects (see @sec-estim-penalty).
- Unlike standard interaction approaches, we can estimate the covariance of the random effects, and specify different covariance structures for observations within groups. Standard interactions implicitly assume independence.
- Because of the way they are estimated, mixed models can account for lack of independence of observations[^indieobs], which is a common issue in many datasets. This is especially important when you have repeated measures, or when you have a hierarchical structure in your data, such as students within schools, or patients within hospitals.
- Standard modeling approaches can estimate these difficult models very efficiently, even with thousands of groups and millions of observations.
- The group effects are like a very simplified **embedding** (@sec-data-cat), where we have taken a categorical feature and turned it into a numeric one, like those shown in @fig-random-effects. This may help you understand other embedding techniques that are used in other places like deep learning.
- When you start to think about random effects and/or distributions for effects, you're already thinking like a Bayesian (@sec-unc-bayes), who is always thinking about the distributions for various effects. Mixed models are the perfect segue from standard linear model estimation to Bayesian estimation, where everything is random.
- The random effect is akin to a latent variable of 'unspecified group causes'. This is a very powerful idea that can be used in many different ways, but importantly, you might want to start thinking about how you can figure out what those 'unspecified' causes may be!
- Group effects will almost always improve your model's predictive performance relative to not having them, especially if you weren't including those groups in your model because of how many groups there were.
[^indieobs]: Independence of observations is a key assumption in linear regression models, and when it's violated, the standard errors of the coefficients are biased, which can lead to incorrect inferences. Rather than hacking a model (so-called 'fixed effects' models) or 'correcting' the standard error (e.g., with some 'sandwich' or estimator), mixed models can account for this lack of independence through the model itself.
In short, mixed models are a fun way to incorporate additional interpretive color to your model, while also getting several additional benefits to help you understand your data!
## Generalized Additive Models {#sec-gam}
<!-- :::{.content-visible when-format='html'}
> Wiggle, wiggle, wiggle, yeah!
> -- LMFAO
:::
-->
<!-- TODO: Chelsea suggested diving a bit deeper here, and we should explain the function code a bit more. But getting into a spline demo seems a bit much. -->
### When straight lines aren't enough {#sec-gam-curve}
Linear models, as their name implies, generally assume a linear relationship between the features and the target by default. But fitting a line through your data isn't always going to be the best approach. Not every relationship is linear and not every relationship is monotonic. Sometimes, you need to be able to model a relationship that has a fair amount of nonlinearity -- they can appear as slight curves, waves, and any other type of wiggle that you can imagine.
In other words, we can go from the straight line here:
:::{.content-visible when-format='html'}
```{r}
#| echo: false
#| label: fig-regular-linear-effect
#| fig-cap: A Standard Linear Relationship
df_happiness |>
filter(year == 2018) |>
ggplot(aes(healthy_life_expectancy_at_birth, happiness_score)) +
geom_point() +
geom_smooth(method = "lm", se = FALSE) +
labs(x = 'Life Expectancy', y = 'Happiness\nScore')
ggsave("img/lm-extend-linear_effect.svg", width = 8, height = 6)
```
:::
:::{.content-visible when-format='pdf'}
![A Standard Linear Relationship](img/lm-extend-linear_effect.svg){#fig-regular-linear-effect}
:::
To the curve seen here:
:::{.content-visible when-format='html'}
```{r}
#| echo: false
#| label: fig-gam-model-effect
#| fig-cap: A Curvilinear Relationship
# df_happiness |>
read_csv('https://tinyurl.com/worldhappiness2018') |>
ggplot(aes(healthy_life_expectancy_at_birth, happiness_score)) +
geom_point() +
geom_smooth(method = 'gam', se = FALSE) +
labs(x = 'Life Expectancy', y = 'Happiness\nScore')
ggsave("img/lm-extend-nonlinear_effect.svg", width = 8, height = 6)
```
:::
:::{.content-visible when-format='pdf'}
![A Curvilinear Relationship](img/lm-extend-nonlinear_effect.svg){#fig-gam-model-effect}
:::
There are many ways to go about this, and common approaches include using polynomial regression, using feature transformations, or using a nonlinear regression model. The curved line in @fig-gam-model-effect is the result of what's called a **spline**. It is created by a feature and expanding it to multiple columns, each of which is a function of the original feature. We then fit a linear model to that data as usual. What this ultimately means is that we can use a linear model to fit a curve through the data. While this might not give us the same tidy explanation that a typical linear approach would offer, we will certainly get better predictions if it's appropriate, and a better understanding of the reality and complexity of the true relationship. It's also useful for exploratory purposes, and visualization tools can make it easy. Here is some demonstration code (result not shown).
<!-- [^ggplotly]: [Plotly]{.pack} is directly available in R and Python, and [plotnine]{.pack} is the [ggplot]{.pack} equivalent in Python. -->
:::{.panel-tabset}
##### R
```{r}
#| eval: false
#| label: r_gam_ggplot
x = rnorm(1000)
y = sin(x)
tibble(x, y) |>
ggplot(aes(x = x, y = y)) +
geom_smooth(method = 'gam', se = FALSE)
```
##### Python
```{python}
#| eval: false
#| label: python_model_gam
import plotly.graph_objects as go
import numpy as np
x = np.random.normal(size = 1000)
y = np.sin(x)
fig = go.Figure()
fig.add_trace(
go.Scatter(
x = x,
y = y,
line_shape = 'spline'
)
)
```
:::
Models incorporating this type of effect belong to a broad group of *generalized additive models* (**GAM**s). When we explored interactions and mixed models, we explored how the feature-target relationship varies with another feature. There we focused on our feature and its relationship to the target at different values of other features. When we use a GAM, our initial focus is just on a specific feature and how its relationship with the target changes at different of that feature's values. How are we going to do this, you might ask? Conceptually, we will have a model that looks like this:
$$
y = f(X) + \epsilon
$$
This isn't actually any different than what you've seen - it really isn't! It's just shorthand for the input $X$ being fed into a function $f()$ of some kind, exactly as we saw in @sec-models-expression-math. That function is very flexible, and can be anything you want.
What we're going to do with the function now is expand the feature $x$ in some way, which on a practical level means it will actually become multiple columns of input instead of just one. Some approaches to creating the expansion can be quite complex, with the goal of tackling spatial, temporal, or other aspects of the data. But, in the end, it's just extra columns in the **model matrix** that go into the model fitting function like any other feature. This enables the model to explore a more complex representation space, ultimately helping us capture non-linear patterns in our data.
At this point, you might be asking yourself, "Why couldn't I just use some type of polynomial regression or even a nonlinear regression?". Of course you could, but both have limitations relative to a GAM. If you are familiar with polynomial regression, where we add columns that are squares, cubes, etc. of the original feature, you can think of GAMs as a more general approach, and very similar in spirit. But that polynomial approach assumes a specific form of nonlinearity, and has no regularization. This means that it tends to overfit the data you currently have, and *you* are forcing curves to fit through the data, rather than exploring what naturally may arise.
Another approach besides polynomials is to use a **nonlinear regression model**. In this setting, you need to know what the underlying functional form is. An example is a [logistic growth curve model for population growth](https://en.wikipedia.org/wiki/Population_growth#Logistic_equation). Without taking extra steps, such models can also overfit. Furthermore, outside of well-known physical, chemical, or biological processes, it's rarely clear what the underlying functional form should be. At the very least, we wouldn't know a formula for life expectancy and happiness!