-
Notifications
You must be signed in to change notification settings - Fork 13
/
Copy pathlinear_models.qmd
1618 lines (1126 loc) · 96.3 KB
/
linear_models.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
# The Foundation {#sec-foundation}
![](img/chapter_gp_plots/gp_plot_1.svg){width=75% fig-align='center'}
:::{.content-visible when-format='html'}
> It is the chief characteristic of data science that it works.
― Isaac Asimov (paraphrased)
:::
```{r}
#| echo: false
#| label: r-display-setup
```
```{python}
#| echo: false
#| label: py-display-setup
import numpy as np
import pandas as pd
np.set_printoptions(precision = 4, suppress=True, floatmode = 'maxprec') # suppress is for whether to use scientific notation for otherwise rounded to zero numbers; apparently is either ignored or quarto doesn't pass
# https://github.com/numpy/numpy/issues/11048 # seriously wtf
pd.set_option('display.precision', 4) # number of digits of precision for floating point output
np.set_printoptions(formatter={'float': lambda x: '{0:0.4f}'.format(x)})
pd.set_option('display.float_format', lambda x: '%.3f' % x)
```
Now it's time to dive into some modeling! We'll start things off by covering the building block of all modeling, and a solid understanding here will provide you the basis for just about anything that comes after, no matter how complex it gets. The **linear model** is our starting point. At first glance, it may seem like a very simple model, and it is, relatively speaking. But it's also quite powerful and flexible, able to take in different types of inputs, handle nonlinear relationships, temporal and spatial relations, clustering, and more. Linear models have a long history, with even the formal and scientific idea behind correlation and linear regression being well over a century old[^corbypeirce]! And in that time, the linear model is far and away the most used model out there. But before we start talking about the *linear* model, we need to talk about what a **model** is in general.
[^corbypeirce]: Regression in general is typically attributed to [Galton](https://en.wikipedia.org/wiki/Francis_Galton), and correlation to Pearson, whose coefficient bearing his name is still the most widely used measure of association. Peirce & Bowditch were actually ahead of both [@rovine_peirce_2004], but [Bravais beat all of them](https://en.wikipedia.org/wiki/Auguste_Bravais).
## Key Ideas {#sec-lm-key-ideas}
To get us started, we can provide a few concepts key to understanding linear models. We'll cover each of these as we go along.
- The **linear model** is a foundation on which one can build an understanding for practically any other model.
- It combines the strength of its inputs - features - to predict a target.
- Prediction is fundamental to assessing and using a model.
- There are many ways to interpret a model, at the feature level and as a whole.
- It is rather easy to start building upon and adding complexity to a linear model.
- All models come with assumptions, and it's good to be aware of these.
- The things you learn here will be useful in many other contexts - such as other types of models, for classification tasks, and more!
As we go along, be sure that you feel you have the 'gist' of what we're talking about. Almost everything that goes beyond linear models builds on what's introduced here, so it's important to have a firm grasp before climbing to new heights.
### Why this matters {#sec-lm-why-this-matters}
The basic linear model and how it comes about underpins so many other models, from the simplest t-test to the most complex neural network. It provides a powerful foundation, and it is a model that you'll see in many different contexts. It's also a model that is relatively easy to understand, so it's a great place to start!
### Helpful context {#sec-lm-good-to-know}
We're just starting out here, but we're kind of assuming you've had some exposure to the idea of statistical or other models, even if only from an interpretation standpoint or visualizations of various relationships. We assume you have an understanding of basic stats like central tendency (e.g., a mean or median), variance, and correlation, stuff like that. And if you intend to get into the code examples, you'll need a basic familiarity with Python or R.
## *THE* Linear Model {#sec-lm-linear-model}
The linear model is perhaps the simplest *functional* model we can use to express a relationship between the features that serve as inputs, and the targets we hope to explain with them. And because of that, it's possibly still the most common model used in practice, and it is the basis for many types of other models. So why don't we do one now?
The following dataset has individual [movie reviews](#sec-dd-movie-reviews) containing the movie rating (1-5 stars scale), along with features pertaining to the review (e.g., word count), those that regard the reviewer (e.g., age) and features about the movie (e.g., genre, release year).
```{r}
#| echo: false
#| eval: false
#| cache: false
#| label: import-review
library(tidyverse)
df_reviews = read_csv('data/movie_reviews.csv')
skimr::skim(df_reviews)
```
For our first linear model, we'll keep things simple. Let's predict the rating based on the word count of the review. We'll use the `lm()` function in R and the `ols()` function in Python[^smfolsR] to fit the model. Both functions take a formula as the first argument, which, as we noted elsewhere (@sec-models-expressing-code), is just a way of expressing the relationship between the features and target. The formula is displayed as `y ~ x1 + x2 + ...`, where `y` is the target name and `x*` are the feature names. We also need to specify what the data object is, typically a data frame, where the features and target are found.
[^smfolsR]: We use the `smf.ols` approach because it is modeled on the R approach.
:::{.panel-tabset}
##### R
```{r my-first-model}
#| label: r-my-first-model
# all data found on github repo
df_reviews = read_csv('https://tinyurl.com/moviereviewsdata')
model_lr_rating = lm(rating ~ word_count, data = df_reviews)
summary(model_lr_rating)
```
```{r}
#| echo: false
#| eval: false
#| label: save-r-model_lr_rating
# only run as needed
save(model_lr_rating, file = 'data/model_lr_rating.RData')
```
##### Python
<!--
latex fix for pdf due to ridiculous 'Notes' output that should be documented and not part of the output (/vent). Note that this suggestion affects the code too, so not any better than just using the standard text size change.
https://stackoverflow.com/questions/76391230/change-the-font-size-of-the-output-of-python-code-chunk
-->
```{python}
#| label: py-my-first-model
#| results: hide
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf
# all data found on github repo
df_reviews = pd.read_csv('https://tinyurl.com/moviereviewsdata')
model_lr_rating = smf.ols('rating ~ word_count', data = df_reviews).fit()
model_lr_rating.summary(slim = True)
```
\small
```{python}
#| echo: false
#| label: py-my-first-model-show
model_lr_rating.summary(slim = True)
```
\normalsize
```{python}
#| echo: false
#| eval: false
#| label: save-py-model_lr_rating
# only run as needed
model_lr_rating.save('linear_models/data/model_lr_rating.pkl')
```
:::
```{r}
#| echo: false
#| label: my-first-model-coefs-etc
intercept = round(coef(model_lr_rating)[1], 2)
wc_coef = round(coef(model_lr_rating)[2], 2)
sd_y = round(sd(df_reviews$rating), 2)
sd_x = round(sd(df_reviews$word_count), 1)
rl_ci = round(confint(model_lr_rating)[2, ], 3)
n_words = round(mean(df_reviews$word_count))
model_rmse = round(performance::performance_rmse(model_lr_rating), 2)
model_rsq = round(performance::r2(model_lr_rating)$R2, 2)
```
For such a simple model, we certainly have a lot to unpack here! Don't worry, you'll eventually come to know what it all means. But it's nice to know how easy it is to get the results! For now we can just say that there's a *negative* relationship between the word count and the rating (the `r round(coef(model_lr_rating)[2], 3)`), which means that we expect lower ratings with longer reviews. The output also tells us that the value regarding the relationship is statistically significant (`P(>|t|)` value is < .05).
Getting more into the details, we'll start with the fact that the linear model posits a **linear combination** of the features. This is an important concept to understand, but really, a linear combination is just a sum of the features, each of which has been multiplied by some specific value. That value is often called a **coefficient**, or possibly **weight**, depending on the context, and will allow different features to have different contributions to the result. Those contributions reflect the amount and direction of the feature-target relationship. The linear model is expressed as (math incoming!):
$$
y = b_0 + b_1x_1 + b_2x_2 + ... + b_nx_n
$$ {#eq-lm-basic}
- $y$ is the target.
- $x_1, x_2, ... x_n$ are the features.
- $b_0$ is the intercept, which is kind of like a baseline value or offset. If we had no features at all it would just be the mean of the target.
- $b_1, b_2, ... b_n$ are the coefficients or weights for each feature.
But let's start with something simpler, let's say you want to take a sum of several features. In math you would write it as:
$$
x_1 + x_2 + ... + x_n
$$
In this equation, $x$ is the feature and n is the number identifier for the features, so $x_1$ is the first feature (e.g., word count), $x_2$ the second (e.g., movie release year), and so on. $x$ is an arbitrary designation - you could use any letter, symbol you want, or even better, would be the actual feature name. Now look at the linear model.
$$
y = x_1 + x_2 + ... + x_n
$$
In this case, the function is *just a sum*, something so simple we do it all the time. In the linear model sense though, we're actually saying a bit more. Another way to understand that equation is that *y is a function of x*. We don't show any coefficients here, i.e., the *b*s in our initial equation (@eq-lm-basic), but technically it's as if each coefficient was equal to a value of 1. In other words, for this simple linear *model*, we're saying that each feature contributes in an identical fashion to the target.
In practice, features will never contribute in the same ways, because they correlate with the target differently, or are on different scales. So if we want to relate some features, $x_1$ and $x_2$, to target $y$, we probably would not assume that they both contribute in the same way. For instance, we might assign more weight to $x_1$ than $x_2$, for whatever reason. In the linear model, this is expressed by multiplying each feature by a different coefficient or weight. So the linear model's primary component is really just a sum of the features multiplied by their coefficients, i.e., a *weighted sum*. Each feature's contribution to explaining or accounting for the target is proportional to its coefficient. So if we have a feature $x_1$ and a coefficient $b_1$, then the contribution of $x_1$ to the target is $b_1\cdot x_1$. If we have a feature $x_2$ and a coefficient $b_2$, then the contribution of $x_2$ to the target is $b_2 \cdot x_2$. And so on. So the linear model is really just a sum of the features multiplied by their respective weights.
For our specific model, here is the mathematical representation:
$$
\textrm{rating} = b_0 + b_1 \cdot \textrm{word\_count}
$$
And with the actual results of our model:
```{r}
#| echo: false
#| results: 'asis'
#| label: word-count-linear-model
init = "
$$
\\textrm{rating} = "
end = " \\cdot \\textrm{word\\_count}
$$"
glue("{init}{intercept} + {wc_coef}{end}")
```
Not too complicated we hope! But let's make sure we see what's going on here just a little bit more.
- Our *idea* is that the length of the review in words is in some way related to the eventual rating given to the movie.
- Our *target* is the movie's rating by a reviewer, and the *feature* is the word count
- We *map the feature to the target* via the linear model, which provides an initial understanding of how the feature is related to the target. In this case, we start with a baseline of `r intercept`. This value makes sense only in the case of a rating with no review, and is what we would guess if the word count was 0. But we know there are reviews for every observation, so it's not very meaningful as is. We'll talk about ways to get a more meaningful intercept later, but for now, that is our starting point. Moving on, if we add a single word to the review, we expect the rating to `r word_sign(wc_coef, c('increase', 'decrease'))` by `r wc_coef` stars. So if we had a review that was `r n_words` words long, i.e., the mean word count, we would predict a rating of `r paste(intercept)` + `r glue('{n_words}*{wc_coef}')` = `r round(predict(model_lr_rating, newdata = data.frame(word_count = n_words)), 1)` stars.
### The linear model visualized {#sec-lm-graph}
Given our single feature model, we can easily plot the relationship between the word count and the rating. Now we can visually see the negative or downward-sloping relationship.
```{r}
#| echo: false
#| eval: false
#| label: fig-word-count-linear-model
#| fig-cap: Linear Model of Rating vs. Word Count
library(geomtextpath)
df_reviews |>
mutate(
preds = predict(model_lr_rating, newdata = data.frame(word_count = word_count)),
# txt =
) |>
ggplot(aes(x = word_count, y = rating)) +
geom_smooth(method = 'lm', se = FALSE, linewidth = 2) +
geomtextpath::geom_textline(
aes(y = preds),
label = glue::glue('Predicted rating = {intercept} + {wc_coef} * word count'),
size = 5,
vjust = -1,
linewidth = 0
) +
geom_point(
x = 0.1,
y = intercept,
size = 5,
color = 'gray50'
) +
# annotate(
# 'text',
# x = n_words,
# y = predict(model_lr_rating, newdata = data.frame(word_count = n_words)),
# label = glue::glue('Predicted rating = {intercept} + {wc_coef} * word count'),
# hjust = 0,
# vjust = -2,
# size = 8,
# color = 'gray25',
# fontface = 'bold',
# family = 'Roboto Condensed'
# ) +
lims(
x = c(0, max(df_reviews$word_count)),
y = c(0, 5)
) +
labs(
# title = 'Rating vs. Word Count',
x = 'Word Count',
y = 'Rating',
caption = glue('Dot is the intercept, what the predicted rating is with zero words, line is slope of {wc_coef}')
)
ggsave('img/lm-word_count_linear_model.svg')
```
![Linear Model Result of Rating vs. Word Count](img/lm-word_count_linear_model.svg){width=75% #fig-word-count-linear-model}
We can also express the linear model as a *graph* or *graphical model*, which can be a very useful way to think about models in a visual fashion. As we come across other models, a visualization like this can help us see both how different models relate to one another, and are actually very similar to one another. In the following example, we have three features predicting a single target, so we have three 'nodes' for the features, and a single node for the target. The feature nodes are combined into a linear combination to produce the **output** of the model. In the context of linear regression, the initial combination is often called the **linear predictor**. Each 'edge' signifies the connection of a feature to the combined result, and is labeled with the coefficient or weight. The connection between the linear predictor and final model output is direct, without any additional change from the linear predictor stage. This output will ultimately be compared to the observed target value to assess model performance. We'll return to this depiction a little bit later in this chapter(@sec-lm-classification), and other parts as well, but for our standard linear model, we're all set.
```{r}
#| echo: false
#| label: fig-graph-lm
#| fig-cap: Linear Model as a Graph
g = DiagrammeR::grViz('img/graphical_lm.dot')
g |>
DiagrammeRsvg::export_svg() |>
charToRaw() |>
rsvg::rsvg_svg('img/graphical_lm.svg')
```
![A linear regression as a graphical model](img/graphical_lm.svg){width=75% #fig-graph-lm}
So at this point you have the basics of what a linear model is and how it works, and a couple ways to think about it, whether through programming, math, or just visually. But there is a lot more to it than that. Just getting the model is easy enough, but we need to be able to use it and understand the details better, so we'll get into that now!
## What Do We Do with a Model? {#sec-lm-what-do-we-do}
Once we have a working model, there are two primary ways we can use it. One way to use a model is to help us understand the relationships between the features and our outcome of interest. In this way, the focus can be said to be on **explanation**, or interpreting the model results. The other way to use a model is to make estimates about the target for specific observations, often ones we haven't seen in our data. In this case, the focus is on **prediction**. In practice, we often do both, but the focus is usually on one or the other. We'll cover both in detail eventually, but let's start with prediction.
### Prediction {#sec-lm-prediction}
It may not seem like much at first, a model's utility often lies in its ability to make predictions about the world around us. Once our model has been *fit* to the data, we can obtain our predictions by plugging in values for the features that we are interested in, and, using the corresponding weights and other parameters that have been estimated, come to a guess about a specific observation. Let's go back to our results, shown in the following table.
\small
```{r}
#| echo: false
#| label: tbl-my-first-model-output-2
#| tbl-cap: Linear Model Output
broom::tidy(model_lr_rating, conf.int = TRUE) |>
janitor::clean_names() |>
rename(feature = term) |>
mutate(feature = ifelse(feature == '(Intercept)', 'intercept', feature)) |>
gt()
```
\normalsize
The table shows the **coefficient** for the feature and the intercept, which is our starting point. In this case, the coefficient for word count is `r wc_coef`, which means that for every additional word in the review, the rating goes `r ifelse(sign(wc_coef) == 1, 'up', 'down')` by `r wc_coef` stars. So if we had a review that was `r n_words` words long, we would *predict* a rating of `r paste(intercept)` + **`r glue('{n_words}*{wc_coef}')`** = `r round(predict(model_lr_rating, newdata = data.frame(word_count = n_words)), 1)` stars.
When we're talking about the predictions (or outputs) for a linear model, we usually will see this as the following mathematically:
$$
\hat{y} = b_0 + b_1x_1 + b_2x_2 + ... + b_nx_n
$$ {#eq-lm-prediction}
What is $\hat{y}$? The 'hat' over the $y$ just means that it's a predicted, or 'expected', value of the model, i.e., the output. This distinguishes it from the target value we actually observe in the data. Our first equations that just used $y$ implicitly suggested that we would get a perfect rating value given the model, but that's not the case. We can only get an estimate. The $\hat{y}$ is also the linear predictor in our graphical version (@fig-graph-lm), which makes clear it is not the actual target, but a combination of the features that is related to the target.
To make our first equation (@eq-lm-basic) accurately reflect the relationship between the target and our features, we need to add what is usually referred to as an **error term**, $\epsilon$, to account for the fact that our predictions will not be perfect[^perfect_prediction]. So the full linear (regression) model is:
[^perfect_prediction]: In most circumstances, if you ever have perfect prediction, or even near perfect prediction, the usual issues are that you have either asked a rather obvious/easy question of your data (e.g., predicting whether an image is of a human or a car), or have accidentally included the target in your features (or a combination of them) in some way.
$$
y = b_0 + b_1x_1 + b_2x_2 + ... + b_nx_n + \epsilon
$$ {#eq-lm-observed}
The error term is a random variable that represents the difference between the actual value and the predicted value, which comes from the weighted combination of features. We can't know what the error term is, but we can estimate its values, often called **residuals** or just prediction errors, as well as parameters associated with it, just like we can the coefficients. We'll talk more about that in the section on estimation (@sec-estimation).
Another way to write the model formally is:
$$y|X,\beta, \sigma = \mathrm{N}(\mu, \sigma^2)$$
$$
\mu = b_0 + b_1x_1 + b_2x_2 + ... + b_nx_n
$$ {#eq-lm-formal}
or
$$
y | X,\beta, \sigma = b_0 + b_1x_1 + b_2x_2 + ... + b_nx_n + \epsilon
$$
$$
\epsilon \sim \mathrm{N}(0, \sigma^2)
$$ {#eq-lm-error}
This makes explicit that the target is assumed to be **conditionally** normally distributed with a mean of based on the linear combination of the features, and a variance of $\sigma^2$. What do we mean by *conditionally*? This means that, given the features and the estimated model parameters, the target follows a normal distribution ($\mathrm{N}()$). This is the standard assumption for linear regression, and it's a good one to start with, but it's not our only option. We'll talk more about this later in this chapter (@sec-lm-assumptions), and see what we might do differently in @sec-glm. We will also see that we can estimate the model parameters without any explicit reference to a probability distribution @sec-estimation.
:::{.callout-note title='Predictions by any other name...' collapse='true'}
You'll often see predictions referred to as **fitted values**, but these imply we are only talking about the observed data features the model was trained on or 'fit' to. Predictions can also be referred to as **expected values**, **estimates**, **outputs**, or **forecasts**, the latter is especially common in time series analysis. Within generalized linear models and others where there may ultimately be a transformation of the output, you may see it referred to as a **linear predictor**.
:::
### What kinds of predictions can we get? {#sec-lm-prediction-types}
What predictions we can get depends on the type of model we are using. For the linear model we have at present, we can get predictions for the target, which is a **continuous variable**. Very commonly, we also can get predictions for a **categorical target**, such as whether the rating is 'good' or 'bad'. This simple breakdown pretty much covers everything, as we typically would be predicting a continuous numeric variable or a categorical variable, or more of them, like multiple continuous variables, or a target with multiple categories, or sequences of categories (e.g., words).
In our case, we can get predictions for the rating, which is a number between 1 and 5. Had our target been a binary good vs. bad rating, our predictions would still be numeric for most models, and usually expressed as a probability between 0 and 1, say, for the 'good' category, or in an initial form that is then transformed to a probability. For example, in the context of predicting a good rating, higher probabilities would mean we'd more likely predict the movie is good, and lower probabilities would mean we'd more likely predict the movie is bad. We then would convert that probability to a class of good or bad depending on a chosen probability cutoff. We'll talk about how to get predictions for categorical targets later[^treepreds].
[^treepreds]: Some models, such as the tree approaches outlined in @sec-ml-common-trees, can directly predict categorical targets, but we still like, and often\ prefer using a probability.
We previously saw a prediction for a single observation where the word count was 10 words, but we can also get predictions for multiple observations at once. In fact, we can get predictions for all observations in our data. Besides that, we can also get predictions for observations that we don't even have data for, which is a very neat thing to be able to do! The following shows how we can get predictions for all data, and for a single observation with a word count of [^tibble].
[^tibble]: Some not as familiar with R should be aware that tibbles are a type of data frame. The name distinguishes them from the standard data frame, and they have some additional features that make them more user friendly.
::: {.panel-tabset}
##### R
```{r}
#| label: my-first-model-predictions-r
all_predictions = predict(model_lr_rating)
df_prediction = tibble(word_count = 5)
single_prediction = predict(model_lr_rating, newdata = df_prediction)
```
##### Python
```{python}
#| label: my-first-model-predictions-py
all_predictions = model_lr_rating.predict()
df_prediction = pd.DataFrame({'word_count': [5]})
single_prediction = model_lr_rating.predict(df_prediction)
```
:::
Here is a plot of our predictions for the observed data versus the actual ratings[^jitter].
The reference line is where the points would fall if we had perfect prediction. We can see that the predictions are definitely not perfect, but we don't expect this. They are not completely off base either, in that generally higher predicted scores are associated with higher observed values. We'll talk about how to assess the quality of our predictions later, but we can at least get a sense that we have a correspondence between our predictions and target, which is definitely better than not having a relationship at all!
[^jitter]: Word count is **discrete**- it can only take whole numbers like 3 or 20, and it is our only feature. Because of this, we can only make very limited predicted rating values, while the observed rating can take on many other values. Because of this, the raw plot would show a more banded result with many points overlapping, so we use a technique called **jittering** to move the points around a little bit so we can see them all. The points are still roughly in the same place, but they are moved around a little bit so we can see them all.
```{r}
#| echo: false
#| eval: false
#| label: fig-my-first-model-predictions-plot
#| fig-cap: Predicted vs. Observed Ratings
p = df_reviews |>
mutate(pred = all_predictions) |>
ggplot(aes(pred, rating)) +
geom_point(position = position_jitter()) +
geom_abline(intercept = 0, slope = 1, color = okabe_ito['darkblue'], size = 2) +
labs(
x = 'Predicted Rating',
y = 'Observed\nRating',
title = '',
# title = 'Predictions vs. Actual Ratings',
caption = 'Points have been jittered for better visibility.'
) +
theme(
plot.caption = element_text(vjust = -2)
)
p
# p_gray = p + scale_color_grey()
ggsave('img/lm-my-first-model-predictions-plot.svg', width = 8, height = 6)
```
![Predicted vs. Observed Ratings](img/lm-my-first-model-predictions-plot.svg){#fig-my-first-model-predictions-plot width=75%}
We saw what our prediction looks like for a single observation, and now we'll add in a few more- one for 10 words, and one for a 50 word review, which is beyond the length of any review in this dataset, and one for 12.3 words, which isn't even possible for this data, since words are only counted as whole values. To get these values we just use the same prediction approach as before, and we specify the word count value we want to predict for.
```{r}
#| echo: false
#| label: tbl-predictions
#| tbl-cap: Predictions for Specific Observations
tibble(
word_count = c(5, 10, 12.3, 50),
pred = predict(model_lr_rating, newdata = data.frame(word_count = word_count))
) |>
rename('Word Count' = word_count, 'Predicted Rating' = pred) |>
gt() |>
fmt_number(decimals = 1)
```
The values reflect the `r word_sign(wc_coef, c('positive', 'negative'))` coefficient from our model, showing `r word_sign(wc_coef, c('increasing', 'decreasing'))` ratings with increasing word counts. Furthermore, we see the power of the model's ability to make predictions for what we don't see in the data. Maybe we limited our data review size, but we know there are 50 word reviews out there, and we can still make a guess as to what the rating would be for such a review. Maybe in another case, we know a group of people who have on average 12.3 word reviews, and we can make a guess as to what the predicted rating would be for that group. Our model doesn't literally know anything about the context of the data, but we can use our knowledge to make predictions that are meaningful to us. This is a very powerful capability, and it's one of the main reasons we use models in the first place.
### Prediction error {#sec-lm-prediction-error}
As we have seen, predictions are not perfect, and an essential part of the modeling endeavor is to better understand these errors and why they occur. In addition, error assessment is the fundamental way in which we assess a model's performance, and, by extension, compare that performance to other models. In general, prediction error is the difference between the actual value and the predicted value or some function of it. In statistical models, it is also often called the **residual**. We can look at these individually, or we can look at them in aggregate with a single metric.
Let's start with looking at the residuals visually. Often the modeling package you use will have this as a default plotting method when doing a standard linear regression, so it's wise to take advantage of it. We plot both the distribution of raw error scores and the cumulative distribution of absolute prediction error. Here we see a couple things. First, the distribution appears roughly normal, which is a good thing, since statistical linear regression assumes our error is normally distributed, and the prediction error serves as an estimate of that. Second, we see that the mean of the errors is zero, which is a property of linear regression, and the reason we look at other metrics besides a simple 'average error' when assessing model performance. We can also see that our average *absolute* error is around `r round(mean(abs(resid(model_lr_rating))), 1)`, most of our predictions (>90%) are within ±1 star rating.
:::{.content-visible when-format='html'}
```{r}
#| echo: false
#| label: fig-my-first-model-error-plot
#| fig-cap: Distribution of Prediction Errors
#| out-width: 100%
# not clear what after_stat actually does here ('count' by itself is not related to any count from the histogram), so 7 is a magic number to align the density with the histogram
p_hist = tibble(
error = resid(model_lr_rating)
) |>
ggplot(aes(error)) +
geom_histogram(alpha = .8) +
geom_density(aes(y = after_stat(count / 7)), color = okabe_ito['red'], linewidth = 1) +
geom_vline(xintercept = 0, color = okabe_ito['darkblue'], size = 1.5, alpha = 1) +
labs(
x = 'Prediction Error',
y = 'Count',
title = 'Distribution of Prediction Errors',
) +
theme(
plot.caption = element_text(vjust = -2)
)
p_ecdf = tibble(res = abs(resid(model_lr_rating))) |>
arrange(res) |>
mutate(prop = 1 - row_number() / n()) |>
ggplot(aes(x = res)) +
geom_vline(xintercept = 1, color = okabe_ito['darkblue'], alpha = 1) +
geom_vline(xintercept = mean(abs(resid(model_lr_rating))), color = okabe_ito['darkblue'], alpha = 1) +
stat_ecdf(size = 1, color = okabe_ito['orange']) +
stat_ecdf(aes(ymin = 0, ymax = ..y..), size = 1, fill = okabe_ito['orange'], geom = 'ribbon', alpha = .1) +
scale_y_continuous(breaks = seq(10, 100, 10) / 100, labels = scales::percent) +
labs(
x = 'Absolute Prediction Error',
y = 'Cumulative\nProportion',
# title = 'Prediction Errors',
caption = "Vertical lines show mean absolute error\nand a '1 star' error."
) +
theme(
plot.caption = element_text(vjust = -2)
)
p_hist + p_ecdf &
theme(
plot.title = element_text(size = 12),
axis.title.x = element_text(size = 10),
axis.title.y = element_text(size = 10)
)
ggsave('img/lm-my-first-model-error-plot.svg', width = 8, height = 6)
```
:::
:::{.content-visible when-format='pdf'}
![Distribution of Prediction Errors](img/lm-my-first-model-error-plot.svg){width=100% #fig-my-first-model-error-plot}
:::
```{r}
#| echo: false
#| label: get_worst_prediction
worst_prediction = df_reviews |>
mutate(prediction = predict(model_lr_rating)) |>
slice(which.max(abs(resid(model_lr_rating)))) |>
select(rating, prediction, word_count)
```
Of more practical concern is that we don't see extreme values or clustering, which might indicate a failure on the part of the model to pick up certain segments of the data. It can still be a good idea to look at the extremes just in case we can pick up on some aspect of the data that we could potentially incorporate into the model. So looking at our worst prediction in absolute terms, we see the observation has a typical word count, and so our simple model will just predict a fairly typical rating. But the actual rating is `r worst_prediction$rating`, which is `r abs(round(worst_prediction$rating - worst_prediction$prediction, 1))` away from our prediction, a very noticeable difference. Further data inspection may be required to figure out why this came about, and this is a process you should always be prepared to do when you're working with models.
```{r}
#| echo: false
#| label: tbl-worst-prediction
#| tbl-cap: Worst Prediction
worst_prediction |>
gt() |>
fmt_number(c(rating, prediction), decimals = 1) |>
fmt_number(word_count, decimals = 0)
```
### Prediction uncertainty
We can also look at the uncertainty of our predictions, which is a measure of how much we expect our predictions to vary. Not only are they off from the observed value, our predictions themselves are just a guess based on the data we have, and we'd like to know how much we can trust them.
This trust is often expressed as an interval range of values that we expect our prediction to fall within, with a certain level of confidence. But! There are actually two types of intervals we can get. One is really about the mean prediction, or *expected value* we would get from the model at that observation. This is usually called a **confidence interval**. The other type of interval is based on the model's ability to predict new data, and is often called a **prediction interval**. This interval is about the actual prediction we would get from the model for any value, whether it was data we had seen before or not. Because of this, the prediction interval is always wider than the confidence interval, and it's the one we usually want to use when we're making predictions about new data.
Here is how we can obtain these from our model.
:::{.panel-tabset}
##### R
```{r}
#| label: my-first-model-prediction-intervals-r
#| eval: true
#| results: hide
prediction_CI = predict(
model_lr_rating,
newdata = df_prediction,
se.fit = TRUE, # standard error of the fit
interval = 'confidence'
)
prediction_PI = predict(
model_lr_rating,
newdata = df_prediction,
se.fit = TRUE,
interval = 'prediction'
)
pred_intervals = bind_rows(
as_tibble(prediction_CI$fit),
as_tibble(prediction_PI$fit),
) |> mutate(
interval = c('confidence', 'prediction'),
type = c('mean', 'observation')
)
pred_intervals
```
##### Python
```{python}
#| label: my-first-model-prediction-intervals-py
#| eval: false
#| out-width: 100%
# contains both confidence ('mean_') and prediction ('obs_') intervals
pred_intervals = (
model_lr_rating
.get_prediction(df_prediction)
.summary_frame(alpha = 0.05)
)
pred_intervals
```
:::
```{r}
#| echo: false
#| label: tbl-prediction-intervals
#| tbl-cap: Prediction Intervals for Specific Observations
pred_intervals |>
select(interval, type, fit:upr) |>
gt() |>
fmt_number(c('fit', 'lwr', 'upr'), decimals = 2)
```
As expected, our prediction interval is wider than our confidence interval, and we can see that the prediction interval is quite wide- from a rating of `r round(pred_intervals$lwr[2], 1)` to `r round(pred_intervals$upr[2], 1)`. This is a consequence of the fact that we have a lot of uncertainty in our predictions for new observations, and we can't expect to get a very precise prediction from our model with only one feature. This is a common issue with many models, and one that having a better model can help remedy. We can also plot these intervals across a range of values to get a better sense of what they look like. Let's do so for all observed word counts.
```{r}
#| echo: false
#| eval: false
#| label: fig-prediction-intervals
#| fig-cap: Prediction & Confidence Intervals Compared
prediction_CI = predict(
model_lr_rating,
se.fit = TRUE,
interval = 'confidence'
)
prediction_PI = predict(
model_lr_rating,
se.fit = TRUE,
interval = 'prediction'
)
pred_intervals = bind_rows(
as_tibble(prediction_CI$fit) |> mutate(interval = 'Confidence'),
as_tibble(prediction_PI$fit) |> mutate(interval = 'Prediction'),
) |>
mutate(word_count = rep(df_reviews$word_count, 2))
pred_intervals |>
ggplot(aes(word_count, fit)) +
geom_ribbon(
aes(ymin = lwr, ymax = upr, fill = interval),
alpha = .25,
data = pred_intervals |> filter(interval == 'Prediction')
) +
geom_ribbon(
aes(ymin = lwr, ymax = upr, fill = interval),
alpha = .8,
data = pred_intervals |> filter(interval == 'Confidence')
) +
geom_line() +
# geom_point(data = df_reviews, aes(word_count, rating), color = okabe_ito[['darkblue']]) +
scale_fill_manual(values = c(okabe_ito[['orange']], okabe_ito[['orange']])) +
labs(
x = 'Word Count',
y = 'Predicted Rating',
title = 'Prediction Interval Estimates for Word Count',
color = 'Interval Type'
)
ggsave('img/lm-prediction-intervals-plot.svg', width = 8, height = 6)
```
![Prediction & Confidence Intervals Compared](img/lm-prediction-intervals-plot.svg){width=100% #fig-prediction-intervals}
Once you move past simpler linear models and generalized linear models, obtaining uncertainty estimates for predictions is difficult, and tools to do so can be scarce. This is especially the case for models used in machine learning contexts, and relatively rare for deep learning approaches. In practice, you can use bootstrapping (@sec-unc-bootstrap) to get a sense of the uncertainty, but this is often not a good estimate in many data scenarios and can be computationally expensive. Bayesian approaches (@sec-unc-bayes) can also provide estimates of uncertainty, but likewise are computationally expensive, and require a good deal of expertise to implement for more complex settings. Quantile regression (@sec-lm-extend-quantile) can sometimes be appropriate to estimate predictions at different quantiles that can serve as a proxy for prediction intervals, but tools to do so for various models are uncommon. On the other hand, **conformal prediction** tools are becoming more popular, and can provide a more reliable estimate of prediction uncertainty for any type of model. Yet they too are computationally expensive for more accurate estimates, and good tools are only recently becoming available[^confpred].
[^confpred]: For a good intro to conformal prediction, see @angelopoulos_gentle_2022. The [mapie]{.pack} is a good tool for Python, and the [tidymodels family](https://www.tidymodels.org/learn/models/conformal-regression/) has recently added this functionality via the [probably]{.pack} package. Michael has a mostly complete blog post on conformal prediction that got lost in the shuffle, but will be available at some point as well.
So at this point you have the gist of prediction, prediction error, and uncertainty in a prediction, but there is still more to modeling! We'll come back to global assessments of model error very shortly, and even more detail can be found in [Chapter -@sec-knowing] where we dive deeper into our models and how they work, and [Chapter -@sec-estimation], where we see how to estimate the parameters of our model by picking those that will reduce the prediction error the most. For now though, let's move on to the other main use of models, **explanation**, where the focus is on understanding the relationships between the features and the target.
## How Do We Interpret the Model? {#sec-lm-interpretation}
When it comes to interpreting the results of our model, there are a lot of tools at our disposal, though many of the tools we can ultimately use will depend on the specifics of the model we have employed. In general though, we can group our approach to understanding results at the **feature level** and the **model level**. A feature level understanding regards the relationship between a single feature and the target. Beyond that, we also attempt comparisons of feature contributions to prediction, i.e., relative importance. Model level interpretation is focused on assessments of how well the model 'fits' the data, or more generally, predictive performance. We'll start with the feature level, and then move on to the model level.
### Feature level interpretation {#sec-lm-interpretation-feature}
As mentioned, at the feature level, we are primarily concerned with the relationship between a single feature and the target, for whatever features are of interest. More specifically, we are interested in the direction and magnitude of the relationship, but in general, it all boils down to how a feature induces change in the target. For numeric features, we are curious about the change in the target given some amount of change in the feature values. It's conceptually the same for categorical features, but often we like to express the change in terms of group mean differences or something similar, since the order of categories is not usually meaningful. An important aspect of feature level interpretation is the specific predictions we can get by holding the data at key feature values.
Let's start with the basics by looking again at our coefficient table from the model output.
\small
```{r}
#| echo: false
#| label: tbl-my-first-model-output
#| tbl-cap: Linear Regression Coefficients
broom::tidy(model_lr_rating, conf.int = TRUE) |>
janitor::clean_names() |>
rename(feature = term) |>
mutate(feature = ifelse(feature == '(Intercept)', 'intercept', feature)) |>
gt() |>
tab_style(
style = cell_text(weight = 'bold'),
locations = cells_body(columns = estimate)
)
```
\normalsize
Here, the main thing to look at are the actual feature coefficients and the direction of their relationship, positive or negative. The coefficient for word count is `r wc_coef`, and this means that for every additional word in the review, the rating goes `r ifelse(sign(wc_coef) == 1, 'up', 'down')` by `r wc_coef`. This interpretation gives us directional information, but how can we interpret the magnitude of the coefficient?
Let's try and use some context to help us. While a drop of `r round(wc_coef, 2)` might not mean much to us in terms of ratings, we might not be as sure about a change in one word for a review. However, we do know the standard deviation of the rating score, i.e., how much it moves around naturally on its own, is `r sd_y`. So the coefficient is about `r round(abs(wc_coef) / sd_y * 100, 0)`% of the standard deviation of the target. In other words, the addition of a single word to a review results in an expected `r word_sign(wc_coef, c('increase', 'decrease'))` of `r round(abs(wc_coef) / sd_y * 100, 0)`% of what the review would normally bounce around in value. We might not consider this large, but also, a single word change isn't much. What would be a significant change in word count? Let's consider the standard deviation of the feature. In this case, it's `r sd_x` for word count. So if we increase the word count by one standard deviation, we expect the rating to `r word_sign(wc_coef, c('increase', 'decrease'))` by `r glue('{wc_coef} * {sd_x} = {round(wc_coef * sd_x, 2)}')`. That decrease then translates to a change of `r round(wc_coef * sd_x, 2)`/`r sd_y` = `r round(sd_x/sd_y * wc_coef, 2)` standard deviation units of the target. Without additional context, many would think that's a significant change[^effectsize], or at the very least, that the coefficient is not negligible, and that the feature is indeed related to the target. But we can also see that the coefficient is not so large that it's not believable in this context.
[^effectsize]: Historically, people cite @cohen_statistical_2009 for effect size guidelines for simple models, but such guidelines are notoriously problematic. Rely on your own knowledge of the data, provide reasons for your conclusions, and let others draw their own. If you cannot tell what would constitute a notable change in your outcome of interest, you probably don't know the target well enough to interpret the model regarding it, and need to do some more research.
::: {.callout-note title='Standardized Coefficients' collapse='true'}
The calculation we just did results in what's often called a **standardized** or 'normalized' coefficient. In the case of the simplest model with only one feature like this, it is identical to the Pearson r correlation metric, which we invite you to check and confirm on your own. In the case of multiple features, it represents a (partial) correlation between the target and the feature, after adjusting for the other features. But before you start thinking of it as a measure of *importance*, it is not. It provides some measure of the feature-target linear relationship, but that does not entail *practical* importance, nor is it useful in the presence of nonlinear relationships, interactions, and a host of other interesting things that are typical to data and models.
:::
After assessing the coefficients, next up in our table is the **standard error**. The standard error is a measure of how much the coefficient varies from sample to sample. If we collected the data multiple times, even under practically identical circumstances, we wouldn't get the same value each time - it would bounce around a bit, and the standard error is an estimate of how much it would bounce around. In other words, the standard error is a measure of **uncertainty**, and along with the coefficients, it's used to calculate everything else in the table.
<!-- tab_style is ignored for pdf so put a conditional to only display for html -->
::: {.content-hidden when-format='pdf'}
```{r}
#| echo: false
#| label: tbl-my-first-model-output-se
#| tbl-cap: Linear Regression Statistical Output
broom::tidy(model_lr_rating, conf.int = TRUE) |>
janitor::clean_names() |>
rename(feature = term) |>
mutate(feature = ifelse(feature == '(Intercept)', 'intercept', feature)) |>
gt() |>
tab_style(
style = cell_text(weight = 'bold'),
locations = cells_body(columns = estimate:p_value)
)
```
:::
The statistic, here a t-statistic from the student t distribution[^coefstats], is the ratio of the coefficient to the standard error. This gives us a sense of the effect relative to its variability, but the statistic's primary use is to calculate the **p-value** related to its distribution[^pvaluet], which is the probability of seeing a coefficient as large (or larger) as the one we have, *if* we assume from the outset that the true value of the coefficient is zero, *and* the model assumptions are true as well. In this case, the p-value is `r scales::label_scientific()(summary(model_lr_rating)$coefficients[2, 4])`, which is extremely small. We can conclude that the coefficient is statistically different from zero, and that the feature is related to the target, at least statistically speaking. However, the interpretation we used regarding the coefficient previously is far more useful than the p-value, as the p-value can be affected by many things not necessarily related to the feature-target relationship, such as sample size.
[^coefstats]: Most statistical tables of this sort will use a t (student t distribution), Z (normal distribution), or F (F distribution) statistic. It doesn't really matter for your purposes which is used by default, but the distribution is used to provide the p-value of interest and claim statistical significance (or not).
[^pvaluet]: You can calculate this as `pt(stat, df = model degrees of freedom, lower=FALSE)*2` in R, or use`stats.t.cdf` in Python. The model degrees of freedom provided in the summary output (a.k.a. residual degrees of freedom), are used when obtaining the two-sided p-value, which is what we want in this case. When it comes to t and Z statistics, anything over 2 is statistically significant by the common standard of a p-value of .05 or less. Note that even though output will round it to zero, the true p-value can never be zero.
Aside from the coefficients, the most important output is the **confidence interval** (CI). The CI is a range of values that encapsulates the uncertainty we have in our guess about the coefficients. While our best guess for the effect of word count on rating is `r round(wc_coef, 2)`, we know it's not *exactly* that, and the CI gives us a range of reasonable values we might expect the effect to be based on the data at hand and the model we've employed.
:::{.content-hidden when-format='pdf'}
```{r}
#| echo: false
#| label: tbl-my-first-model-output-ci
#| tbl-cap: Linear Regression Confidence Intervals
broom::tidy(model_lr_rating, conf.int = TRUE) |>
janitor::clean_names() |>
rename(feature = term) |>
mutate(feature = ifelse(feature == '(Intercept)', 'intercept', feature)) |>
gt() |>
tab_style(
style = cell_text(weight = 'bold'),
locations = cells_body(columns = c(conf_low, conf_high))
)
```
:::
In this case, the default is a 95% confidence interval, and we can think of this particular confidence interval like [throwing horseshoes](https://en.wikipedia.org/wiki/Horseshoes_(game)). If we kept collecting data and running models, 95% of our CIs would capture the true value, and this is one of the many possible CIs we could have gotten. That's the technical definition, which is a bit abstract[^whatisaci], but we can also think of it more simply as a range of values that are good guesses for the true value. In this case, the CI is `r paste(rl_ci[1])` to `r paste(rl_ci[2])`, and we can be 95% confident that it is a good ranged estimate for the true coefficient value, whatever it may be. We can also see that the CI is relatively narrow, which is also nice to see, as it implies that we have a good idea of what the coefficient is. If it was very wide, we would have a lot of uncertainty about the coefficient, and we would not likely not want to base important decisions regarding it.
[^whatisaci]: The interpretation regarding the CI is even more nuanced than this, but we'll leave that for another time. For now, we'll just say that the CI is a range of values that are good guesses for the true value. Your authors have used frequentist and Bayesian statistics for many years, and we are fine with both of them, because they both work well enough in the real world. Despite where this ranged estimate comes from, the vast majority use CIs in the same way, and they are a useful tool for understanding the uncertainty in our estimates.
```{r}
#| echo: false
#| label: fig-ci-plot
#| fig-cap: Confidence Intervals as Horseshoes
# from https://en.wikipedia.org/wiki/Horseshoes_(game)
# knitr::include_graphics('img/horseshoes.jpeg')
```
:::{.content-hidden when-format='pdf'}
![Two Confidence Intervals](img/horseshoes.jpeg){width='40%'}
:::
Keep in mind that your chosen model has a great influence on what you'll be able to say at the feature level. As an example, as we get into machine learning models, you won't have as easy a time with coefficients and their confidence intervals, but you still may be able to say something about how your features relate to the target, and we'll continue to return to the topic. But first, let's take a look at interpreting things in another way.
::: {.callout-note title='Hypothesis Testing' collapse='true'}
The confidence interval and p-value will for coefficients in typical statistical linear models will coincide with one another in that, if for a given alpha significance level, if a 1-alpha% CI includes zero, then your p-value will be greater than alpha, and vice versa. This is because the same standard error is used to calculate both. However, the framework of using a CI vs. using the p-value for claiming statistical significance actually came from individuals that were philosophically opposed. Modern day usage of both is a bit of a mess that would upset both Fisher (p-value guy) and Neyman (CI guy), but your authors find that this incorrect practical usage doesn't make much practical difference in the end.
:::
### Model level interpretation {#sec-lm-interpretation-model}
```{r}
#| echo: false
#| label: model-metrics
# need to update later usage to use mr_metrics object or named with mr_prefix
mr_metrics = performance::performance(model_lr_rating)
rsq = round(mr_metrics$R2, 2)
rsq_perc = 100 * rsq
mse = round(mr_metrics$RMSE^2, 2)
rmse = round(mr_metrics$RMSE, 2)
mae = round(performance::performance_mae(model_lr_rating), 2)
aic = round(mr_metrics$AIC, 2)
```
So far, we've focused on interpretation at the feature level. But knowing the interpretation of a feature doesn't do you much good if the model itself is poor! In that case, we also need to assess the model as a whole, and as with the feature level, we can go about this in a few ways. Before getting too carried away with asking whether your model is any good or not, you always need to ask yourself *relative to what*? Many models claim top performance under various circumstances, but which are statistically indistinguishable from many other models. So we need to be careful about how we assess our model, and what we compare it to.
#### Predictions versus observed {#sec-lm-interpretation-pred-obs .unnumbered}
When we looked at the models previously in @fig-my-first-model-predictions-plot, we examined how well the predictions and target line up, and that gave us an initial feel for how well the model fits the data. Most model-level interpretation involves assessing and comparing model fit and variations on this theme. Here we show how easy it is to obtain such a plot.
::: {.panel-tabset}
##### R
```{r}
#| echo: true
#| eval: false
#| label: pred-vs-obs-plot-r
predictions = predict(model_lr_rating)
y = df_reviews$rating
ggplot(
data = data.frame(y = y, predictions = predictions),
aes(x = y, y = predictions)
) +
geom_point() +
labs(x = 'Predicted', y = 'Observed')
```
##### Python
```{python}
#| echo: true
#| eval: false
#| label: pred-vs-obs-plot-py
import matplotlib.pyplot as plt
predictions = model_lr_rating.predict()
y = df_reviews.rating
plt.scatter(y, predictions)
```
:::
```{r}
#| echo: false
#| eval: false
#| label: fig-pp-scatter
#| fig-cap: Predictions vs. Observed Ratings
# This really isn't necessary. Delete at some point.
p1 = df_reviews |>
mutate(pred = predict(model_lr_rating)) |>
ggplot(aes(pred, rating)) +
geom_point() +
# geom_point(position = position_jitter()) +
geom_abline() +
labs(
x = 'Predicted Rating',
y = 'Observed Rating',
subtitle = 'Raw values'
)
p2 = df_reviews |>
mutate(pred = predict(model_lr_rating)) |>
ggplot(aes(pred, rating)) +
geom_point(position = position_jitter(), alpha = .25) +
geom_abline() +
labs(
x = 'Predicted Rating',
y = 'Observed Rating',
subtitle = 'Jittered values'
)
p1 + p2
```
#### Model metrics {#sec-lm-interpretation-model-metrics .unnumbered}
We can also get an overall assessment of the prediction error from a single metric. In the case of the linear model we've been looking at, we can express this as the sum or mean of our squared errors, the latter of which is a very commonly used modeling metric- **MSE** or **mean squared error**. Its square root - **RMSE** or **root mean squared error**[^msermse] - is also very commonly used. We'll talk more about this and similar metrics elsewhere (@sec-knowing-model-metrics), but we can take a look at the RMSE for our model now.
If we look back at our results, we can see this expressed as the part of the output or as an attribute of the model[^sigmanotrmse]. The RMSE is more interpretable, as it gives us a sense that our typical errors bounce around by about `r model_rmse`. Given that the rating is on a 1-5 scale, this maybe isn't bad, but we could definitely hope to do better than get within roughly half a point on this scale. We'll talk about ways to improve this later.
[^msermse]: Any time we're talking about MSE for performance, we're also talking about RMSE, as it's just the square root of MSE, so which you choose is mostly arbitrary. Taking the square root makes the metric more interpretable, as it's in the same units as the target, but it's not necessary to claim one model performs better than another.
[^sigmanotrmse]: The actual divisor for linear regression output depends on the complexity of the model, and in this case the sum of the squared errors is divided by N-2 (due to estimating the intercept and coefficient) instead of N. This is a technical detail that would only matter for data too small to generalize beyond anyway, and not important for our purposes here.
:::{.panel-tabset}
##### R
```{r}
#| label: my-first-model-mse-r
# summary(model_lr_rating) # 'Residual standard error' is approx RMSE
summary(model_lr_rating)$sigma # We can extract it directly
```
##### Python
```{python}
#| label: my-first-model-mse-py
np.sqrt(model_lr_rating.scale) # RMSE
```
:::
Another metric we can use to assess **model fit** in this particular situation is the **mean absolute error** (MAE). MAE is similar to the mean squared error, but instead of squaring the errors, we just take the absolute value. Conceptually it attempts to get at the same idea, how much our predictions miss the target on average, and here the value is `r mae`, which we actually showed in our residual plot (@fig-my-first-model-error-plot). With either metric, the closer to zero the better, since as we get closer, we are reducing prediction error.
We can also look at the **R-squared** (R^2^) value of the model. R^2^ is possibly the most popular measure of model performance with linear regression and linear models in general. Before squaring, it's just the correlation of the predicted versus observed values that we saw in the previous plot (@fig-my-first-model-predictions-plot). When we square it, we can interpret it as a measure of how much of the variance in the target is explained by the model. In this case, our model shows the R^2^ is `r rsq`, which is not bad for a single feature model in this type of setting. We interpret the value as `r rsq_perc`% of the target variance is explained by our model, and more specifically by the features in the model. In addition, we can also interpret R^2^ as 1 - the proportion of error variance in the target, which we can calculate as $1 - \frac{\textrm{MSE}}{var(y)}$. In other words the complement of R^2^ is the proportion of the variance in the target that is not explained by the model. Either way, since `r 100-rsq_perc`% is not explained by the model, our result suggests there is plenty of work left to do!
Note also, that with R^2^ we get a sense of the variance shared between *all* features in the model and the target, however complex the model gets. As long as we use it descriptively as a simple correspondence assessment of our predictions and target, it's a fine metric. For various reasons, it's not a great metric for comparing models to each other, but again, as long as you don't get carried away, it's okay to use.
### Prediction vs. explanation {#sec-lm-prediction-vs-explanation}
In your humble authors' views, one can't stress enough the importance of a model's ability to predict the target. It can be a poor model, maybe because the data is not great, or perhaps we're exploring a new area of research, but we'll always be interested in how well a model **fits** the observed data. In situations where we're focused on how things will work out in the future, we're just as much or even more interested in how well a model predicts *new* data.
In many settings, **statistical significance** is focused on a great deal, and much is made about models that may actually have little predictive power. As strange as it may sound to some, you can read results in journal articles, news features, and business reports in many fields with hardly any mention of a model's predictive capability. In these cases, the focus is almost entirely on the **explanation** of the model, and usually the statistical significance of the features with regard to their relationship to the target.
In those settings, statistical significance is often used as a proxy for importance, though this is rarely ever justified. As we've noted elsewhere, statistical significance is affected by other things besides the size of the coefficient. And without an understanding of the context of the features, in this case, like how long typical reviews are, what their range is, what variability of ratings is, etc., the information it provides is extremely limited, and many would argue, not very useful.
If we are very interested in the coefficient or weight value specifically, it is better to focus on the range of possible values. This is provided by the confidence interval, along with the predictions that come about based on that coefficient's value, which will likewise have interval estimates. Like statistical significance, a confidence interval is also a 'loaded' description of a feature's relationship to the target, not without issues. However, we can use it in a very practical way as a range of possible values for that feature's weight, and more importantly, *think of possibilities rather than certainties*.
Suffice it to say at this point, that how much one focuses on prediction versus explanation depends on the context and goals of the data endeavor. There are cases where predictive capability is of utmost importance, and we care less about explanatory details, but not to the point of ignoring it. For example, even with deep learning models for image classification, where the inputs are just RGB values from an image, we'd still like to know what the (notably complex) model is picking up on, otherwise we may be classifying images based on something like image backgrounds (e.g., outdoors vs. indoors) instead of the objects of actual interest (dogs vs. cats). In some business or other organizational settings, we are very, or even mostly, interested in the coefficients/weights, which might indicate how to allocate monetary resources in some fashion. But if those weights come from a model with no predictive power, placing much importance on them may be a fruitless endeavor.
In the end we'll need to balance our efforts to suit the task at hand. Prediction and explanation are both fundamental to the modeling endeavor. We return to this topic again in the chapter on causal models (@sec-causal-prediction-explanation).
## Adding Complexity {#sec-lm-complexity}
We've seen how to fit a model with a single feature and interpret the results, and that helps us to get oriented to the general process when using a linear model. However, we'll always have more than one feature for a model except under some very specific circumstances, such as exploratory data analysis. So let's see how we can implement a model with more features and that makes more practical sense.
### Multiple features {#sec-lm-multiple-features}
We can add more features to our model very simply. Using the standard functions we've already demonstrated, we just add them to the formula as follows[^rform].
[^rform]: This is the case for both R and [statsmodels]{.pack}.
```{python}
#| eval: false
#| label: lm-extra-features-formula
'y ~ feature_1 + feature_2 + feature_3'
```
In other cases where we use matrix inputs, additional features will just be the additional input columns, and nothing conceptually about the model actually changes.
```{python}