-
Notifications
You must be signed in to change notification settings - Fork 13
/
Copy pathgeneralized_linear_models.qmd
1276 lines (927 loc) · 52.8 KB
/
generalized_linear_models.qmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
717
718
719
720
721
722
723
724
725
726
727
728
729
730
731
732
733
734
735
736
737
738
739
740
741
742
743
744
745
746
747
748
749
750
751
752
753
754
755
756
757
758
759
760
761
762
763
764
765
766
767
768
769
770
771
772
773
774
775
776
777
778
779
780
781
782
783
784
785
786
787
788
789
790
791
792
793
794
795
796
797
798
799
800
801
802
803
804
805
806
807
808
809
810
811
812
813
814
815
816
817
818
819
820
821
822
823
824
825
826
827
828
829
830
831
832
833
834
835
836
837
838
839
840
841
842
843
844
845
846
847
848
849
850
851
852
853
854
855
856
857
858
859
860
861
862
863
864
865
866
867
868
869
870
871
872
873
874
875
876
877
878
879
880
881
882
883
884
885
886
887
888
889
890
891
892
893
894
895
896
897
898
899
900
901
902
903
904
905
906
907
908
909
910
911
912
913
914
915
916
917
918
919
920
921
922
923
924
925
926
927
928
929
930
931
932
933
934
935
936
937
938
939
940
941
942
943
944
945
946
947
948
949
950
951
952
953
954
955
956
957
958
959
960
961
962
963
964
965
966
967
968
969
970
971
972
973
974
975
976
977
978
979
980
981
982
983
984
985
986
987
988
989
990
991
992
993
994
995
996
997
998
999
1000
# Generalized Linear Models {#sec-glm}
![](img/chapter_gp_plots/gp_plot_4.svg){width=75% fig-align='center'}
```{r}
#| include: false
#| label: setup-glm-r
df_reviews = read_csv('data/movie_reviews.csv')
# for the by-hand option later
X = df_reviews |>
select(word_count, male = gender) |>
mutate(male = ifelse(male == 'male', 1, 0)) |>
as.matrix()
y = df_reviews$rating_good
```
```{python}
#| include: false
#| label: setup-glm-py
import pandas as pd
import numpy as np
np.set_printoptions(formatter={'float': lambda x: '{0:0.4f}'.format(x)})
pd.set_option('display.float_format', lambda x: '%.3f' % x)
df_reviews = pd.read_csv('data/movie_reviews.csv')
# for the by-hand option later
X = (
df_reviews[['word_count', 'gender']]
.rename(columns = {'gender': 'male'})
.assign(male = np.where(df_reviews[['gender']] == 'male', 1, 0))
)
y = df_reviews['rating_good']
```
What happens when your target variable isn't really something you feel comfortable modeling with a normal distribution? Maybe you've got a binary condition, like good or bad, or maybe you've got a skewed count of something, like the number of times a person that has been arrested has been reincarcerated. In these cases, you can use a linear regression, but it often won't get you exactly what you want in terms of predictive performance. Instead, you can *generalize* your approach to handle these scenarios.
**Generalized linear models** allow us to implement different probability distributions, taking us beyond the normal distribution that is assumed for linear regression. This allows us to use the *same* linear model framework that we've been using, but with different types of targets. As such, these models *generalize* the linear model to better capture the nuances of different types of feature-target relationships.
## Key Ideas {#sec-glm-key}
- A simple tweak to our previous approach allows us to generalize our linear model to account for other types of target data.
- Common distributions such as binomial, Poisson, and others can often improve model fit and interpretability.
- Getting familiar with just a couple distributions will allow you to really expand your modeling repertoire.
### Why this matters {#sec-glm-why}
The linear model is powerful on its own, but even more so when you realize you can extend it to many other data settings, some of which may have implicitly nonlinear feature-target relationships! When we want to classify observations, count them, or deal with proportions and other things, very simple tweaks of our standard linear model allow us to handle such situations.
### Helpful context {#sec-glm-good2know}
Generalized linear models are a broad class of models that extend the linear model to different distributions of the target variable. In general, you'd need to have a pretty good grasp of linear regression before getting too carried away here.
## Distributions & Link Functions {#sec-glm-distributions}
Remember how linear regression models really enjoy the whole Gaussian, i.e., 'normal', distribution scene? We saw that the essential form of the linear model can be expressed as follows. With probabilistic models such as these, the formula is generally expressed as $y | X, \theta \sim ...$, where X is the matrix of features (data) and $\theta$ the parameters estimated by the model. We simplify this as $y^*$ here.
$$
y| X, \beta, \sigma \sim \textrm{N}(\mu, \sigma^2)
$${#eq-linreg}
$$
y^* \sim \textrm{N}(\mu, \sigma^2)
$$
$$
\mu = \alpha + X\beta
$$
We create the linear combination of our features, and then we employ a normal distribution that uses that combination as the mean, which will naturally vary for each sample of data. However, this may not be the best approach in many cases. Instead, we can use some other distribution that potentially fits the data better. But often these other distributions don't have a direct link to our features, and that's where a **link function** comes in.
Think of the link function as a bridge between our features and the distribution we want to use. It lets us use a linear combination of features to predict the mean or other parameters of the distribution. As an example, we can use a log to link the mean to the linear predictor, or conversely, exponentiate the linear predictor to get the mean. In this example, the log is the link function and we use its inverse to map the linear predictor back to the mean.
More generically, we can write this as follows, with some parameter $\theta$, and where $g$ is the link function with $g^{-1}$ is its inverse. The link function and its inverse relates $x$ to $\theta$.
$$
g(\theta) = x
$${#eq-link}
$$
\theta = g^{-1}(x)
$$
If you know a distribution's 'canonical' link function, which is like the default for a given distribution, that is all the deeper you will probably ever need to go. At the end of the day, these link functions will link your model output to the parameters required for the distribution. The take-away here is that the link function describes how the mean or other parameters of interest are generated from the (linear) combination of features.
:::{.callout-note title='Conditional reminder' collapse='true'}
One thing to note, when we switch distributions for GLMs, we're still concerning ourselves with the *conditional* distribution of the target variable given the features. The distribution of the target variable itself is not changing per se, even though its nature, e.g., as a binary variable, is suggesting to us to try something that would allow us to produce a binary outcome from the model. But just like we don't assume the target itself is normal in a linear regression, here we are assuming that the conditional distribution of the target given the features is the distribution we are specifying.
:::
## Logistic Regression {#sec-glm-logistic}
As we've seen, you will often have a binary variable that you might want to use as a target -- it could be dead/alive, lose/win, quit/retain, etc. You might be tempted to use a linear regression, but you will quickly find that it's not the best option in that setting. So let's try something else.
### The binomial distribution {#sec-glm-binomial}
**Logistic regression** differs from linear regression mostly because it is used with a binary target instead of a continuous one as with linear regression. As a result, we typically assume that the target follows a **binomial distribution**. Unlike the normal distribution, which is characterized by its mean ($\mu$) and variance ($\sigma^2$), the binomial distribution is defined by the parameters: *p* (also commonly $\pi$) and a known value *n*. Here, *p* represents the probability of a specific event occurring (like flipping heads, winning a game, or defaulting on a loan), and *n* is the number of trials or attempts under consideration.
It's important to note that the binomial distribution, which is commonly employed in GLMs for logistic regression, doesn't just describe the probability of a single event that is observed or not. It actually represents the distribution of the number of successful outcomes in *n* trials, which can be greater than 1. In other words, *it's a count distribution* that tells us how many times we can expect the event to occur in a given number of trials.
Let's see how the binomial distribution looks with 100 trials and probabilities of 'success' at *p = * .25, .5, and .75:
:::{.content-visible when-format='html'}
```{r}
#| echo: false
#| label: fig-binomial
#| fig-cap: Binomial Distributions for Different Probabilities
set.seed(1234)
p_dat = tibble(
key = 1:100,
`.25` = dbinom(1:100, size = 100, prob = .25),
`.50` = dbinom(1:100, size = 100, prob = .5),
`.75` = dbinom(1:100, size = 100, prob = .75)
) |>
pivot_longer(cols = -key, names_to = 'prob', values_to = 'value')
# bc of longstanding density/count bug where grouping doesn't actually separate, and overlapping values 'stack', we have to do individual plots
p_dat |>
ggplot(aes(x = key, y = value, fill = prob)) +
geom_col(color = NA, position = position_dodge(), width = 2) +
scale_fill_manual(values = unname(okabe_ito)) +
labs(x = 'Number of Successes', y = 'Probability') +
guides(fill = guide_legend(title = 'Probability\nValues'))
# run as needed
# p_print = p_dat |>
# mutate(prob = paste0('p = ', prob)) |>
# ggplot(aes(x = key, y = value)) +
# geom_col(color = NA, width = 2) +
# facet_wrap(~prob, ncol=1, strip.position = 'right') +
# labs(x = 'Number of Successes', y = 'Probability') +
# guides(color = guide_legend(title = 'Probability\nValues')) +
# theme(
# # strip.placement = 'outside',
# strip.text.y.right = element_text(size = 12, angle = 0)
# )
# # p_print
# ggsave(
# 'img/glm-binomial-distribution.svg',
# p_print,
# width = 8,
# height = 6,
# )
```
:::
:::{.content-visible when-format='pdf'}
![Binomial Distributions for Different Probabilities](img/glm-binomial-distribution.svg)
:::
If we examine the distribution for a probability of .5, we will see that it is roughly centered over a total success count of 50 out of the 100 trials. This tells us that we have the highest probability of encountering 50 successes if we ran 100 trials. Shifting our attention to a .75 probability of success, we see that our distribution is centered over 75. In practice we probably end up with something around that value, but on average and over repeated runs of 100 trials, the value would be $p\cdot n$. Try it yourself.
:::{.panel-tabset}
##### R
Note that R switches 'size' and 'n' names relative to [numpy]{.pack}. `n` regards the number of values you want returned.
```{r}
#| label: binom-demo-r
#| results: hold
set.seed(123)
# produces a count whose mean is n*p
rbinom(n = 6, size = 100, prob = .75)
# produces a binary 0, 1 as seen in logistic regression target (with mean p)
rbinom(n = 6, size = 1, prob = .75)
```
##### Python
Note that [numpy]{.pack} switches 'size' and 'n' names relative to R. Here `n` is the same as depicted in the formulas later.
```{python}
#| label: binom-demo-py
#| results: hold
import numpy as np
np.random.seed(123)
# produces a count whose mean is n*p
np.random.binomial(n = 100, p = .75, size = 6)
# produces a binary 0, 1 as seen in logistic regression target (with mean p)
np.random.binomial(n = 1, p = .75, size = 6)
```
:::
Since we are dealing with a number of trials, it is worth noting that the binomial distribution is a discrete distribution. If we have any interest in knowing the probability for a number of successes we can use the following formula, where $n$ is the number of trials, $x$ is the number of successes, and $p$ is the probability of success:
$$
P(x) = \frac{n!}{(n-x)!x!}p^x(1-p)^{n-x}
$$ {#eq-binomial}
Now let's see how the binomial distribution relates to the linear model space:
$$y^* \sim \textrm{Binomial}(n, p) $$
$$
\textrm{logit}(p) = \alpha + X\beta
$$ {#eq-logreg-model}
In this case, we are using the *logit* function to map the linear combination of our features to the probability of success. The logit function is defined as:
$$\textrm{log}\frac{p}{1-p}$$
We are literally just taking the log of the odds of whichever label we're calling 'success' in the binomial sense. For example, if we are predicting the probability of a person subscribing to a membership, we might call 'subscribes' the 'success' label.
Now we can map this back to our model:
$$\textrm{log}\frac{p}{1-p} = \alpha + X\beta$$
And finally, we can take that logistic function and use the **inverse-logit** function to produce the probabilities.
$$p = \frac{\textrm{exp}(\alpha + X\beta)}{1 + \textrm{exp}(\alpha + X\beta)}$$
or equivalently:
$$p = \frac{1}{1 + \textrm{exp}(-(\alpha + X\beta))}$$
Whenever we get results for a logistic regression model, the default coefficients and predictions are almost always on the log odds scale. We usually exponentiate the coefficients to get the **odds ratio**. For example, if we have a coefficient of .5, we would say that for every one unit increase in the feature, the odds of the target being a 'success' increase by a factor of `exp(.5)` = `r round(exp(.5),1)`. And we can convert the predicted log odds to probabilities using the inverse-logit function. We explore this more in the next section.
### Probability, odds, and log odds {#sec-glm-binomial-prob}
Probability lies at the heart of a logistic regression model, so let's look more closely at the relationship between the probability and log odds. In our model, the log odds are produced by the linear combination of our features. Let's say we have a model that gives us those values for each observation. We can then convert them from the linear space to the (nonlinear) probability space with our inverse-logit function, which might look something like this.
:::{.content-visible when-format='html'}
```{r}
#| echo: false
#| label: fig-prob-log-odds
#| fig-cap: Log Odds and Probability Values
probability = seq(.0001, .9999, by = .01)
p_dat = tibble(log_odds = log(probability / (1- probability)), probability = probability)
p = ggplot(
p_dat,
aes(log_odds, probability)) +
geom_hline(yintercept = .5, linetype = 'dashed', linewidth = .75) +
geom_vline(xintercept = 0, linetype = 'dashed', linewidth = .75) +
geom_path(, linewidth = 1) +
coord_cartesian(x = c(-5, 5), y = c(0, 1)) +
labs(x = 'Log Odds', y = 'Probability')
p
ggsave(
'img/glm-prob-log-odds.svg',
p,
width = 8,
height = 6,
)
```
:::
:::{.content-visible when-format='pdf'}
![Log Odds and Probability Values](img/glm-prob-log-odds.svg)
:::
We can see that the probability of success approaches 0 when the log odds are increasingly negative, and approaches 1 when the log odds are increasingly positive. The shape is something like an S, which also tells us that we are not in linear space when we switch to probabilities.
Log odds have a nice symmetry around 0, where the probability of success is 0.5. Any value above 0 indicates a probability of success greater than 0.5, and any value below 0 indicates a probability of success less than 0.5. However, don't get too hung up on a .5 probability as being fundamentally important for any given problem.
As mentioned, logistic regression models usually report coefficients on the log-odds scale by default. The coefficients reflect the odds associated with predicted probabilities given the feature at different values one unit apart. Log-odds are not the most intuitive thing to interpret. For additional interpretability, we often convert the coefficients to odds ratios by exponentiating them. In logistic regression models, the odds ratio is the ratio of the odds of the outcome occurring (vs. not occurring) for a one unit increase in the feature.
<!-- MC thinks this still may be too much, given that it seems fairly rare to care about odds ratio outside of academic settings. Leave for now though -->
The following function will calculate the odds ratio for two probabilities, which we can think of as prediction outcomes for two values of a feature one unit apart.
:::{.panel-tabset}
##### R
```{r}
#| label: calculate-odds-ratio-r
#| results: 'hide'
calculate_odds_ratio = function(p_1, p_2) {
odds_1 = p_1 / (1 - p_1)
odds_2 = p_2 / (1 - p_2)
odds_ratio = odds_2 / odds_1
tibble(
value = c('1', '2'),
p = c(p_1, p_2),
odds = c(odds_1, odds_2),
log_odds = log(odds),
odds_ratio = c(NA, odds_ratio)
)
}
result_A = calculate_odds_ratio(.5, .6)
result_B = calculate_odds_ratio(.1, .2)
result_C = calculate_odds_ratio(.9, .8) # inverse of the .1, .2 example
result_A
```
##### Python
```{python}
#| label: calculate-odds-ratio-py
#| results: 'hide'
import pandas as pd
import numpy as np
def calculate_odds_ratio(p_1, p_2):
odds_1 = p_1 / (1 - p_1)
odds_2 = p_2 / (1 - p_2)
odds_ratio = odds_2 / odds_1
return pd.DataFrame({
'value': ['1', '2'],
'p': [p_1, p_2],
'odds': [odds_1, odds_2],
'log_odds': [np.log(odds_1), np.log(odds_2)],
'odds_ratio': [np.nan, odds_ratio]
})
result_A = calculate_odds_ratio(.5, .6)
result_B = calculate_odds_ratio(.1, .2)
result_C = calculate_odds_ratio(.9, .8) # inverse of the .1, .2 example
result_A
```
:::
```{r}
#| echo: false
#| label: tbl-odds-ratio
#| tbl-cap: Odds Ratios for Different Probabilities
bind_rows(list(A = result_A, B = result_B, C = result_C), .id = 'setting') |>
# group_by(setting) |>
gt(groupname_col = 'setting', row_group_as_column = TRUE) |>
tab_footnote(
'The odds are p / (1 - p)',
locations = cells_column_labels(columns = vars(odds)),
placement = 'left'
) |>
tab_footnote(
'The odds ratio refers to value 2 versus value 1',
locations = cells_column_labels(columns = vars(odds_ratio)),
placement = 'left'
) |>
tab_options(
footnotes.font.size = 12
)
```
In the table we see that even though each probability difference is the same, the odds ratio is different. Comparing A to B, the difference between a probability of .5 to .6 is not as much of a change on the odds (linear) scale as the difference between .1 to .2. The first setting is a 50% increase in the odds, whereas the second more than doubles the odds. However, the difference between .9 to .8 is the same probability difference as the difference between .1 to .2, as they reflect points that are the same distance from the boundary. The odds ratio for setting C is just the inverse of setting B. The following shows our previous plot with the corresponding settings shaded.
:::{.content-visible when-format='html'}
```{r}
#| echo: false
#| label: fig-prob-log-odds-shade
#| fig-cap: Comparison of Probability and Odds Differences
#| out-width: 100%
probability = seq(.0001, .9999, by = .01)
p_dat = tibble(log_odds = log(probability / (1- probability)), probability = probability)
p = ggplot(
p_dat,
aes(log_odds, probability)) +
geom_path() +
geom_ribbon(
xmin = -Inf,
aes(xmax=log_odds),
fill=okabe_ito['darkblue'],
alpha=0.2,
data = p_dat |> filter(between(probability, .5, .601))
) +
geom_ribbon(
ymin=-Inf,
aes(ymax=probability),
fill=okabe_ito['darkblue'],
alpha=0.2,
data = p_dat |> filter(between(probability, .5, .601))
) +
geom_ribbon(
xmin = -Inf,
aes(xmax=log_odds),
fill=okabe_ito['darkblue'],
alpha=0.2,
data = p_dat |> filter(between(probability, .1, .201))
) +
geom_ribbon(
ymin=-Inf,
aes(ymax=probability),
fill=okabe_ito['darkblue'],
alpha=0.2,
data = p_dat |> filter(between(probability, .1, .201))
) +
geom_ribbon(
xmin = -Inf,
aes(xmax=log_odds),
fill=okabe_ito['darkblue'],
alpha=0.2,
data = p_dat |> filter(between(probability, .8, .901))
) +
geom_ribbon(
ymin=-Inf,
aes(ymax=probability),
fill=okabe_ito['darkblue'],
alpha=0.2,
data = p_dat |> filter(between(probability, .8, .901))
) +
scale_y_continuous(breaks = seq(0, 1, .1)) +
coord_cartesian(x = c(-5, 5), y = c(0, 1)) +
labs(
x = 'Log Odds',
y = 'Probability',
caption = 'Shaded areas represent the probability differences in the table'
)
p
ggsave(
'img/glm-prob-log-odds-shade.svg',
p,
width = 8,
height = 6,
)
```
:::
:::{.content-visible when-format='pdf'}
![Comparison of Probability and Odds Differences](img/glm-prob-log-odds-shade.svg){width=100%}
:::
Odds ratios *might* be more interpretable to some, but since they are ratios of ratios, people have historically had a hard time with those as well. As shown in @tbl-odds-ratio, knowledge of the baseline rate is *required* for a good understanding of them. Furthermore, doubling the odds is not the same as doubling the probability, so we're left doing some mental calisthenics to interpret them. Odds ratios are often used in academic settings, but elsewhere they are not as common. The take-home message is that we can interpret our result in terms of odds (ratios of probabilities), log-odds (linear space), or as probabilities (nonlinear space), but it can take a little more effort than our linear regression setting[^moreor]. Our own preference is to stick with predicted probabilities, but it's good to have familiarity with odds ratios, as well as understand the purely linear aspect of the model.
[^moreor]: For more on interpreting odds ratios, see [this article](https://stats.oarc.ucla.edu/other/mult-pkg/faq/general/faq-how-do-i-interpret-odds-ratios-in-logistic-regression/).
### A logistic regression model {#sec-glm-binomial-model}
Now let's get our hands dirty and do a classification model using logistic regression. For our model, let's return to the movie review data, but now we'll use the binary `rating_good` ('good' vs. 'bad') as our target. Before we get to modeling, see if you can find out the frequency of 'good' and 'bad' reviews, and the probability of getting a 'good' review. We examine the relationship of `word_count` and `gender` features with the likelihood of getting a good rating.
:::{.panel-tabset}
##### R
```{r}
#| label: import-data-model-logistic-r
df_reviews = read_csv('https://tinyurl.com/moviereviewsdata')
model_logistic = glm(
rating_good ~ word_count + gender,
data = df_reviews,
family = binomial
)
summary(model_logistic)
```
##### Python
```{python}
#| label: import-data-model-logistic- py
import pandas as pd
import statsmodels.formula.api as smf
import statsmodels.api as sm
df_reviews = pd.read_csv('https://tinyurl.com/moviereviewsdata')
model_logistic = smf.glm(
'rating_good ~ word_count + gender',
data = df_reviews,
family = sm.families.Binomial()
).fit()
model_logistic.summary()
```
:::
Now that we have some results, we can see that they aren't too dissimilar from the linear regression output we obtained before. But, let's examine them more closely in the next section.
:::{.callout-note title='Binomial Regression' collapse='true'}
As noted, the binomial distribution is a count distribution. For a binary outcome, we can only have a 0 or 1 outcome for each 'trial', and the 'size' or 'n' for the binomial distribution is 1. In this case, we can also use the [Bernoulli distribution](https://en.wikipedia.org/wiki/Bernoulli_distribution) ($\textrm{Bern}(p)$). This does not require the number of trials, since, when the number of trials is 1 the factorial part of @eq-binomial drops out.
Many coming from a non-statistical background are not aware that their logistic model can actually handle count and/or proportional outcomes.
:::
### Interpretation and visualization {#sec-glm-binomial-interpret}
If our modeling goal is not just producing predictions, we need to know what those results mean. The coefficients that we get from our model are in the log odds scale. Interpreting log odds is difficult, but we can at least get a feeling for them directionally. A log odds of 0 (odds ratio of 1) would indicate no relationship between the feature and target. A positive log odds would indicate that an increase in the feature will increase the log odds of moving from 'bad' to 'good', whereas a negative log odds would indicate that an increase in the feature will decrease the log odds of moving from 'bad' to 'good'. On the log odds scale, the coefficients are symmetric as well, such that, e.g., a +1 coefficient denotes a similar increase in the log odds as a -1 coefficient denotes a decrease. As we demonstrated, we can exponentiate them to get the odds ratio for additional interpretability.
```{r}
#| echo: false
#| label: tbl-show-odds-ratio
#| tbl-cap: Raw Coefficients and Odds Ratios for a Logistic Regression
lo_result = parameters::model_parameters(model_logistic)
or_result = exp(lo_result$Coefficient)
me_logistic = marginaleffects::avg_slopes(model_logistic)
me_logistic_wc = me_logistic$estimate[me_logistic$term == 'word_count']
lo_result |>
as_tibble() |>
select(Parameter, Coefficient) |>
mutate(`Exp. Coef (OR)` = exp(Coefficient)) |>
gt()
```
The intercept provides a baseline odds of a 'good' review when word count is 0 and gender is 'female'. From there, we see that we've got a negative raw coefficient and odds ratio of `r round(or_result[2], 2)` for the word count variable. We have a positive raw coefficient and `r round(or_result[3], 2)` odds ratio for the male variable. This means that for every one unit increase in word count, the odds of a 'good' review decreases by about `r 100*round(1-or_result[2], 2)`%. Males are associated with an odds of a 'good' review that is `r 100*round(or_result[3] -1, 2)`% higher than females.
We feel it is much more intuitive to interpret things on the probability scale, so we'll get predicted probabilities for different values of the features. The way we do this is through the (inverse) link function, which will convert our log odds of the linear predictor to probabilities. We can then look at specific predictions, calculate marginal effects, or plot these probabilities to see how they change with the features. For the word count feature in the following visualization, we hold gender at the reference group ('female'), and for the gender feature, we hold word count at its mean. In addition we convert the probability to the percentage chance of a 'good' review.
:::{.content-visible when-format='html'}
```{r}
#| echo: false
#| label: fig-logistic-regression-word-count
#| fig-cap: Model Predictions for Word Count Feature
#| out-width: 100%
p_prob = ggeffects::ggpredict(model_logistic, terms = c('word_count')) |>
plot(color = okabe_ito['darkblue'], use_theme = FALSE) +
annotate(geom = 'text', x = 20, y = plogis(1), label = 'Not so much!') +
scale_fill_manual(values = okabe_ito) +
lims(y = c(0.02, .9), x = c(0, 32)) +
scale_y_continuous(labels = scales::percent) +
labs(x = 'Word Count', y = '% Good Rating', title = '')
p_dat = p_prob$data |>
as_tibble()
logit_preds = predict(
model_logistic,
newdata = as_tibble(p_dat) |> mutate(word_count = x, gender = 'female'),
se.fit = TRUE
)
p_dat = p_dat |>
mutate(
logit = logit_preds$fit,
ll = logit_preds$fit - 1.96*logit_preds$se.fit,
ul = logit_preds$fit + 1.96*logit_preds$se.fit
)
p_logit = p_dat |>
ggplot(aes(x = x, y = logit)) +
geom_ribbon(aes(ymin = ll, ymax = ul), fill = okabe_ito['darkblue'], alpha = .2) +
geom_line(color = okabe_ito['darkblue']) +
annotate(geom = 'text', x = 20, y = 1, label = 'Linear!') +
lims(y = c(qlogis(0.02), qlogis(.9)), x = c(0, 32)) +
labs(x = 'Word Count', y = 'Log Odds', title = '')
# ggeffects::ggpredict(model_logistic, terms = c('word_count')) |>
# plot(color = okabe_ito['darkblue'], use_theme = FALSE) +
# scale_fill_manual(values = unname(okabe_ito)) +
# scale_y_continuous(
# "rating_good",
# sec.axis = sec_axis(~ qlogis(.), name = "Log Odds")
# ) +
# labs(x = 'Word Count', y = '% Good Rating', title = 'Logistic Regression Predictions')
(p_logit + p_prob) +
plot_layout(guides = 'collect') +
plot_annotation(
title = 'Logistic Regression Predictions',
caption = "Note: The shaded area represents the 95% confidence interval.\nWord count relationship shown is with Gender set at the reference level ('Female')."
) &
theme(
axis.title.y = element_text(size = 12),
axis.title.x = element_text(size = 12),
)
ggsave(
'img/glm-word-count-prob.svg',
width = 8,
height = 6,
)
```
:::
:::{.content-visible when-format='pdf'}
![Model Predictions for Word Count Feature](img/glm-word-count-prob.svg){#fig-logistic-regression-word-count}
:::
In @fig-logistic-regression-word-count, we can see a clear negative relationship between the number of words in a review and the probability of being considered a 'good' movie. As we get over 20 words, the predicted probability of being a 'good' movie is less than .2. We also calculated the average marginal effect (@sec-avg-marginal-effects), or average slope, for word count. It suggests a `r round(me_logistic_wc, 2)` `r word_sign(me_logistic_wc, c('increase', 'decrease'))` in the probability of a 'good' rating for each additional word in the review (on average).
We also see the increase in the chance for a good rating with males vs. females, but our model results suggest this is not a statistically significant difference.
:::{.content-visible when-format='html'}
```{r}
#| echo: false
#| label: fig-logistic-regression-gender
#| fig-cap: Model Predictions for Gender
#| out-width: 100%
p = ggeffects::ggpredict(model_logistic, terms = c('gender')) |>
plot(
color = c(okabe_ito[['orange']], okabe_ito[['darkblue']]),
dot_size = 10,
dot_alpha = 1, # ignored, so added back with geom_point
line_size = 2,
use_theme = FALSE
) +
geom_point(color = okabe_ito['darkblue'], alpha = 1, size = 10) +
scale_fill_manual(values = unname(okabe_ito)) +
scale_x_continuous(breaks = c(1, 2), labels = c('Female', 'Male')) +
scale_y_continuous(limits = c(.5, .65), labels = scales::percent) +
labs(
x = '',
y = '% Good Rating',
title = 'Logistic Regression Predictions',
caption = "Note: Lines represent the 95% confidence interval.\nGender relationship shown is with Word Count fixed at the mean."
)
p
ggsave(
'img/glm-gender-prob.svg',
width = 8,
height = 6,
)
```
:::
:::{.content-visible when-format='pdf}
![Model Predictions for Gender](img/glm-gender-prob.svg){#fig-logistic-regression-gender}
:::
In the end, whether you think these differences are practically significant is up to you. And you'll still need to do the standard model exploration to further understand the model (@sec-knowing has lots of detail on this). But this is a good start.
:::{.callout-note title='A Note on $R^2$ for Logistic Regression' collapse='true'}
Logistic regression does not have an $R^2$ value in the way that a linear regression model does. Instead, there are pseudo-$R^2$ values, but they are not the same as the $R^2$ value that you are used to seeing (@ucla_advanced_research_computing_faq_2023).
:::
## Poisson Regression {#sec-glm-poisson}
**Poisson regression** also belongs to the class of generalized linear models, and is used specifically when you have a count variable as your target. After logistic regression for binary outcomes, Poisson regression is probably the next most common type of generalized linear model you will encounter. Unlike continuous targets, a count starts at 0 and can only be a whole number. Often it is naturally skewed as well, so we'd like a model well-suited to this situation. Unlike the binomial, there is no concept of number of trials, just the count of events.
### The Poisson distribution {#sec-glm-poisson-distribution}
The Poisson distribution is very similar to the binomial distribution, because the binomial is also a count distribution, and in fact generalizes the poisson[^poisbin]. The Poisson has a single parameter noted as $\lambda$, which makes it the simplest model setting we've seen so far[^novariance]. Conceptually, this rate parameter is going to estimate the expected number of events during a time interval. This can be accidents in a year, pieces produced in a day, or hits during the course of a baseball season.
[^novariance]: Neither the binomial nor the Poisson have a variance parameter to estimate, as the variance is determined by the mean. This is in contrast to a normal distribution model, where the variance is an estimated parameter. For the Poisson, the variance is equal to the mean, and for the binomial, the variance is equal to $n*p*(1-p)$. The Poisson assumption of equal variance rarely holds up in practice, so people often use the **negative binomial** distribution instead.
[^poisbin]: If your binomial setting has a very large number of trials relative to the number of successes, which amounts to very small proportions $p$, you would find that the binomial distribution would converge to the Poisson distribution.
Let's see what the particular distribution might look like for different rates. We can see that for low count values, the distribution is skewed to the right, but note how the distribution becomes more symmetric and bell-shaped as the rate increases[^nopoisneeded]. You might also be able to tell that the variance increases along with the mean, and in fact, the variance is equal to the mean for the Poisson distribution.
[^nopoisneeded]: From a modeling perspective, for large mean counts you can just go back to using the normal distribution if you prefer without much losing much predictively and a gaining in interpretability.
:::{.content-visible when-format='html'}
```{r}
#| echo: false
#| label: fig-poisson-distribution-rates
#| fig-cap: Poisson Distributions for Different Rates
#| out-width: 100%
p_dat = tibble(
key = 1:28,
`4` = dpois(1:28, 4),
`8` = dpois(1:28, 8),
`12` = dpois(1:28, 12)
# data.frame(x = 1:28, y = dpois(1:28, (1 / 7) * 28)),
# data.frame(x = 1:28, y = dpois(1:28, (3 / 7) * 28)),
# data.frame(x = 1:28, y = dpois(1:28, (5 / 7) * 28)),
# .id = 'count'
) |>
pivot_longer(cols = -key, names_to = 'rate', values_to = 'value')
# run as needed
# p_print = p_dat |>
# mutate(rate = fct_inorder(paste0('mean = ', rate))) |>
# ggplot(aes(x = key, y = value)) +
# geom_histogram(binwidth = 1, stat='identity') +
# facet_wrap(~rate, ncol=1, strip.position = 'right') +
# labs(
# x = 'Number of Observed Events',
# y = 'Probability',
# caption = 'Different rates of the Poisson distribution'
# ) +
# # guides(color = guide_legend(title = 'Probability\nValues')) +
# theme(
# # strip.placement = 'outside',
# strip.text.y.right = element_text(size = 12, angle = 0)
# )
# ggsave(
# plot = p_print,
# 'img/glm-poisson-distribution-rates.svg',
# width = 8,
# height = 6,
# )
# bc of longstanding density/count bug where grouping doesn't actually separate, and overlapping values 'stack', we have to do individual plots
p_dat = bind_rows(
data.frame(x = 1:28, y = dpois(1:28, (1 / 7) * 28)),
data.frame(x = 1:28, y = dpois(1:28, (3 / 7) * 28)),
data.frame(x = 1:28, y = dpois(1:28, (5 / 7) * 28)),
.id = 'count'
)
p = p_dat |>
mutate(
count = c((1 / 7) * 28, (3 / 7) * 28, (5 / 7) * 28)[as.integer(count)],
count = factor(round(count))
) |>
ggplot(aes(x, y)) +
geom_col(
aes(fill = count, group = count),
color = NA,
alpha = .5,
position = position_identity() # due to geom_col bug that incorrectly stacks data
) +
scale_fill_manual(values = unname(okabe_ito[c(1,5,4)])) +
labs(
x = 'Number of Observed Events',
y = 'Probability',
caption = 'Different rates of the Poisson distribution'
)
p
```
:::
:::{.content-visible when-format='pdf'}
![Poisson Distributions for Different Rates](img/glm-poisson-distribution-rates.svg)
:::
:::{.callout-note title='More Poisson' collapse='true'}
A cool thing about these distributions is that they can deal with different *exposure* rates. They can also be used to model inter-arrival times and time-until-events.
:::
Let's make a new variable that will count the number of times a person uses a personal pronoun word in their review. We'll use it as our target variable and see how it relates to the number of words and gender in a review as we did before.
:::{.panel-tabset}
##### R
```{r}
#| label: create-poss-pronoun-r
#| fig-show: 'hide'
df_reviews$poss_pronoun = stringr::str_count(
df_reviews$review_text,
'\\bI\\b|\\bme\\b|\\b[Mm]y\\b|\\bmine\\b|\\bmyself\\b'
)
hist(df_reviews$poss_pronoun)
```
##### Python
```{python}
#| label: create-poss-pronoun-py
#| fig-show: 'hide'
df_reviews['poss_pronoun'] = (
df_reviews['review_text']
.str.count('\\bI\\b|\\bme\\b|\\b[Mm]y\\b|\\bmine\\b|\\bmyself\\b')
)
df_reviews['poss_pronoun'].hist()
```
:::
:::{.content-visible when-format='html'}
```{r}
#| echo: false
#| label: fig-poss-pronoun
#| fig-cap: Distribution of the Personal Pronouns Seen Across Reviews
p = ggplot(df_reviews, aes(poss_pronoun)) +
geom_histogram(binwidth = .5, fill = okabe_ito[1]) +
labs(x = 'Number of Personal Pronouns', y = 'Count')
p
ggsave(
'img/glm-poss-pronoun.svg',
p,
width = 8,
height = 6,
)
```
:::
:::{.content-visible when-format='pdf'}
![Distribution of Personal Pronouns Seen Across Reviews](img/glm-poss-pronoun.svg)
:::
### A Poisson regression model {#sec-glm-poisson-model-fitting}
Recall that GLM specific distributions have a default link function. The Poisson distribution uses a log link function:
$$
y^* \sim \textrm{Poisson}(\lambda)
$${#eq-poisson}
$$
\text{log}(\lambda) = \alpha + X\beta
$$
[^defaultlink]: It is not uncommon in many disciplines to use different link functions for logistic models, but the log link is always used for Poisson models.
Using the log link keeps the outcome non-negative when we use the inverse of it. For model fitting with standard functions, all we have to do is switch the family from 'binomial' to 'poisson'. As the default link is the 'log', we don't have to specify it explicitly[^defaultlink].
So in this model we'll predict the number of personal pronouns used in a review, and we'll use word count and gender as our features like we did with the logistic model.
:::{.panel-tabset}
##### R
```{r}
#| label: model-poisson-r
model_poisson = glm(
poss_pronoun ~ word_count + gender,
data = df_reviews,
family = poisson
)
summary(model_poisson)
# exponentiate the coefficients to get the rate ratio
# exp(model_poisson$coefficients)
```
##### Python
```{python}
#| label: model-poisson-py
model_poisson = smf.glm(
formula = 'poss_pronoun ~ word_count + gender',
data = df_reviews,
family = sm.families.Poisson()
).fit()
model_poisson.summary()
# exponentiate the coefficients to get the rate ratio
# np.exp(model_poisson.params)
```
```{python}
#| echo: false
#| label: model-poisson-py-for-real
# sigh https://github.com/quarto-dev/quarto-cli/issues/9389 This is done to suppress the output
model_poisson = smf.glm(
formula = 'poss_pronoun ~ word_count + gender',
data = df_reviews,
family = sm.families.Poisson()
).fit()
```
:::
### Interpretation and visualization {#sec-glm-poisson-interpret}
Now let's check out the results more deeply. Like with logistic, we can exponentiate the coefficients to get what's now referred to as the [rate ratio](https://en.wikipedia.org/wiki/Rate_ratio#:~:text=In%20epidemiology%2C%20a%20rate%20ratio,any%20given%20point%20in%20time.). This is the ratio of the rate of the outcome occurring for a one unit increase in the feature.
```{r}
#| echo: false
#| label: tbl-show-rate-ratio
#| tbl-cap: Rate Ratios for a Poisson Regression
log_result = parameters::model_parameters(model_poisson)
rr_result = exp(log_result$Coefficient)
me_poisson = marginaleffects::avg_slopes(model_poisson)
me_poisson_wc = me_poisson$estimate[me_poisson$term == 'word_count']
log_result |>
as_tibble() |>
select(Parameter, Coefficient) |>
mutate(
`Exp. Coef` = exp(Coefficient),
) |>
gt(decimals = 2)
```
For this model, an increase in one review word leads to an expected log count `r word_sign(log_result$Coefficient[2], c('increase', 'decrease'))` of ~`r round(log_result$Coefficient[2], 2)`. We can exponentiate this to get `r round(rr_result[2], 2)`, and this tells us that every added word in a review gets us a `r scales::label_percent()(round(rr_result[2], 2) - 1) ` `r word_sign(log_result$Coefficient[2], c('increase', 'decrease'))` in the number of possessive pronouns. This is probably not a surprising result - wordier stuff has more types of words! In addition, the average marginal effect (@sec-avg-marginal-effects), or average slope, for word count suggested a `r round(me_poisson_wc, 2)` `r word_sign(me_poisson_wc, c('increase', 'decrease'))` in the number of possessive pronouns per word on average. A similar, though slightly smaller, increase is seen for males relative to females, but, as with our previous model, this is not statistically significant.
But as usual, the visualization tells the better story. Here is the relationship for word count. Notice that the predictions are not discrete like the raw count, but continuous. This is because predictions here are the same as with our other models, and reflect the expected, or average, count that we'd predict with this data.
:::{.content-visible when-format='html'}
```{r}
#| echo: false
#| label: fig-poisson-regression
#| fig-cap: Poisson Model Predictions for Word Count Feature
p = ggeffects::ggpredict(model_poisson, terms = 'word_count') |>
plot(colors = unname(okabe_ito['darkblue']), use_theme = FALSE) +
scale_y_continuous(breaks = 1:5) +
labs(
x = 'Word Count',
y = 'N Pronouns',
title = 'Poisson Regression Predictions',
caption = 'Word count relationship shown is with gender set to the reference level (female)'
)
p
ggsave(
'img/glm-poisson-word-count.svg',
p,
width = 8,
height = 6,
)
```
:::
:::{.content-visible when-format='pdf'}
![Poisson Model Predictions for Word Count Feature](img/glm-poisson-word-count.svg){#fig-poisson-regression}
:::
With everything coupled together, we have an interpretable coefficient for `word_count`, a clear plot, and adequate model fit. Therefore, we might conclude that there is a positive relationship between the number of words in a review and the number of times a person uses a personal possessive.
:::{.callout-note title='Nonlinear linear models?' collapse='true'}
You'll note again that our effects for word count in the logistic (@fig-logistic-regression-word-count) and Poisson (@fig-poisson-regression) models were not exactly the straightest of lines. Once we're on the probability and count scales, we're not going to see the same linear relationships that we might expect from a linear regression model due to the transformation. If we plot the effect on the log-odds or log-count scale, we're back to straight lines, as demonstrated with the logistic model. This is a first taste in how the linear model can be used to get at nonlinear relationships depending on the scale we focus on. More explicit nonlinear relationships are the focus of @sec-lm-extend.
:::
## DIY {#sec-glm-objective}
If we really want to demystify the modeling process for GLMs, let's create our own function to estimate the coefficients. We can use maximum likelihood estimation to estimate the parameters of our model, which is the approach used by standard package functions. Feel free to skip this part if you only wanted the basics, but for even more information on maximum likelihood estimation, see @sec-estim-maxlike where we take a deeper dive into the topic and with a similar function. The following code is a simple and conceptual version of what goes on behind the scenes with 'glm' type functions.
:::{.panel-tabset}
##### R
```{r}
#| label: glm-ml-r
glm_simple = function(par, X, y, family = 'binomial') {
# add a column for the intercept
X = cbind(1, X)
# Calculate the linear predictor
mu = X %*% par # %*% is matrix multiplication
# get the likelihood for the binomial or poisson distribution
if (family == 'binomial') {
# Convert to a probability ('logit' link/inverse)
p = 1 / (1 + exp(-mu))
L = dbinom(y, size = 1, prob = p, log = TRUE)
}
else if (family == 'poisson') {
# Convert to a count ('log' link/inverse)
p = exp(mu)
L = dpois(y, lambda = p, log = TRUE)
}
# return the negative sum of the log-likelihood (for minimization)
value = -sum(L)
return(value)
}
```
##### Python
```{python}
#| label: glm-ml-py
from scipy.stats import poisson, binom