---
title: "Chapter 6"
subtitle: "Logistic Regression"
output:
  pdf_document:
    number_sections: yes
  html_document: default
---
# Logistic Regression {#ch-logreg}
```{r,include=F}
if(knitr::is_html_output()){options(knitr.table.format = "html")} else {options(knitr.table.format = "latex")}
```
`r if (knitr:::is_html_output()) {"\\newcommand{\\lik}{\\mathrm{Lik}} \\newcommand{\\Lik}{\\mathrm{Lik}}"}`
## Learning Objectives
- Identify a binomial random variable and assess the validity of the binomial assumptions.
- Write a generalized linear model for binomial responses in two forms, one as a function of the logit and one as a function of $p$.
- Explain how fitting a logistic regression differs from fitting a linear least squares regression (LLSR) model.
- Interpret estimated coefficients in logistic regression.
- Differentiate between logistic regression models with binary and binomial responses.
- Use the residual deviance to compare models, to test for lack-of-fit when appropriate, and to check for unusual observations or needed transformations.
```{r load_packages6, message = FALSE}
# Packages required for Chapter 6
library(gridExtra)
library(mnormt)
library(lme4)
library(knitr)
library(kableExtra)   # needed for kable_styling() and add_footnote() below
library(pander)
library(tidyverse)
```
## Introduction to Logistic Regression
Logistic regression is characterized by research questions with binary (yes/no or success/failure) or binomial (number of yesses or successes in $n$ trials) responses:
a. Are students with poor grades more likely to binge drink?
b. Is exposure to a particular chemical associated with a cancer diagnosis?
c. Is the number of votes for a congressional candidate associated with the amount of campaign contributions?
**Binary Responses:** Recall from Section \@ref(sec-binary) that binary responses take on only two values: success (Y=1) or failure (Y=0), Yes (Y=1) or No (Y=0), etc. Binary responses are ubiquitous; they are one of the most common types of data that statisticians encounter. We are often interested in modeling the probability of success $p$ based on a set of covariates, although sometimes we wish to use those covariates to classify a future observation as a success or a failure.
Examples (a) and (b) above would be considered to have binary responses (Does a student binge drink? Was a patient diagnosed with cancer?), assuming that we have a unique set of covariates for each individual student or patient.
**Binomial Responses:** Also recall from Section \@ref(sec-binomial) that binomial responses are the number of successes in $n$ identical, independent trials with constant probability $p$ of success. A sequence of independent trials like this with the same probability of success is called a **Bernoulli process**. \index{Bernoulli process} As with binary responses, our objective in modeling binomial responses is to quantify how the probability of success, $p$, is associated with relevant covariates.
Example (c) above would be considered to have a binomial response, assuming we have vote totals at the congressional district level rather than information on individual voters.
### Logistic Regression Assumptions
Much like ordinary least squares (OLS), using **logistic regression** \index{logistic regression} to make inferences requires model assumptions.
1. __Binary Response__ The response variable is dichotomous (two possible responses) or the sum of dichotomous responses.
2. __Independence__ The observations must be independent of one another.
3. __Variance Structure__ By definition, the variance of a binomial random variable is $np(1-p)$, so that variability is highest when $p=.5$.
4. __Linearity__ The log of the odds, $\log\left(\frac{p}{1-p}\right)$, must be a linear function of $x$. This will be explained further in the context of the first case study.
### A Graphical Look at Logistic Regression
```{r, OLSlogistic, fig.align="center", out.width="60%", fig.cap='Linear vs. logistic regression models for binary response data.', echo=FALSE, warning=FALSE, message=FALSE}
set.seed(0)
dat <- tibble(x=runif(200, -5, 10),
p=exp(-2+1*x)/(1+exp(-2+1*x)),
y=rbinom(200, 1, p),
y2=.3408+.0901*x,
logit=log(p/(1-p)))
dat2 <- tibble(x = c(dat$x, dat$x),
y = c(dat$y2, dat$p),
`Regression model` = c(rep("linear", 200),
rep("logistic", 200)))
ggplot() +
geom_point(data = dat, aes(x, y)) +
geom_line(data = dat2, aes(x, y, linetype = `Regression model`))
#ggplot(dat, aes(x = x)) +
# geom_point(aes(y = y)) +
# geom_smooth(aes(y = y, color = "blue"), method = "lm",
# linetype = 1, se=FALSE) +
# geom_line(aes(y = p, color = "red"), linetype = 2) +
# scale_colour_manual(name = 'Regression model',
# values = c('blue', 'red'),
# labels = c('linear', 'logistic'), guide = 'legend')
```
Figure \@ref(fig:OLSlogistic) illustrates a data set with a binary (0 or 1) response (Y) and a single continuous predictor (X). The solid line is a linear regression fit with least squares to model the probability of a success (Y=1) for a given value of X. With a binary response, the line doesn't fit the data well, and it produces predicted probabilities below 0 and above 1. On the other hand, the logistic regression fit (dashed curve) with its typical "S" shape follows the data closely and always produces predicted probabilities between 0 and 1. For these and several other reasons detailed in this chapter, we will focus on the following model for logistic regression with binary or binomial responses:
\begin{equation*}
\log\left(\frac{p_i}{1-p_i}\right)=\beta_0+\beta_1 x_i
\end{equation*}
where the observed values $Y_i \sim$ binomial with $p=p_i$ for a given $x_i$ and $n=1$ for binary responses.
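As a preview of the fitting tools used throughout this chapter, the following sketch fits this model to the simulated data behind Figure \@ref(fig:OLSlogistic) using `glm()`. It assumes the `dat` tibble created in the previous chunk is available, and the object name `prev_fit` is only for illustration; the case studies below develop the details.
```{r, eval=FALSE}
# A minimal sketch: fit the dashed logistic curve from the figure with glm()
prev_fit <- glm(y ~ x, family = binomial, data = dat)
coef(prev_fit)            # estimates of beta_0 and beta_1 on the logit (log-odds) scale
range(fitted(prev_fit))   # fitted probabilities always fall between 0 and 1
```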
## Case Studies Overview
We consider three case studies in this chapter. The first two involve binomial responses (Soccer Goalkeepers and Reconstructing Alabama), while the last case uses a binary response (Trying to Lose Weight). Even though binary responses are much more common, their models have a very similar form to binomial responses, so the first two case studies will illustrate important principles that also apply to the binary case. Here are the statistical concepts you will encounter for each case study.
The soccer goalkeeper data can be written in the form of a 2 $\times$ 2 table. This example is used to describe some of the underlying theory for logistic regression. We demonstrate how binomial probability mass functions (pmfs) can be written in one-parameter exponential family form, from which we can identify the canonical link as in Chapter \@ref(ch-glms). Using the canonical link, we write a Generalized Linear Model for binomial counts and determine corresponding MLEs for model coefficients. Interpretation of the estimated parameters involves a fundamental concept, the odds ratio.
The Reconstructing Alabama case study is another binomial example which introduces the notion of deviances, which are used to compare and assess models. Thus, we will investigate hypothesis tests and confidence intervals, including issues of interaction terms, overdispersion, and lack-of-fit. We will also check the assumptions of logistic regression using empirical logit plots and deviance residuals.
The last case study addresses why teens try to lose weight. Here the response is a binary variable which allows us to analyze individual level data. The analysis builds on concepts from the previous sections in the context of a random sample from CDC's Youth Risk Behavior Survey (YRBS).
## Case Study: Soccer Goalkeepers
Does the probability of a save in a soccer match depend upon whether the goalkeeper's team is behind or not? @Roskes2011 looked at penalty kicks in the men's World Cup soccer championships from 1982 to 2010, and they assembled data on 204 penalty kicks during shootouts. The data for this study is summarized in Table \@ref(tab:table1chp6).
```{r,include=FALSE}
State <- c("Behind","Not Behind","Total")
Saves <- c(2,39,41)
Scores <- c(22,141,163)
Tot <- c(24,180,204)
```
```{r, table1chp6, echo=FALSE}
#tab:soccerdata
table1chp6 <- data.frame(State,Saves,Scores,Tot)
colnames(table1chp6) <- c(" ","Saves","Scores","Total")
kable(table1chp6, booktabs=T,
caption="Soccer goalkeepers' penalty kick saves when their team is and is not behind.") %>%
kable_styling(full_width = F) %>%
add_footnote("(Source: Roskes et al. 2011.)", notation = "none")
```
### Modeling Odds
Odds are one way to quantify a goalkeeper's performance. Here the odds that a goalkeeper makes a save when his team is behind is 2 to 22 or 0.09 to 1. Or equivalently, the odds that a goal is scored on a penalty kick is 22 to 2 or 11 to 1. An odds of 11 to 1 tells you that a shooter whose team is ahead will score 11 times for every 1 shot that the goalkeeper saves. When the goalkeeper's team is not behind the odds a goal is scored is 141 to 39 or 3.61 to 1. We see that the odds of a goal scored on a penalty kick are better when the goalkeeper's team is behind than when it is not behind (i.e., better odds of scoring for the shooter when the shooter's team is ahead). We can compare these odds by calculating the __odds ratio__ \index{odds ratio} (OR), 11/3.61 or 3.05, which tells us that the *odds* of a successful penalty kick are 3.05 times higher when the shooter's team is leading.
In our example, it is also possible to estimate the probability of a goal, $p$, for either circumstance. When the goalkeeper's team is behind, the probability of a successful penalty kick is $p$ = 22/24 or 0.833. We can see that the ratio of the probability of a goal scored divided by the probability of no goal is $(22/24)/(2/24)=22/2$ or 11, the odds we had calculated above. The same calculation can be made when the goalkeeper's team is not behind. In general, we now have several ways of finding the odds of success under certain circumstances:
\[\textrm{Odds} = \frac{\# \textrm{successes}}{\# \textrm{failures}}=
\frac{\# \textrm{successes}/n}{\# \textrm{failures}/n}=
\frac{p}{1-p}.\]
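The following short sketch simply re-expresses the arithmetic above in R, using the counts from Table \@ref(tab:table1chp6); the object names are only for illustration.
```{r, eval=FALSE}
# Odds and odds ratio computed directly from the 2 x 2 table
odds_behind     <- 22 / 2       # odds a goal is scored when the keeper's team is behind (11 to 1)
odds_not_behind <- 141 / 39     # odds a goal is scored when the team is not behind (3.61 to 1)
odds_behind / odds_not_behind   # odds ratio, approximately 3.05
```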
### Logistic Regression Models for Binomial Responses
We would like to model the odds of success; however, odds are strictly positive. Therefore, similar to modeling log($\lambda$) in Poisson regression, which allowed the response to take on values from $-\infty$ to $\infty$, we will model the log(odds), the __logit__, \index{logit} in logistic regression. Logits will be suitable for modeling with a linear function of the predictors:
\begin{equation*}
\log\left(\frac{p}{1 - p}\right)=\beta_0+\beta_1X
\end{equation*}
Models of this form are referred to as __binomial regression models__, \index{binomial logistic regression} or more generally as __logistic regression models__. \index{logistic regression} Here we provide intuition for using and interpreting logistic regression models, and then in the short optional section that follows, we present rationale for these models using GLM theory.
In our example we could define $X=0$ for not behind and $X=1$ for behind and fit the model:
\begin{equation}
\log\left(\frac{p_X}{1-p_X}\right)=\beta_0 +\beta_1X
(\#eq:logitXform)
\end{equation}
where $p_X$ is the probability of a successful penalty kick given $X$.
So, based on this model, the log odds of a successful penalty kick when the goalkeeper's team is not behind is:
\[
\log\left(\frac{p_0}{1-p_0}\right) =\beta_0 \nonumber,
\]
and the log odds when the team is behind is:
\[
\log\left(\frac{p_1}{1-p_1}\right)=\beta_0+\beta_1. \nonumber
\]
We can see that $\beta_1$ is the difference between the log odds of a successful penalty kick between games when the goalkeeper's team is behind and games when the team is not behind. Using rules of logs:
\begin{equation*}
\beta_1 = (\beta_0 + \beta_1) - \beta_0 =
\log\left(\frac{p_1}{1-p_1}\right) - \log\left(\frac{p_0}{1-p_0}\right) =
\log\left(\frac{p_1/(1-p_1)}{p_0/{(1-p_0)}}\right).
\end{equation*}
Thus $e^{\beta_1}$ is the ratio of the odds of scoring when the goalkeeper's team is behind compared to the odds of scoring when the team is not behind. In general, *exponentiated coefficients in logistic regression are odds ratios (OR)*. A general interpretation of an OR is the odds of success for group A compared to the odds of success for group B---how many times greater the odds of success are in group A compared to group B.
The logit model (Equation \@ref(eq:logitXform)) can also be re-written in a __probability form__:
\begin{equation*}
p_X=\frac{e^{\beta_0+\beta_1X}}{1+e^{\beta_0+\beta_1X}}
\end{equation*}
which can be re-written for games when the goalkeeper's team is behind as:
\begin{equation}
p_1=\frac{e^{\beta_0+\beta_1}}{1+e^{\beta_0+\beta_1}}
(\#eq:pBehindform)
\end{equation}
and for games when the goalkeeper's team is not behind as:
\begin{equation}
p_0=\frac{e^{\beta_0}}{1+e^{\beta_0}}
(\#eq:pNotBehindform)
\end{equation}
We use likelihood methods to estimate $\beta_0$ and $\beta_1$. As we had done in Chapter \@ref(ch-beyondmost), we can write the likelihood for this example in the following form:
\[\Lik(p_1, p_0) = {24 \choose 22}p_1^{22}(1-p_1)^{2}
{180 \choose 141}p_0^{141}(1-p_0)^{39}\]
Our interest centers on estimating $\hat{\beta_0}$ and $\hat{\beta_1}$, not $p_1$ or $p_0$. So we replace $p_1$ in the likelihood with an expression for $p_1$ in terms of $\beta_0$ and $\beta_1$ as in Equation \@ref(eq:pBehindform). Similarly, $p_0$ in Equation \@ref(eq:pNotBehindform) involves only $\beta_0$. After removing constants, the new likelihood looks like:
\begin{equation*}
\begin{gathered}
\Lik(\beta_0,\beta_1) \propto \\
\left( \frac{e^{\beta_0+\beta_1}}{1+e^{\beta_0+\beta_1}}\right)^{22}\left(1- \frac{e^{\beta_0+\beta_1}}{1+e^{\beta_0+\beta_1}}\right)^{2}
\left(\frac{e^{\beta_0}}{1+e^{\beta_0}}\right)^{141}\left(1-\frac{e^{\beta_0}}{1+e^{\beta_0}}\right)^{39}
\end{gathered}
\end{equation*}
Now what? Fitting the model means finding estimates of $\beta_0$ and $\beta_1$, but familiar calculus-based methods for maximizing the likelihood do not yield closed-form solutions here. Instead, we consider all possible combinations of $\beta_0$ and $\beta_1$. That is, we will pick the pair of values for $\beta_0$ and $\beta_1$ that yields the largest likelihood for our data. Trial and error to find the best pair is tedious at best, but more efficient numerical methods are available. The MLEs for the coefficients in the soccer goalkeeper study are $\hat{\beta_0}= 1.2852$ and $\hat{\beta_1}=1.1127$.
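A rough sketch of this idea, assuming only the counts from Table \@ref(tab:table1chp6): evaluate the log-likelihood over a grid of $(\beta_0, \beta_1)$ pairs and keep the pair with the largest value, then confirm with a general-purpose optimizer. The function and object names here are only for illustration.
```{r, eval=FALSE}
# Log-likelihood (up to a constant) as a function of beta_0 and beta_1
loglik <- function(b0, b1) {
  p1 <- exp(b0 + b1) / (1 + exp(b0 + b1))   # P(goal | behind)
  p0 <- exp(b0) / (1 + exp(b0))             # P(goal | not behind)
  22 * log(p1) + 2 * log(1 - p1) + 141 * log(p0) + 39 * log(1 - p0)
}
# Grid search over plausible values of beta_0 and beta_1
grid <- expand.grid(b0 = seq(0, 2, by = 0.005), b1 = seq(0, 2, by = 0.005))
grid$ll <- loglik(grid$b0, grid$b1)
grid[which.max(grid$ll), ]                            # near (1.285, 1.113)
# A numerical optimizer finds the same maximum more efficiently
optim(c(0, 0), function(b) -loglik(b[1], b[2]))$par   # ~ (1.2852, 1.1127)
```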
```{r, echo=FALSE, comment=NA, include=FALSE}
soccer <- tribble(
~behind, ~scores, ~shots,
#--|--|----
"Yes", 22, 24,
"No", 141, 180
)
soccer <- soccer %>%
mutate(prop1 = scores / shots)
soccer
breg1 <- glm(prop1 ~ behind, weights = shots,
family = binomial(link="logit"), data = soccer )
summary(breg1)
```
Exponentiating $\hat{\beta_1}$ provides an estimate of the odds ratio (the odds of scoring when the goalkeeper's team is behind, compared to the odds of scoring when the team is not behind) of 3.04, which is consistent with our calculations using the 2 $\times$ 2 table. We estimate that the odds of scoring when the goalkeeper's team is behind is over 3 times that of when the team is not behind or, in other words, the odds a shooter is successful in a penalty kick shootout are 3.04 times higher when his team is leading.
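As a check, the probability forms in Equations \@ref(eq:pBehindform) and \@ref(eq:pNotBehindform) recover the observed proportions from the MLEs; a short sketch using the estimates above:
```{r, eval=FALSE}
b0 <- 1.2852; b1 <- 1.1127
exp(b1)                             # estimated odds ratio, approximately 3.04
exp(b0 + b1) / (1 + exp(b0 + b1))   # estimated p when behind, approximately 22/24 = .917
exp(b0) / (1 + exp(b0))             # estimated p when not behind, approximately 141/180 = .783
```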
\vspace{5mm}
**Time out for study discussion (optional).**
- Discuss the following quote from the study abstract: "Because penalty takers shot at the two sides of the goal equally often, the goalkeepers' right-oriented bias was dysfunctional, allowing more goals to be scored."
- Construct an argument for why the greater success observed when the goalkeeper's team was behind might be better explained from the shooter's perspective.
Before we go on, you may be curious as to why there is *no error term* in our model statements for logistic or Poisson regression. One way to look at it is to consider that all models describe how observed values are generated. With the logistic model we assume that the observations are generated as binomial random variables. Each observation or realization of $Y$ = number of successes in $n$ independent and identical trials with a probability of success on any one trial of $p$ is produced by $Y \sim \textrm{Binomial}(n,p)$. So the randomness in this model is not introduced by an added error term, but rather by appealing to a binomial probability distribution, where variability depends only on $n$ and $p$ through $\textrm{Var}(Y)=np(1-p)$, and where $n$ is usually considered fixed and $p$ the parameter of interest.
### Theoretical Rationale (optional)
Recall from Chapter \@ref(ch-glms) that generalized linear models (GLMs) \index{generalized linear models (GLMs)} are a way in which to model a variety of different types of responses. In this chapter, we apply the general results of GLMs to the specific application of binomial responses. Let $Y$ = the number scored out of $n$ penalty kicks. The parameter, $p$, is the probability of a score on a single penalty kick. Recall that the theory of GLMs is based on the unifying notion of the one-parameter exponential family form:
\begin{equation*}
f(y;\theta)=e^{[a(y)b(\theta)+c(\theta)+d(y)]}
\end{equation*}
To see that we can apply the general approach of GLMs \index{generalized linear models (GLMs)} to binomial responses, we first write an expression for the probability of a binomial response and then use a little algebra to rewrite it until we can demonstrate that it, too, can be written in one-parameter exponential family form with $\theta = p$. This will provide a way in which to specify the canonical link and the form for the model. Additional theory allows us to deduce the mean, standard deviation, and more from this form.
If $Y$ follows a binomial distribution with $n$ trials and probability of success $p$, we can write:
\begin{align*}
P(Y=y)&= \binom{n}{y}p^y(1-p)^{(n-y)} \\
&=e^{y\log(p) + (n-y)\log(1-p) + \log\binom{n}{y}}
\end{align*}
However, this probability mass function is not quite in one-parameter exponential family form. Note that there are two terms in the exponent which consist of a product of functions of $y$ and $p$. So more simplification is in order:
\begin{equation*}
P(Y=y) = e^{y\log\left(\frac{p}{1-p}\right) + n\log(1-p)+ \log\binom{n}{y}}
\end{equation*}
Don't forget to consider the support; we must make sure that the set of possible values for this response is not dependent upon $p$. For fixed $n$ and any value of $p$, $0<p<1$, all integer values from $0$ to $n$ are possible, so the support is indeed independent of $p$.
The one-parameter exponential family form for binomial responses shows that the canonical link is $\log\left(\frac{p}{1-p}\right)$. Thus, GLM theory suggests that constructing a model using the logit, the log odds of a score, as a linear function of covariates is a reasonable approach.
## Case Study: Reconstructing Alabama
This case study demonstrates how wide-ranging applications of statistics can be. Many would not associate statistics with historical research, but this case study shows that it can be done. U.S. Census data from 1870 helped historian Michael Fitzgerald of St. Olaf College gain insight into important questions about how railroads were supported during the Reconstruction Era.
In a paper entitled "Reconstructing Alabama: Reconstruction Era Demographic and Statistical Research," Ben Bayer performs an analysis of data from 1870 to explain factors that influence voting on referendums related to railroad subsidies [@Bayer2011]. Support for the railroad (the proportion of Yes votes) is hypothesized to decrease with a community's distance from the proposed railroad, and the racial composition of a community (as measured by the percentage of Black residents) is hypothesized to be associated with voting behavior as well. Separate analyses of three counties in Alabama---Hale, Clarke, and Dallas---were performed; we discuss Hale County here. This example differs from the soccer example in that it includes continuous covariates. Was voting on railroad referenda related to distance from the proposed railroad line and the racial composition of a community?
### Data Organization
The unit of observation for this data is a community in Hale County. We will focus on the following variables from `RR_Data_Hale.csv` collected for each community (see Table \@ref(tab:table2chp6)):
- `pctBlack` = the percentage of Black residents in the community
- `distance` = the distance, in miles, the proposed railroad is from the community
- `YesVotes` = the number of "Yes" votes in favor of the proposed railroad line (our primary response variable)
- `NumVotes` = total number of votes cast in the election
```{r, include=FALSE, comment=NA}
community <- c("Carthage","Cederville","Greensboro","Havana")
pctBlack <- c("58.4","92.4","59.4","58.4")
distance <- c(17,7,0,12)
YesVotes <- c(61,0,1790,16)
NumVotes <- c(110,15,1804,68)
```
```{r, table2chp6, echo=FALSE, comment=NA}
#tab:sampleRows
table2chp6 <- data.frame(community, pctBlack, distance, YesVotes, NumVotes)
colnames(table2chp6) <- c("community","pctBlack", "distance","YesVotes","NumVotes")
kable(table2chp6, booktabs=T,
caption="Sample of the data for the Hale County, Alabama, railroad subsidy vote.") %>%
kable_styling(full_width = F)
```
```{r, include=FALSE}
rrHale.df = read.csv("data/RR_Data_Hale.csv")
rrHale.df <- rrHale.df %>%
dplyr::select(community = County, pctBlack = X..Pop..Black,
distance = Distance.from.RR,
YesVotes = Yes.1st.Vote,
NumVotes = Num.Voting.1st) %>%
filter(!is.na(pctBlack)) %>%
mutate(propYes = YesVotes / NumVotes,
InFavor = (propYes > .50))
rrHale.df
# checking correlation between predictors
with(rrHale.df,cor(pctBlack,distance))
# overall proportion Yes
sum(rrHale.df$YesVotes) / sum(rrHale.df$NumVotes)
# overall proportion Yes without Greensboro
sum(rrHale.df$YesVotes[rrHale.df$community!="Greensboro"]) /
sum(rrHale.df$NumVotes[rrHale.df$community!="Greensboro"])
```
### Exploratory Analyses
We first look at a coded scatterplot to see our data. Figure \@ref(fig:coded) portrays the relationship between `distance` and `pctBlack` coded by the `InFavor` status (whether a community supported the referendum with over 50\% Yes votes). From this scatterplot, we can see that all of the communities in favor of the railroad referendum had over 55\% Black residents, and all of those opposed are 7 miles or farther from the proposed line. The overall percentage of voters in Hale County in favor of the railroad is 87.9\%.
```{r, coded, fig.align="center",out.width="60%", fig.cap=' Scatterplot of distance from a proposed rail line and percent Black residents in the community coded by whether the community was in favor of the referendum or not.',echo=FALSE, warning=FALSE}
#Coded Scatterplot
ggplot(rrHale.df, aes(x=distance, y=pctBlack, color=InFavor)) +
geom_point(aes(shape = InFavor), size = 2.5) +
#scale_shape_manual(values=c(1,2)) + # Use a slightly darker palette than normal
ylab("Percent Black residents in the community") +
xlab("Distance to the proposed railroad")
```
Recall that a model with two covariates has the form:
\[\log(\textrm{odds}) = \log\left(\frac{p}{1-p}\right) = \beta_0+\beta_1X_1+\beta_2X_2.\]
where $p$ is the proportion of Yes votes in a community. In logistic regression, we expect the logits to be a linear function of $X$, the predictors. To assess the linearity assumption, we construct **empirical logit plots**, \index{empirical logit plot} where "empirical" means "based on sample data." Empirical logits are computed for each community by taking $\log\left(\frac{\textrm{number of successes}}{\textrm{number of failures}}\right)$. In Figure \@ref(fig:emplogits), we see that the plot of empirical logits versus distance produces a plot that looks linear, as needed for the logistic regression assumption. In contrast, the empirical logits by percent Black residents reveal that Greensboro deviates quite a bit from the otherwise linear pattern; this suggests that Greensboro is an outlier and possibly an influential point. Greensboro has 99.2\% voting yes, with only 59.4\% Black residents.
```{r,emplogits, fig.align="center",out.width="60%", fig.cap='Empirical logit plots for the Railroad Referendum data.',echo=FALSE, warning=FALSE, message=FALSE}
#elogitXdist and elogitXperBlk
# To check the linearity assumption
# Look for outliers
## Empirical logit Plots
# Compute the empirical logits (added 0.5 to avoid log(0))
phat <- with(rrHale.df, (YesVotes+.5)/(NumVotes+1))
rrHale.df$elogit <- log(phat/(1-phat))
rrHale.df$Greensboro <- ifelse(rrHale.df$community=="Greensboro", "Greensboro", NA)
## Plots
logdis <- ggplot(rrHale.df, aes(x=distance, y=elogit))+
geom_point(shape=1) + # Use hollow circles
geom_smooth(method=lm, # Add linear regression line
se=FALSE) + # Don't add shaded confidence region
xlab("distance") + ylab("empirical logits") +
labs(title="Hale Empirical logits by distance")
logblack <- ggplot(rrHale.df, aes(x=pctBlack, y=elogit))+
geom_point(shape=1) + # Use hollow circles
geom_smooth(method=lm, # Add linear regression line
se=FALSE) + # Don't add shaded confidence region
xlab("percent Black residents") + ylab("empirical logits") +
labs(title="Hale Empirical logits by percent Black residents") +
geom_text(aes(label=Greensboro), nudge_x = 7.5, nudge_y = -.5)
grid.arrange(logdis, logblack ,ncol=1)
```
In addition to examining how the response correlates with the predictors, it is a good idea to determine whether the predictors correlate with one another. Here, the correlation between distance and percent Black residents is negative and moderately strong with $r = -0.49$. We'll watch to see if the correlation affects the stability of our odds ratio estimates.
### Initial Models
The first model includes only one covariate, distance.
```{r, comment = NA}
# Model with just distance
model.HaleD <- glm(cbind(YesVotes, NumVotes - YesVotes) ~
distance, family = binomial, data = rrHale.df)
# alternative expression
model.HaleD.alt <- glm(YesVotes / NumVotes ~ distance,
weights = NumVotes, family = binomial, data = rrHale.df)
```
```{r, echo=FALSE, message=FALSE}
coef(summary(model.HaleD))
cat(" Residual deviance = ",
summary(model.HaleD)$deviance, " on ",
summary(model.HaleD)$df.residual, "df", "\n",
"Dispersion parameter = ",
summary(model.HaleD)$dispersion)
```
Our estimated binomial regression model is:
\[\log\left(\frac{\hat{p}_i}{1-\hat{p}_i}\right)=3.309-0.288 \textrm{distance}_i\]
where $\hat{p}_i$ is the estimated proportion of Yes votes in community $i$. The estimated odds ratio for distance, that is the exponentiated coefficient for distance, in this model is $e^{-0.288}=0.750$. It can be interpreted as follows: for each additional mile from the proposed railroad, the support (odds of a Yes vote) declines by 25.0\%.
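The same odds ratio can be pulled directly from the fitted model object; a quick sketch:
```{r, eval=FALSE}
exp(coef(model.HaleD))   # exponentiated coefficients; the distance entry is the OR of ~0.75
```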
The covariate `pctBlack` is then added to the first model.
```{r, comment = NA}
model.HaleBD <- glm(cbind(YesVotes, NumVotes - YesVotes) ~
distance + pctBlack, family = binomial, data = rrHale.df)
```
```{r, echo=FALSE, message=FALSE}
coef(summary(model.HaleBD))
cat(" Residual deviance = ",
summary(model.HaleBD)$deviance, " on ",
summary(model.HaleBD)$df.residual, "df", "\n",
"Dispersion parameter = ",
summary(model.HaleBD)$dispersion)
```
```{r, include=FALSE}
exp(coef(model.HaleBD))
```
Despite the somewhat strong negative correlation between percent Black residents and distance, the estimated odds ratio for distance remains approximately the same in this new model (OR $= e^{-0.29} = 0.747$); controlling for percent Black residents does little to change our estimate of the effect of distance. For each additional mile from the proposed railroad, the odds of a Yes vote decline by 25.3\% after adjusting for the racial composition of a community. We also see that, for a fixed distance from the proposed railroad, the odds of a Yes vote decline by 1.3\% (OR $= e^{-.0132} = .987$) for each additional percent of Black residents in the community.
### Tests for Significance of Model Coefficients {#sec-logisticInf}
Do we have statistically significant evidence that support for the railroad referendum decreases with higher proportions of Black residents in a community, after accounting for the distance a community is from the railroad line? As discussed in Section \@ref(cs-philippines) with Poisson regression, there are two primary approaches to testing significance of model coefficients: __Drop-in-deviance test to compare models__ \index{drop-in-deviance test} and __Wald test for a single coefficient__. \index{Wald-type test}
With our larger model given by $\log\left(\frac{p_i}{1-p_i}\right) = \beta_0+\beta_1\textrm{distance}_i+\beta_2\textrm{pctBlack}_i$, the Wald test produces a highly significant p-value ($Z=\frac{-0.0132}{0.0039}= -3.394$, $p=.00069$) indicating significant evidence that support for the railroad referendum decreases with higher proportions of Black residents in a community, after adjusting for the distance a community is from the railroad line.
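Both pieces of the Wald test can be read off the coefficient table of the fitted model; the sketch below also recomputes the p-value from the rounded estimate and standard error, so it differs slightly from the value quoted above.
```{r, eval=FALSE}
coef(summary(model.HaleBD))["pctBlack", ]   # estimate, SE, z value, and p-value
2 * pnorm(-abs(-0.0132 / 0.0039))           # two-sided p-value from the rounded values
```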
The drop-in-deviance test would compare the larger model above to the reduced model $\log\left(\frac{p_i}{1-p_i}\right) = \beta_0+\beta_1\textrm{distance}_i$ by comparing residual deviances from the two models.
```{r comment=NA, message=FALSE}
drop_in_dev <- anova(model.HaleD, model.HaleBD, test = "Chisq")
```
```{r comment=NA, message=F, echo=F}
did_print <- data.frame(ResidDF=drop_in_dev$`Resid. Df`,
ResidDev=drop_in_dev$`Resid. Dev`,
Deviance=drop_in_dev$Deviance, Df=drop_in_dev$Df,
pval=drop_in_dev$`Pr(>Chi)`)
row.names(did_print) <- row.names(drop_in_dev)
did_print
```
The drop-in-deviance test statistic is $318.44 - 307.22 = 11.22$ on $9 - 8 = 1$ df, producing a p-value of .00081, in close agreement with the Wald test.
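The p-value can also be computed by hand from the two residual deviances; a one-line sketch:
```{r, eval=FALSE}
1 - pchisq(318.44 - 307.22, df = 9 - 8)   # drop-in-deviance p-value, approximately .0008
```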
A third approach to determining significance of $\beta_2$ would be to generate a 95\% confidence interval and then checking if 0 falls within the interval or, equivalently, if 1 falls within a 95\% confidence interval for $e^{\beta_2}.$ The next section describes two approaches to producing a confidence interval for coefficients in logistic regression models.
### Confidence Intervals for Model Coefficients
Since the Wald statistic follows a normal distribution with $n$ large, we could generate a Wald-type (normal-based) confidence interval \index{Wald-type confidence interval} for $\beta_2$ using:
\[\hat\beta_2 \pm 1.96\cdot\textrm{SE}(\hat\beta_2)\]
and then exponentiating endpoints if we prefer a confidence interval for the odds ratio $e^{\beta_2}$. In this case,
\begin{align*}
95\% \textrm{ CI for } \beta_2 &= \hat{\beta}_2 \pm 1.96 \cdot \textrm{SE}(\hat{\beta}_2) \\
&= -0.0132 \pm 1.96 \cdot 0.0039 \\
&= -0.0132 \pm 0.00764 \\
&= (-0.0208, -0.0056) \\
95\% \textrm{ CI for } e^{\beta_2} &= (e^{-0.0208}, e^{-0.0056}) \\
&= (.979, .994) \\
95\% \textrm{ CI for } e^{10\beta_2} &= (e^{-0.208}, e^{-0.056}) \\
&= (.812, .946)
\end{align*}
Thus, we can be 95\% confident that a 10\% increase in the proportion of Black residents is associated with a 5.4\% to 18.8\% decrease in the odds of a Yes vote for the railroad referendum after controlling for distance. This same relationship could be expressed as (a) between a 0.6\% and a 2.1\% decrease in odds for each 1\% increase in the Black population, or (b) between a 5.7\% ($1/e^{-.056}$) and a 23.1\% ($1/e^{-.208}$) increase in odds for each 10\% decrease in the Black population, after adjusting for distance. Of course, with $n=11$, we should be cautious about relying on a Wald-type interval in this example.
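A sketch of the Wald-type interval computed from the fitted model object (the row and column labels are those produced by `summary()`; the object names `est` and `se` are only for illustration):
```{r, eval=FALSE}
est <- coef(summary(model.HaleBD))["pctBlack", "Estimate"]
se  <- coef(summary(model.HaleBD))["pctBlack", "Std. Error"]
exp(est + c(-1, 1) * 1.96 * se)          # 95% CI for exp(beta_2), roughly (.979, .994)
exp(10 * (est + c(-1, 1) * 1.96 * se))   # 95% CI for a 10-point change in pctBlack
```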
Another approach available in R is the **profile likelihood method**, \index{profile likelihood} similar to Section \@ref(cs-philippines).
```{r, comment=NA, message=FALSE}
exp(confint(model.HaleBD))
```
In the model with `distance` and `pctBlack`, the profile likelihood 95\% confidence interval for $e^{\beta_2}$ is (.979, .994), which is approximately equal to the Wald-based interval despite the small sample size. We can also confirm the statistically significant association between percent Black residents and odds of voting Yes (after controlling for distance), because 1 is not a plausible value of $e^{\beta_2}$ (where an odds ratio of 1 would imply that the odds of voting Yes do not change with percent Black residents).
### Testing for Goodness-of-Fit
As in Section \@ref(sec-PoisGOF), we can evaluate the goodness-of-fit for our model by comparing the residual deviance (307.22) to a $\chi^2$ distribution with $n-p$ (8) degrees of freedom.
```{r, gof2, comment=NA}
1-pchisq(307.2173, 8) # Goodness-of-fit test
```
The model with `pctBlack` and `distance` has statistically significant evidence of lack-of-fit ($p<.001$).
Similar to the Poisson regression models, this lack-of-fit \index{lack-of-fit} could result from (a) missing covariates, (b) outliers, or (c) overdispersion. We will first attempt to address (a) by fitting a model with an interaction between distance and percent Black residents, to determine whether the effect of racial composition differs based on how far a community is from the proposed railroad.
```{r, comment = NA}
model.HaleBxD <- glm(cbind(YesVotes, NumVotes - YesVotes) ~
distance + pctBlack + distance:pctBlack,
family = binomial, data = rrHale.df)
```
```{r, echo=FALSE, message=FALSE}
coef(summary(model.HaleBxD))
cat(" Residual deviance = ",
summary(model.HaleBxD)$deviance, " on ",
summary(model.HaleBxD)$df.residual, "df", "\n",
"Dispersion parameter = ",
summary(model.HaleBxD)$dispersion)
```
```{r comment=NA, message=FALSE}
drop_in_dev <- anova(model.HaleBD, model.HaleBxD,
test = "Chisq")
```
```{r comment=NA, message=F, echo=F}
did_print <- data.frame(ResidDF=drop_in_dev$`Resid. Df`,
ResidDev=drop_in_dev$`Resid. Dev`,
Deviance=drop_in_dev$Deviance, Df=drop_in_dev$Df,
pval=drop_in_dev$`Pr(>Chi)`)
row.names(did_print) <- row.names(drop_in_dev)
did_print
```
We have statistically significant evidence (Wald test: $Z = 5.974, p<.001$; Drop-in-deviance test: $\chi^2=32.984, p<.001$) that the effect of the proportion of community residents who are Black on the odds of voting Yes depends on the distance of the community from the proposed railroad.
To interpret the interaction coefficient in context, we will compare two cases: one where a community is right on the proposed railroad (`distance` = 0), and the other where the community is 15 miles away (`distance` = 15). The significant interaction implies that the effect of `pctBlack` should differ in these two cases. In the first case, the coefficient for `pctBlack` is -0.0647, while in the second case, the relevant coefficient is $-0.0647+15(.00537) = 0.0158$. Thus, for a community right on the proposed railroad, a 1\% increase in percent Black residents is associated with a 6.3\% ($e^{-.0647}=.937$) decrease in the odds of voting Yes, while for a community 15 miles away, a 1\% increase in percent Black residents is associated with a 1.6\% *increase* ($e^{.0158}=1.016$) in the odds of voting Yes. A significant interaction term doesn't always imply a change in the direction of the association, but it does here.
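These two odds ratios can be reproduced from the interaction model's coefficients; a sketch (the coefficient names follow from the model formula):
```{r, eval=FALSE}
b <- coef(model.HaleBxD)
exp(b["pctBlack"])                                 # OR for +1% Black residents at distance = 0 (~0.937)
exp(b["pctBlack"] + 15 * b["distance:pctBlack"])   # OR for +1% Black residents at distance = 15 (~1.016)
```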
Because our interaction model still exhibits lack-of-fit (residual deviance of 274.23 on just 7 df), and because we have used the covariates at our disposal, we will assess this model for potential outliers and overdispersion by examining the model's residuals.
### Residuals for Binomial Regression
With LLSR, residuals were used to assess model assumptions and identify outliers. For binomial regression, two different types of residuals are typically used. One residual, the **Pearson residual**, \index{Pearson residuals} has a form similar to that used with LLSR. Specifically, the Pearson residual is calculated using:
\begin{equation*}
\textrm{Pearson residual}_i = \frac{\textrm{actual count}-\textrm{predicted count}}{\textrm{SD of count}} =
\frac{Y_i-m_i\hat{p_i}}{\sqrt{m_i\hat{p_i}(1-\hat{p_i})}}
\end{equation*}
where $m_i$ is the number of trials for the $i^{th}$ observation and $\hat{p}_i$ is the estimated probability of success for that same observation.
A __deviance residual__ \index{deviance residuals} is an alternative residual for binomial regression based on the discrepancy between the observed values and those estimated using the likelihood.
A deviance residual can be calculated for each observation using:
\begin{equation*}
\textrm{d}_i =
\textrm{sign}(Y_i-m_i\hat{p_i})\sqrt{2[Y_i \log\left(\frac{Y_i}{m_i \hat{p_i}}\right)+
(m_i - Y_i) \log\left(\frac{m_i - Y_i}{m_i - m_i \hat{p_i}}\right)]}
\end{equation*}
When the number of trials is large for all of the observations and the models are appropriate, both sets of residuals should follow a standard normal distribution.
The sum of the squared deviance residuals is referred to as the **deviance** or **residual deviance**. \index{residual deviance} The residual deviance is used to assess the model. As the name suggests, a model with a small deviance is preferred. In the case of binomial regression, when the denominators, $m_i$, are large and a model fits, the residual deviance follows a $\chi^2$ distribution with $n-p$ degrees of freedom (the residual degrees of freedom). Thus for a good fitting model the residual deviance should be approximately equal to its corresponding degrees of freedom. When binomial data meets these conditions, the deviance can be used for a goodness-of-fit test. The p-value for lack-of-fit is the proportion of values from a $\chi_{n-p}^2$ distribution that are greater than the observed residual deviance.
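In R, both types of residuals are available through `residuals()`; the sketch below uses the interaction model fit above and confirms that the squared deviance residuals sum to the residual deviance (the object names are only for illustration).
```{r, eval=FALSE}
p_resid <- residuals(model.HaleBxD, type = "pearson")    # Pearson residuals
d_resid <- residuals(model.HaleBxD, type = "deviance")   # deviance residuals
sum(d_resid^2)            # sums to the residual deviance (274.23 on 7 df)
deviance(model.HaleBxD)   # same value reported by summary()
```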
We begin a residual analysis of our interaction model by plotting the residuals against the fitted values in Figure \@ref(fig:resid). This kind of plot for binomial regression would produce two linear trends with similar negative slopes if there were equal sample sizes $m_i$ for each observation.
```{r, resid,fig.align="center",out.width="60%", fig.cap='Fitted values by residuals for the interaction model for the Railroad Referendum data.',echo=FALSE, warning=FALSE, comment=NA}
rrHale.df <- rrHale.df %>%
mutate(resid.BxD = residuals(model.HaleBxD),
fit.BxD = fitted.values(model.HaleBxD))
ggplot(rrHale.df, aes(x = fit.BxD, y = resid.BxD)) +
geom_point() +
geom_text(aes(label=Greensboro), nudge_x = -.075) +
xlab("Fitted values from interaction model") +
ylab("Deviance residuals from interaction model")
```
From this residual plot, Greensboro does not stand out as an outlier. If it did, we could remove Greensboro and refit our interaction model, checking to see if model coefficients changed in a noticeable way. Instead, we will continue to include Greensboro in our modeling efforts. Because the large residual deviance cannot be explained by outliers, and given we have included all of the covariates at hand as well as an interaction term, the observed binomial counts are likely overdispersed. This means that they exhibit more variation than the model would suggest, and we must consider ways to handle this overdispersion.
### Overdispersion {#sec-logOverdispersion}
Similar to Poisson regression, we can adjust for overdispersion \index{overdispersion} in binomial regression. With overdispersion there is __extra-binomial variation__, \index{extra-binomial variation} so the actual variance will be greater than the variance of a binomial variable, $np(1-p)$. One way to adjust for overdispersion is to estimate a multiplier (dispersion parameter), $\hat{\phi}$, for the variance that will inflate it and reflect the reduction in the amount of information we would otherwise have with independent observations. We used a similar approach to adjust for overdispersion in a Poisson regression model in Section \@ref(sec-overdispPois), and we will use the same estimate here: $\hat\phi=\frac{\sum(\textrm{Pearson residuals})^2}{n-p}$.
When overdispersion is adjusted for in this way, we can no longer use maximum likelihood to fit our regression model; instead we use a quasilikelihood approach. \index{quasilikelihood} Quasilikelihood is similar to likelihood-based inference, but because the model uses the dispersion parameter, it is not a binomial model with a true likelihood (we call it **quasibinomial**). \index{quasibinomial} R offers quasilikelihood as an option when model fitting. The quasilikelihood approach will yield the same coefficient point estimates as maximum likelihood; however, the variances will be larger in the presence of overdispersion (assuming $\phi>1$). We will see other ways in which to deal with overdispersion and clusters in the remaining chapters in the book, but the following describes how overdispersion is accounted for using $\hat{\phi}$:
\vspace{5mm}
__Summary: accounting for overdispersion__
- Use the dispersion parameter $\hat\phi=\frac{\sum(\textrm{Pearson residuals})^2}{n-p}$ to inflate standard errors of model coefficients.
- Wald test statistics: multiply the standard errors by $\sqrt{\hat{\phi}}$ so that $\textrm{SE}_\textrm{Q}(\hat\beta)=\sqrt{\hat\phi}\cdot\textrm{SE}(\hat\beta)$ and conduct tests using the $t$-distribution.
- Confidence intervals use the adjusted standard errors and multiplier based on $t$, so they are thereby wider: $\hat\beta \pm t_{n-p} \cdot \textrm{SE}_\textrm{Q}(\hat\beta)$.
- Drop-in-deviance test statistic comparing Model 1 (larger model with $p$ parameters) to Model 2 (smaller model with $q<p$ parameters) is $F = \frac{1}{\hat\phi} \cdot \frac{D_2 - D_1}{p-q}$ where $D_1$ and $D_2$ are the residual deviances for models 1 and 2, respectively, and $p-q$ is the difference in the number of parameters for the two models. Note that both $D_2-D_1$ and $p-q$ are positive. This test statistic is compared to an F-distribution with $p-q$ and $n-p$ degrees of freedom.
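Before examining R's quasibinomial output, the dispersion parameter can be estimated by hand from the interaction model's Pearson residuals; a short sketch (the name `phi_hat` is only for illustration):
```{r, eval=FALSE}
phi_hat <- sum(residuals(model.HaleBxD, type = "pearson")^2) /
  df.residual(model.HaleBxD)
phi_hat          # estimated dispersion, reported as ~51.6 below
sqrt(phi_hat)    # factor by which standard errors are inflated, ~7.2
```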
Output for a model which adjusts our interaction model for overdispersion appears below, where $\hat{\phi}=51.6$ is used to adjust the standard errors for the coefficients and the drop-in-deviance tests during model building. Standard errors will be inflated by a factor of $\sqrt{51.6}=7.2$. As a result, there are no significant terms in the adjusted interaction model below.
```{r, comment = NA}
model.HaleBxDq <- glm(cbind(YesVotes, NumVotes - YesVotes) ~
distance + pctBlack + distance:pctBlack,
family = quasibinomial, data = rrHale.df)
```
```{r, echo=FALSE, message=FALSE}
coef(summary(model.HaleBxDq))
cat(" Residual deviance = ",
summary(model.HaleBxDq)$deviance, " on ",
summary(model.HaleBxDq)$df.residual, "df", "\n",
"Dispersion parameter = ",
summary(model.HaleBxDq)$dispersion)
```
We therefore remove the interaction term and refit the model, adjusting for the extra-binomial variation that still exists.
```{r, comment = NA}
model.HaleBDq <- glm(cbind(YesVotes, NumVotes - YesVotes) ~
distance + pctBlack,
family = quasibinomial, data = rrHale.df)
```
```{r, echo=FALSE, message=FALSE}
coef(summary(model.HaleBDq))
cat(" Residual deviance = ",
summary(model.HaleBDq)$deviance, " on ",
summary(model.HaleBDq)$df.residual, "df", "\n",
"Dispersion parameter = ",
summary(model.HaleBDq)$dispersion)
```
By removing the interaction term and using the overdispersion parameter, we see that distance is significantly associated with support, but percent Black residents is no longer significant after adjusting for distance.
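The interaction term can also be retested under quasilikelihood using the F version of the drop-in-deviance test described above; a one-line sketch comparing the two quasibinomial fits:
```{r, eval=FALSE}
anova(model.HaleBDq, model.HaleBxDq, test = "F")   # quasilikelihood drop-in-deviance F-test
```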
```{r, include=FALSE}
exp(coef(model.HaleBDq))
```
Because quasilikelihood methods do not change estimated coefficients, we still estimate a 25\% decline $(1-e^{-0.292})$ in support for each additional mile from the proposed railroad (odds ratio of .75).
```{r, comment=NA, message=FALSE}
exp(confint(model.HaleBDq))
```
While we previously found a 95\% confidence interval for the odds ratio associated with distance of (.728, .766), our confidence interval is now much wider: (.609, .871). Appropriately accounting for overdispersion has changed both the significance of certain terms and the precision of our coefficient estimates.
### Summary
We began by fitting a logistic regression model with `distance` alone. Then we added the covariate `pctBlack`, and the Wald-type test and the drop-in-deviance test both provided strong support for the addition of `pctBlack` to the model. The model with `distance` and `pctBlack` had a large residual deviance suggesting an ill-fitted model. In the empirical logit plots, Greensboro appeared to be an extreme observation. Models without Greensboro were fitted and compared to our initial models. Seeing no appreciable improvement or differences with Greensboro removed, we left it in the model. There remained a large residual deviance, so we attempted to account for it by using an estimated dispersion parameter similar to Section \@ref(sec-overdispPois) with Poisson regression. The final model included distance and percent Black residents, although percent Black residents was no longer significant after adjusting for overdispersion.
## Linear Least Squares \index{linear least squares regression (LLSR)} vs. Binomial Regression \index{binomial logistic regression}
\begin{gather*}
\underline{\textrm{Response}} \\
\mathbf{LLSR:}\textrm{ normal} \\
\mathbf{Binomial\ Regression:}\textrm{ number of successes in n trials} \\
\textrm{ } \\
\underline{\textrm{Variance}} \\
\mathbf{LLSR:}\textrm{ equal for each level of}\ X \\
\mathbf{Binomial\ Regression:}\ np(1-p)\textrm{ for each level of}\ X \\
\textrm{ } \\
\underline{\textrm{Model Fitting}} \\
\mathbf{LLSR:}\ \mu=\beta_0+\beta_1x \textrm{ using Least Squares}\\
\mathbf{Binomial\ Regression:}\ \log\left(\frac{p}{1-p}\right)=\beta_0+\beta_1x \textrm{ using Maximum Likelihood}\\
\textrm{ } \\
\underline{\textrm{EDA}} \\
\mathbf{LLSR:}\textrm{ plot $X$ vs. $Y$; add line} \\
\mathbf{Binomial\ Regression:}\textrm{ find $\log(\textrm{odds})$ for several subgroups; plot vs. $X$} \\
\end{gather*}
\begin{gather*}
\underline{\textrm{Comparing Models}} \\
\mathbf{LLSR:}\textrm{ extra sum of squares F-tests; AIC/BIC} \\
\mathbf{Binomial\ Regression:}\textrm{ drop-in-deviance tests; AIC/BIC} \\
\textrm{ } \\
\underline{\textrm{Interpreting Coefficients}} \\
\mathbf{LLSR:}\ \beta_1=\textrm{ change in mean response for unit change in $X$} \\
\mathbf{Binomial\ Regression:}\ e^{\beta_1}=\textrm{ multiplicative change in odds (odds ratio) for unit change in $X$}
\end{gather*}
## Case Study: Trying to Lose Weight
The final case study uses individual-specific information so that our response, rather than the number of successes out of some number of trials, is simply a binary variable taking on values of 0 or 1 (for failure/success, no/yes, etc.). This type of problem---__binary logistic regression__---is exceedingly common in practice. \index{binary logistic regression} Here we examine characteristics of young people who are trying to lose weight. The prevalence of obesity among U.S. youth suggests that wanting to lose weight is sensible and desirable for some young people such as those with a high body mass index (BMI). On the flip side, there are young people who do not need to lose weight but make ill-advised attempts to do so nonetheless. A multitude of studies on weight loss focus specifically on youth and propose a variety of motivations for the young wanting to lose weight; athletics and the media are two commonly cited sources of motivation for losing weight for young people.
Sports have been implicated as a reason for young people wanting to shed pounds, but not all studies are consistent with this idea. For example, a study by @Martinsen2009 reported that, despite preconceptions to the contrary, there was a higher rate of self-reported eating disorders among controls (non-elite athletes) as opposed to elite athletes. Interestingly, the kind of sport was not found to be a factor, as participants in leanness sports (for example, distance running, swimming, gymnastics, dance, and diving) did not differ in the proportion with eating disorders when compared to those in non-leanness sports. So, in our analysis, we will not make a distinction between different sports.
Other studies suggest that mass media is the culprit. They argue that students' exposure to unrealistically thin celebrities may provide unhealthy motivation for some, particularly young women, to try to slim down. An examination and analysis of a large number of related studies (referred to as a __meta-analysis__) [@Grabe2008] found a strong relationship between exposure to mass media and the amount of time that adolescents spend talking about what they see in the media, deciphering what it means, and figuring out how they can be more like the celebrities.
We are interested in the following questions: Are the odds that young females report trying to lose weight greater than the odds that males do? Is increasing BMI associated with an interest in losing weight, regardless of sex? Does sports participation increase the desire to lose weight? Is media exposure associated with more interest in losing weight?
We have a sample of 500 teens from data collected in 2009 through the U.S. Youth Risk Behavior Surveillance System (YRBSS) [@YRBS2009]. The YRBSS is an annual national school-based survey conducted by the Centers for Disease Control and Prevention (CDC) and state, territorial, and local education and health agencies and tribal governments. More information on this survey can be found [here](http://www.cdc.gov/HealthyYouth/yrbs/index.htm).
### Data Organization
Here are the three questions from the YRBSS we use for our investigation:
Q66. Which of the following are you trying to do about your weight?
- A. Lose weight
- B. Gain weight
- C. Stay the same weight
- D. I am not trying to do anything about my weight
Q81. On an average school day, how many hours do you watch TV?
- A. I do not watch TV on an average school day
- B. Less than 1 hour per day
- C. 1 hour per day
- D. 2 hours per day
- E. 3 hours per day
- F. 4 hours per day
- G. 5 or more hours per day
Q84. During the past 12 months, on how many sports teams did you play? (Include any teams run by your school or community groups.)
- A. 0 teams
- B. 1 team
- C. 2 teams
- D. 3 or more teams
Answers to Q66 are used to define our response variable: Y = 1 corresponds to "(A) trying to lose weight", while Y = 0 corresponds to the other non-missing values. Q84 provides information on students' sports participation and is treated as numerical, 0 through 3, with 3 representing 3 or more. As a proxy for media exposure, we use answers to Q81 as numerical values 0, 0.5, 1, 2, 3, 4, and 5, with 5 representing 5 or more. Media exposure and sports participation are also considered as categorical factors, that is, as variables with distinct levels which can be denoted by indicator variables as opposed to their numerical values.
BMI is included in this study as the percentile for a given BMI for members of the same sex. This facilitates comparisons when modeling with males and females. We will use the terms *BMI* and *BMI percentile* interchangeably with the understanding that we are always referring to the percentile.
With our sample, we use only the cases that include all of the data for these four questions. This is referred to as a __complete case analysis__. That brings our sample of 500 to 445. There are limitations of complete case analyses that we address in the Discussion.
### Exploratory Data Analysis
```{r, eval=FALSE, include=FALSE}
risk2009.data = foreign::read.spss("data/yrbs09.sav", to.data.frame=TRUE)
names(risk2009.data)
set.seed(33)
risk2009.samp <- risk2009.data %>%
sample_n(500)
rm(risk2009.data)
# Data Prep
sex = with(risk2009.samp,Q2)
sex=with(risk2009.samp,ifelse(sex=="Female",0,1))
sex.l=with(risk2009.samp,factor(sex, labels = c("Female", "Male")))
table(sex.l)
lose.wt4 = with(risk2009.samp,Q66)
table(lose.wt4)
lose.wt2= with(risk2009.samp,ifelse(Q66=="Lose weight",1,0))
table(lose.wt2)
lose.wt.l=with(risk2009.samp,factor(lose.wt2, labels = c("No weight loss", "Lose weight")))
table(lose.wt.l)
sport4 = with(risk2009.samp,Q84)
table(sport4)
sport = with(risk2009.samp,ifelse(Q84=="0 teams",0,1))
sport.l=with(risk2009.samp,factor(sport, labels = c("No sports", "Sports")))
table(sport.l)
media=with(risk2009.samp,as.numeric(Q81)) # hours of TV per day
table(media)
media=media-2
table(media)
media=ifelse(media==0,0.5,media)
media=ifelse(media==-1,0,media)
table(media)
risk2009 <- risk2009.samp %>%
mutate(sex = sex.l, lose.wt = lose.wt.l, sport = sport.l,
media = media, lose.wt.01 = lose.wt2) %>%
mutate(sport_fac = sport4,
sport_num = parse_number(as.character(sport4))) %>%
dplyr::select(lose.wt, lose.wt.01, sex, sport, sport_fac,
sport_num, media, bmipct) %>%
filter(complete.cases(.))
rm(risk2009.samp)
write_csv(risk2009, "data/risk2009.csv")
```
```{r, include = FALSE}
risk2009 <- read_csv("data/risk2009.csv")
risk2009 <- risk2009 %>%
mutate(female = ifelse(sex=="Female", 1, 0))
#primary response
prop.table( mosaic::tally(~ lose.wt, data = risk2009) )
#covariates
prop.table( mosaic::tally(~ sex, data = risk2009) )
prop.table( mosaic::tally(~ sport, data = risk2009) )
prop.table( mosaic::tally(~ media, data = risk2009) )
risk2009 %>%
summarise(meanBMI = mean(bmipct),
medianBMI = median(bmipct))
#covariates vs losing weight
prop.table( mosaic::tally(lose.wt ~ sex, data = risk2009), 2)
prop.table( mosaic::tally(lose.wt ~ sport,
data = risk2009), 2)
prop.table( mosaic::tally(lose.wt ~ media,
data = risk2009), 2)
```
Nearly half (44.7\%) of our sample of 445 youths report that they are trying to lose weight, 48.1\% of the sample are females, and 59.3\% play on one or more sports teams. Also, 8.8\% report that they do not watch any TV on school days, whereas another 13.0\% watch 5 or more hours each day. Interestingly, the median BMI percentile for our 445 youths is 68. The most dramatic difference in the proportions of those who are trying to lose weight is by sex; 58\% of the females want to lose weight, in contrast to only 32\% of the males (see Figure \@ref(fig:mosaicsexlose)). This provides strong support for the inclusion of a sex term in every model considered.
```{r, mosaicsexlose, fig.align="center", out.width="60%", fig.cap='Weight loss plans vs. sex.',echo=FALSE, warning=FALSE, comment=NA}
#ytable <- mosaic::tally(~ sex + lose.wt, data = risk2009)
#mosaicplot(ytable, color=c("blue", "light blue"), main="")
#prop.table(ytable, 1)
ggplot(data = risk2009, aes(x = sex, fill = lose.wt)) +
geom_bar(position = "fill") +
ylab("Proportion") + scale_fill_grey()
```
```{r, echo=FALSE, include=FALSE, comment=NA}
risk2009 %>%
group_by(sex, lose.wt) %>%
summarise(mean = mean(bmipct), sd = sd(bmipct), n = n())
```
```{r, table3chp6, echo=FALSE, comment=NA}
sex <- c("Female"," ","Male"," ")
wls <- c("No weight loss","Lose weight","No weight loss","Lose weight")
mbmip <- c("43.2","72.4","58.8","85.7")
sds <- c("25.8", "23.0", "28.2", "18.0")
ncol <- c(89, 125, 157, 74)
#ccbmiXsex.tab
table3chp6 <- data.frame(sex,wls,mbmip,sds,ncol)
colnames(table3chp6) <- c("Sex","Weight loss status","mean BMI percentile","SD","n")
kable(table3chp6, booktabs=T,caption="Mean BMI percentile by sex and desire to lose weight.")
```
Table \@ref(tab:table3chp6) displays the mean BMI of those wanting and not wanting to lose weight for males and females. The mean BMI is greater for those trying to lose weight compared to those not trying to lose weight, regardless of sex. The size of the difference is remarkably similar for the two sexes.
If we consider including a BMI term in our model(s), the logit should be linearly related to BMI. We can investigate this assumption by constructing an empirical logit plot. In order to calculate empirical logits, we first divide our data by sex. Within each sex, we generate 10 groups of equal size, the first holding the bottom 10\% in BMI percentile for that sex, the second holding the next lowest 10\%, etc. Within each group, we calculate the proportion, $\hat{p}$, that reported wanting to lose weight, and then the empirical log odds, $\log\left(\frac{\hat{p}}{1-\hat{p}}\right)$, that a young person in that group wants to lose weight.
```{r, logitBMIsex, fig.align="center",out.width="60%", fig.cap='Empirical logits of trying to lose weight by BMI and sex.',echo=FALSE, warning=FALSE, comment=NA, message=FALSE}
risk2009 <- risk2009 %>%
group_by(sex) %>%
mutate(BMIcuts = cut_number(bmipct,10))
emplogit1 <- risk2009 %>%
group_by(sex, BMIcuts) %>%
summarise(prop.lose = mean(lose.wt.01),
n = n(),
midpoint = median(bmipct)) %>%
mutate(prop.lose = ifelse(prop.lose==0, .01, prop.lose),
emplogit = log(prop.lose / (1 - prop.lose)))
ggplot(emplogit1, aes(x = midpoint, y = emplogit, color = sex)) +
geom_point(aes(shape = sex)) +
geom_smooth(aes(linetype = sex), method = "lm") +
xlab("BMI percentile") + ylab("Empirical logits")
```
Figure \@ref(fig:logitBMIsex) presents the empirical logits for the BMI intervals by sex. Both males and females exhibit an increasing linear trend on the logit scale, indicating that increasing BMI is associated with a greater desire to lose weight and that modeling log odds as a linear function of BMI is reasonable. The slope for females appears to be similar to the slope for males, so we do not need to consider an interaction term between BMI and sex in the model.
```{r, mosaicsexsports, fig.align="center", out.width="60%", fig.cap='Weight loss plans vs. sex and sports participation.',echo=FALSE, warning=FALSE, comment=NA}
#ytable <- mosaic::tally(~ sex + sport + lose.wt, data = risk2009)
#mosaicplot(ytable, color = T, main="")
#prop.table(ytable, c(1,2))
ggplot(data = risk2009, aes(x = sport, fill = lose.wt)) +
geom_bar(position = "fill") +
facet_wrap(~sex) +
ylab("Proportion") + scale_fill_grey()
```
Among those who play sports, 44\% want to lose weight, whereas 46\% of those who do not play sports want to lose weight. Figure \@ref(fig:mosaicsexsports) compares the proportion of respondents who want to lose weight by their sex and sports participation. The data suggest that sports participation is associated with the same or even a slightly lower desire to lose weight, contrary to what had originally been hypothesized. While the overall levels of those wanting to lose weight differ considerably between the sexes, the differences between those in and out of sports within each sex appear to be very small. A term for sports participation or number of teams will be considered, but there is not compelling evidence that an interaction term will be needed.
It was posited that increased exposure to media, here measured as hours of TV watched daily, is associated with an increased desire to lose weight, particularly for females. Overall, the percentage who want to lose weight ranges from 38\% of those watching 5 hours of TV per day to 55\% of those watching 2 hours daily. There is minimal variation in the proportion wanting to lose weight when both sexes are combined. However, we are more interested in differences between the sexes (see Figure \@ref(fig:mediaXsex)). We create empirical logits using the proportion of students trying to lose weight for each level of hours spent watching TV daily and look at the trends in the logits separately for males and females. From Figure \@ref(fig:logitmediasex), there does not appear to be a linear relationship for either males or females.
```{r,mediaXsex, fig.align="center",out.width="60%", fig.cap='Weight loss plans vs. daily hours of TV and sex.',echo=FALSE, warning=FALSE, comment=NA, message=FALSE}
#ytable <- mosaic::tally(~ sex + media + lose.wt, data = risk2009)
#mosaicplot(ytable, color = T, main="")
#prop.table(ytable, c(1,2))
ggplot(data = risk2009, aes(x = as.factor(media),
fill = lose.wt)) +
geom_bar(position = "fill") +
facet_wrap(~sex) +
ylab("Proportion") + scale_fill_grey()
```
```{r, logitmediasex, fig.align="center",out.width="60%", fig.cap='Empirical logits for the odds of trying to lose weight by TV watching and sex.',echo=FALSE, warning=FALSE, message=FALSE}
emplogit2 <- risk2009 %>%
group_by(sex, media) %>%
summarise(prop.lose = mean(lose.wt.01), n = n()) %>%
mutate(prop.lose = ifelse(prop.lose==0, .01, prop.lose)) %>%
mutate(emplogit = log(prop.lose / (1 - prop.lose)))
ggplot(emplogit2, aes(x = media, y = emplogit, color = sex)) +
geom_point(aes(shape = sex)) +
geom_smooth(aes(linetype = sex), method = "lm") +
xlab("TV hours per day") + ylab("Empirical logits")
```
### Initial Models
Our modeling strategy is guided by our questions of interest and by what we learned in the exploratory data analysis. For each model we interpret the coefficient of interest, examine the corresponding Wald test and, as a final step, compare the deviances of the different models we considered.
We first use a model where sex is our only predictor.
```{r, comment = NA}
model1 <- glm(lose.wt.01 ~ female, family = binomial,
data = risk2009)
```
```{r, echo=FALSE, message=FALSE}
coef(summary(model1))
cat(" Residual deviance = ",
summary(model1)$deviance, " on ",
summary(model1)$df.residual, "df")
```
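The $Z$ values reported in this output are Wald statistics: each estimate divided by its standard error, compared to a standard normal distribution. A minimal sketch of that calculation for the `female` coefficient, assuming `model1` from the chunk above:
```{r, eval=FALSE}
# Reproduce the Wald test for the female coefficient by hand
est <- coef(summary(model1))["female", "Estimate"]
se  <- coef(summary(model1))["female", "Std. Error"]
z   <- est / se              # Wald z-statistic
2 * pnorm(-abs(z))           # two-sided p-value
```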
Our estimated binomial regression model is:
\[\log\left(\frac{\hat{p}}{1-\hat{p}}\right)=-0.75+1.09 \textrm{female}\]
where $\hat{p}$ is the estimated proportion of youth wanting to lose weight. We can interpret the coefficient on `female` by exponentiating: $e^{1.0919} = 2.98$ (95% CI = $(2.03, 4.41)$), indicating that the odds of a female trying to lose weight are nearly three times the odds of a male trying to lose weight ($Z=5.520$, $p=3.4\times 10^{-8}$). We retain sex in the model and consider adding the BMI percentile:
```{r, include=FALSE}
exp(coef(model1))
exp(confint(model1))
```
```{r, 6model2int, echo=FALSE, include=FALSE, comment=NA}
model2int <- glm(lose.wt.01 ~ female + bmipct + female:bmipct,
family = binomial, data = risk2009)
summary(model2int)
```
```{r, comment = NA}
model2 <- glm(lose.wt.01 ~ female + bmipct,
family = binomial, data = risk2009)
```
```{r, echo=FALSE, message=FALSE}
coef(summary(model2))
cat(" Residual deviance = ",
summary(model2)$deviance, " on ",
summary(model2)$df.residual, "df")
```
We see that there is statistically significant evidence ($Z=8.997, p<.001$) that BMI is positively associated with the odds of trying to lose weight, after controlling for sex. Clearly BMI percentile belongs in the model with sex.
Our estimated binomial regression model is:
\[\log\left(\frac{\hat{p}}{1-\hat{p}}\right)= -4.26+1.86\textrm{female}+0.047\textrm{bmipct}\]
To interpret the coefficient on `bmipct`, we will consider a 10-unit increase in `bmipct`. Because $e^{10 \cdot 0.047}=1.602$, there is an estimated 60.2\% increase in the odds of wanting to lose weight for each additional 10 percentile points of BMI, for members of the same sex. Just as in other multiple regression models, we need to interpret our coefficient *given that the other variables remain constant*. An interaction term for BMI by sex was tested (not shown) and was not significant ($Z=-0.70$, $p=0.485$), so the effect of BMI does not appear to differ by sex.
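This back-of-the-envelope calculation can be verified directly from the fitted model; a minimal sketch, assuming `model2` from above:
```{r, eval=FALSE}
# Estimated odds ratio for a 10-percentile-point increase in BMI, holding sex fixed
exp(10 * coef(model2)["bmipct"])
```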
We next add `sport` to our model. Sports participation was considered for inclusion in the model in three ways: as an indicator of sports participation (0 = no teams, 1 = one or more teams), as the number of teams (0, 1, 2, or 3) treated as numeric, and as the number of teams treated as a factor. The models below use the indicator variable, but all three approaches produced similar results.
```{r, comment = NA}
model3 <- glm(lose.wt.01 ~ female + bmipct + sport,
family = binomial, data = risk2009)
```
```{r, echo=FALSE, message=FALSE}
coef(summary(model3))
cat(" Residual deviance = ",
summary(model3)$deviance, " on ",
summary(model3)$df.residual, "df")
```
```{r, comment = NA}
model3int <- glm(lose.wt.01 ~ female + bmipct + sport +
female:sport + bmipct:sport,
family = binomial, data = risk2009)
```
```{r, echo=FALSE, message=FALSE}
coef(summary(model3int))
cat(" Residual deviance = ",
summary(model3int)$deviance, " on ",
summary(model3int)$df.residual, "df")
```
```{r, eval=FALSE, include=FALSE}
model3a <- glm(lose.wt.01 ~ female + bmipct + sport_fac,
family = binomial, data = risk2009)
summary(model3a)
model3b <- glm(lose.wt.01 ~ female + bmipct + sport_num,
family = binomial, data = risk2009)
summary(model3b)
```
Sports teams were not significant in any of these models, nor were interaction terms (sex by sports and bmipct by sports). As a result, sports participation was no longer considered for inclusion in the model.
We last look at adding `media` to our model.
```{r, comment = NA}
model4 <- glm(lose.wt.01 ~ female + bmipct + media,
family = binomial, data = risk2009)
```
```{r, echo=FALSE, message=FALSE}
coef(summary(model4))
cat(" Residual deviance = ",
summary(model4)$deviance, " on ",
summary(model4)$df.residual, "df")
```
Media is not a statistically significant term ($Z=-1.371$, $p=0.170$). However, because our interest centers on how media may affect attempts to lose weight and how its effect might be different for females and males, we fit a model with a media term and a sex by media interaction term (not shown). Neither term was statistically significant, so we have no support in our data that media exposure as measured by hours spent watching TV is associated with the odds a teen is trying to lose weight after accounting for sex and BMI.
```{r, eval=FALSE, include=FALSE}
model4a <- glm(lose.wt.01 ~ female + bmipct + media +
media:female, family = binomial, data = risk2009)
summary(model4a)
```
### Drop-in-Deviance Tests
```{r comment=NA, message=FALSE}
drop_in_dev <- anova(model1, model2, model3, model4,
test="Chisq")
```
```{r comment=NA, message=F, echo=F}
did_print <- data.frame(ResidDF=drop_in_dev$`Resid. Df`,
ResidDev=drop_in_dev$`Resid. Dev`,
Deviance=drop_in_dev$Deviance, Df=drop_in_dev$Df,
pval=drop_in_dev$`Pr(>Chi)`)
row.names(did_print) <- row.names(drop_in_dev)
did_print
AIC(model1, model2, model3, model4)
```
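Each drop-in-deviance statistic above is simply the difference in residual deviances between nested models, compared to a chi-square distribution with degrees of freedom equal to the number of added parameters. A minimal sketch of this calculation for the comparison of `model1` and `model2`, assuming the models fit above:
```{r, eval=FALSE}
# Drop-in-deviance test: does adding bmipct (model2) improve on the sex-only model1?
dev_drop <- summary(model1)$deviance - summary(model2)$deviance
df_drop  <- summary(model1)$df.residual - summary(model2)$df.residual
pchisq(dev_drop, df = df_drop, lower.tail = FALSE)  # same p-value as anova(..., test = "Chisq")
```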