# Logistic regression {#logreg}
As we saw in Chapters \@ref(simplin) and \@ref(multlin), linear regression assumes that the response variable $Y$ is *continuous*. In this chapter we will see how *logistic regression* can deal with a *discrete* response $Y$. The simplest case is with $Y$ being a *binary* response, that is, a variable encoding two categories. In general, we assume that we have $X_1,\ldots,X_k$ predictors for explaining $Y$ (*multiple* logistic regression) and cover the peculiarities for $k=1$ as particular cases.
## More `R` basics {#logreg-R}
In order to implement some of the contents of this chapter we need to cover more `R` basics, mostly related to flexible plotting that is not implemented directly in `R Commander`. The `R` functions we will see are also very useful for simplifying some `R Commander` approaches.
In the following sections, **type** the code -- do not just copy and paste it systematically -- in the `'R Script'` panel and send it to the output panel. Remember that you should get the same outputs (which are preceded by `## [1]`).
### Data frames revisited {#logreg-R-dataframes}
```{r, echo = TRUE, collapse = TRUE, error = TRUE, cache = TRUE}
# Let's begin importing the iris dataset
data(iris)
# names gives you the variables in the data frame
names(iris)
# The beginning of the data
head(iris)
# So we can access variables by $ or as in a matrix
iris$Sepal.Length[1:10]
iris[1:10, 1]
iris[3, 1]
# Information on the dimension of the data frame
dim(iris)
# str gives the structure of any object in R
str(iris)
# Recall the species variable: it is a categorical variable (or factor),
# not a numeric variable
iris$Species[1:10]
# Factors can only take certain values
levels(iris$Species)
# If a file contains a variable with character strings as observations (either
# encapsulated by quotation marks or not), the variable will become a factor
# when imported into R
```
```{block, type = 'rmdexercise'}
Do the following:
- Import [`auto.txt`](https://raw.githubusercontent.com/egarpor/SSS2-UC3M/master/datasets/auto.txt) into `R` as the data frame `auto`. Check how the character strings in the file give rise to factor variables.
- Get the dimensions of `auto` and show the beginning of the data.
- Retrieve the fifth observation of `horsepower` in two different ways.
- Compute the levels of `name`.
```
### Vector-related functions {#logreg-R-vector}
```{r, echo = TRUE, collapse = TRUE, error = TRUE, cache = TRUE}
# The function seq creates sequences of numbers equally separated
seq(0, 1, by = 0.1)
seq(0, 1, length.out = 5)
# You can shorten the latter argument
seq(0, 1, l = 5)
# Repeat a number
rep(0, 5)
# Reverse a vector
myVec <- c(1:5, -1:3)
rev(myVec)
# Another way
myVec[length(myVec):1]
# Count repetitions in your data
table(iris$Sepal.Length)
table(iris$Species)
```
```{block, type = 'rmdexercise'}
Do the following:
- Create the vector $x=(0.3, 0.6, 0.9, 1.2)$.
- Create a vector of length 100 ranging from $0$ to $1$ with entries equally separated.
- Compute the number of zeros and ones in `x <- c(0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0)`. Check that they are the same as in `rev(x)`.
- Compute the vector $(0.1, 1.1, 2.1, ..., 100.1)$ in four different ways using `seq` and `rev`. Do the same but using `:` instead of `seq`. (Hint: add `0.1`)
```
### Logical conditions and subsetting {#logreg-R-logic}
```{r, echo = TRUE, collapse = TRUE, error = TRUE, cache = TRUE}
# Relational operators: x < y, x > y, x <= y, x >= y, x == y, x!= y
# They return TRUE or FALSE
# Smaller than
0 < 1
# Greater than
1 > 1
# Greater or equal to
1 >= 1 # Remember: ">=" and not "=>"!
# Smaller or equal to
2 <= 1 # Remember: "<=" and not "=<"!
# Equal
1 == 1 # Tests equality. Remember: "==" and not "="!
# Unequal
1 != 0 # Tests inequality
# TRUE is encoded as 1 and FALSE as 0
TRUE + 1
FALSE + 1
# In a vector-like fashion
x <- 1:5
y <- c(0, 3, 1, 5, 2)
x < y
x == y
x != y
# Subsetting of vectors
x
x[x >= 2]
x[x < 3]
# Easy way of working with parts of the data
data <- data.frame(x = c(0, 1, 3, 3, 0), y = 1:5)
data
# Data such that x is zero
data0 <- data[data$x == 0, ]
data0
# Data such that x is larger than 2
data2 <- data[data$x > 2, ]
data2
# In an example
iris$Sepal.Width[iris$Sepal.Width > 3]
# Problem - what happened?
data[x > 2, ]
# In an example
summary(iris)
summary(iris[iris$Sepal.Width > 3, ])
# On a factor variable, only == and != make sense
summary(iris[iris$Species == "setosa", ])
# Subset argument in lm
lm(Sepal.Width ~ Petal.Length, data = iris, subset = Sepal.Width > 3)
lm(Sepal.Width ~ Petal.Length, data = iris, subset = iris$Sepal.Width > 3)
# Both iris$Sepal.Width and Sepal.Width in subset are fine: data = iris
# tells R to look for Sepal.Width in the iris dataset
# Same thing for the subset field in R Commander's menus
# AND operator &
TRUE & TRUE
TRUE & FALSE
FALSE & FALSE
# OR operator |
TRUE | TRUE
TRUE | FALSE
FALSE | FALSE
# Both operators are useful for checking for ranges of data
y
index1 <- (y <= 3) & (y > 0)
y[index1]
index2 <- (y < 2) | (y > 4)
y[index2]
# In an example
summary(iris[iris$Sepal.Width > 3 & iris$Sepal.Width < 3.5, ])
```
```{block, type = 'rmdexercise'}
Do the following for the `iris` dataset:
- Compute the subset corresponding to `Petal.Length` either smaller than `1.5` or larger than `2`. Save this dataset as `irisPetal`.
- Compute and summarize a linear regression of `Sepal.Width` on `Petal.Width + Petal.Length` for the dataset `irisPetal`. What is the $R^2$? (Solution: `0.101`)
- Check that the previous model is the same as regressing `Sepal.Width` on `Petal.Width + Petal.Length` for the dataset `iris` with the appropriate `subset` expression.
- Compute the variance of `Petal.Width` when `Petal.Width` is smaller than or equal to `1.5` and larger than `0.3`. (Solution: `0.1266541`)
```
### Plotting functions {#logreg-R-plot}
```{r, echo = TRUE, collapse = TRUE, error = TRUE, cache = TRUE}
# plot is the main function for plotting in R
# It has a different behavior depending on the kind of object that it receives
# For example, for a regression model, it produces diagnostic plots
mod <- lm(Sepal.Width ~ Sepal.Length, data = iris)
plot(mod, 1)
# How to plot some data
plot(iris$Sepal.Length, iris$Sepal.Width, main = "Sepal.Length vs Sepal.Width")
# Change the axis limits
plot(iris$Sepal.Length, iris$Sepal.Width, xlim = c(0, 10), ylim = c(0, 10))
# How to plot a curve (a parabola)
x <- seq(-1, 1, l = 50)
y <- x^2
plot(x, y)
plot(x, y, main = "A dotted parabola")
plot(x, y, main = "A parabola", type = "l")
plot(x, y, main = "A red and thick parabola", type = "l", col = "red", lwd = 3)
# Plotting a more complicated curve between -pi and pi
x <- seq(-pi, pi, l = 50)
y <- (2 + sin(10 * x)) * x^2
plot(x, y, type = "l") # Kind of rough...
# More detailed plot
x <- seq(-pi, pi, l = 500)
y <- (2 + sin(10 * x)) * x^2
plot(x, y, type = "l")
# Remember that we are joining points for creating a curve!
# For more options in the plot customization see
?plot
?par
# plot is a first level plotting function. That means that whenever it is called,
# it creates a new plot. If we want to add information to an existing plot, we
# have to use a second level plotting function such as points, lines or abline
plot(x, y) # Create a plot
lines(x, x^2, col = "red") # Add lines
points(x, y + 10, col = "blue") # Add points
abline(a = 5, b = 1, col = "orange", lwd = 2) # Add a straight line y = a + b * x
```
### Distributions {#logreg-R-distribs}
The operations on distributions described here are implemented in `R Commander` through the menu `'Distributions'`, but it is convenient for you to grasp how they work.
```{r, echo = TRUE, collapse = TRUE, error = TRUE, cache = TRUE}
# R allows to sample [r], compute density/probability mass [d],
# compute distribution function [p] and compute quantiles [q] for several
# continuous and discrete distributions. The format employed is [rdpq]name,
# where name stands for:
# - norm -> Normal
# - unif -> Uniform
# - exp -> Exponential
# - t -> Student's t
# - f -> Snedecor's F
# - chisq -> Chi squared
# - pois -> Poisson
# - binom -> Binomial
# More distributions:
?Distributions
# Sampling from a Normal - 10 random points from a N(0, 1)
rnorm(n = 10, mean = 0, sd = 1)
# If you want to have always the same result, set the seed of the random number
# generator
set.seed(45678)
rnorm(n = 10, mean = 0, sd = 1)
# Plotting the density of a N(0, 1) - the Gauss bell
x <- seq(-4, 4, l = 100)
y <- dnorm(x = x, mean = 0, sd = 1)
plot(x, y, type = "l")
# Plotting the distribution function of a N(0, 1)
x <- seq(-4, 4, l = 100)
y <- pnorm(q = x, mean = 0, sd = 1)
plot(x, y, type = "l")
# Computing the 95% quantile for a N(0, 1)
qnorm(p = 0.95, mean = 0, sd = 1)
# All distributions have the same syntax: rname(n,...), dname(x,...), pname(q,...)
# and qname(p,...), but the parameters in ... change. Look them up in ?Distributions
# For example, here is the same for the uniform distribution
# Sampling from a U(0, 1)
set.seed(45678)
runif(n = 10, min = 0, max = 1)
# Plotting the density of a U(0, 1)
x <- seq(-2, 2, l = 100)
y <- dunif(x = x, min = 0, max = 1)
plot(x, y, type = "l")
# Computing the 95% quantile for a U(0, 1)
qunif(p = 0.95, min = 0, max = 1)
# Sampling from a Bi(10, 0.5)
set.seed(45678)
samp <- rbinom(n = 200, size = 10, prob = 0.5)
table(samp) / 200
# Plotting the probability mass of a Bi(10, 0.5)
x <- 0:10
y <- dbinom(x = x, size = 10, prob = 0.5)
plot(x, y, type = "h") # Vertical bars
# Plotting the distribution function of a Bi(10, 0.5)
x <- 0:10
y <- pbinom(q = x, size = 10, prob = 0.5)
plot(x, y, type = "h")
```
```{block, type = 'rmdexercise'}
Do the following:
- Compute the 90%, 95% and 99% quantiles of a $F$ distribution with `df1 = 1` and `df2 = 5`. (Answer: `c(4.060420, 6.607891, 16.258177)`)
- Plot the distribution function of a $U(0,1)$. Does it make sense with its density function?
- Sample 100 points from a Poisson with `lambda = 5`.
- Sample 100 points from a $U(-1,1)$ and compute its mean.
- Plot the density of a $t$ distribution with `df = 1` (use a sequence spanning from `-4` to `4`). Add lines of different colors with the densities for `df = 5`, `df = 10`, `df = 50` and `df = 100`. Do you see any pattern?
```
### Defining functions {#logreg-R-funcs}
```{r, echo = TRUE, collapse = TRUE, error = TRUE, cache = TRUE}
# A function is a way of encapsulating a block of code so it can be reused easily
# They are useful for simplifying repetitive tasks and organize the analysis
# For example, in Section 3.7 we had to make use of simpleAnova for computing
# the simple ANOVA table in multiple regression.
# This is a silly function that takes x and y and returns their sum
add <- function(x, y) {
x + y
}
# Calling add - you need to run the definition of the function first!
add(1, 1)
add(x = 1, y = 2)
# A more complex function: computes a linear model and then its summary.
# Saves us a few keystrokes when computing a lm and a summary
lmSummary <- function(formula, data) {
model <- lm(formula = formula, data = data)
summary(model)
}
# Usage
lmSummary(Sepal.Length ~ Petal.Width, iris)
# Recall: there is no variable called model in the workspace.
# The function works on its own workspace!
model
# Add a line to a plot
addLine <- function(x, beta0, beta1) {
lines(x, beta0 + beta1 * x, lwd = 2, col = 2)
}
# Usage
plot(x, y)
addLine(x, beta0 = 0.1, beta1 = 0)
```
## Examples and applications {#logreg-examps}
### Case study: *The Challenger disaster* {#logreg-examps-challenger}
The *Challenger* disaster occurred on the 28th of January of 1986, when the NASA Space Shuttle orbiter *Challenger* broke apart and disintegrated 73 seconds into its flight, leading to the deaths of its seven crew members. The accident deeply shocked US society, partly due to the attention the mission had received because of the presence of Christa McAuliffe, who would have been the first astronaut-teacher. Because of this, NASA TV broadcast the launch live to US public schools, which resulted in millions of schoolchildren witnessing the accident. The accident had serious consequences for NASA's credibility and resulted in an interruption of 32 months in the shuttle program. The Presidential *Rogers Commission* (formed by astronaut Neil A. Armstrong and Nobel laureate Richard P. Feynman, among others) was created to investigate the disaster.
```{r, video3, echo = FALSE, fig.align = 'center', fig.pos = 'h!', fig.cap = 'Challenger launch and subsequent explosion, as broadcast live by NBC on 28/01/1986.', screenshot.opts = list(delay = 15), dev = 'png', cache = TRUE}
knitr::include_url("https://www.youtube.com/embed/fSTrmJtHLFU")
```
The Rogers Commission elaborated a report [@Roberts1986] with all the findings. The commission determined that the disintegration began with the **failure of an O-ring seal in the solid rocket motor, due to the unusually cold temperature (-0.6 Celsius degrees) during the launch**. This failure produced a breach of burning gas through the solid rocket motor that compromised the whole shuttle structure, resulting in its disintegration due to the extreme aerodynamic forces. The **problem with the O-rings was already known**: the night before the launch, there was a three-hour teleconference between motor engineers and NASA management, discussing the effect of the low temperature forecast for the launch on the O-ring performance. The conclusion, influenced by Figure \@ref(fig:rogerts)a, was:
> "Temperature data [are] not conclusive on predicting primary O-ring blowby."
(ref:rogertstitle) Number of incidents in the O-rings (field joints) versus temperature. Panel *a* includes only flights with incidents. Panel *b* contains all flights (with and without incidents).
```{r, rogerts, echo = FALSE, out.width = '70%', fig.align = 'center', fig.pos = 'h!', fig.cap = '(ref:rogertstitle)', fig.show = 'hold', cache = TRUE}
knitr::include_graphics("images/figures/challenger.png")
```
The Rogers Commission noted a major flaw in Figure \@ref(fig:rogerts)a: the **flights with zero incidents were excluded** from the plot because *it was felt* that **these flights did not contribute any information about the temperature effect** (Figure \@ref(fig:rogerts)b). The Rogers Commission concluded:
> "A careful analysis of the flight history of O-ring performance would have revealed the correlation of O-ring damage in low temperature".
The purpose of this case study, inspired by @Dalal1989, is to quantify the influence of the temperature on the probability of having at least one incident related to the O-rings. Specifically, we want to address the following questions:
- Q1. *Is the temperature associated with O-ring incidents?*
- Q2. *In which way did the temperature affect the probability of O-ring incidents?*
- Q3. *What was the predicted probability of an incident in an O-ring for the temperature of the launch day?*
To try to answer these questions we have the `challenger` dataset ([download](https://raw.githubusercontent.com/egarpor/SSS2-UC3M/master/datasets/challenger.txt)). The dataset, shown in Table \@ref(tab:challengertable), contains information regarding the state of the solid rocket boosters after launch^[After the shuttle exits the atmosphere, the solid rocket boosters separate and descend to land using a parachute, where they are carefully analyzed.] for 23 flights. Each row contains, among others, the following variables:
- `fail.field`, `fail.nozzle`: binary variables indicating whether there was an incident with the O-rings in the field joints or in the nozzles of the solid rocket boosters. `1` codifies an incident and `0` its absence. In the analysis, we focus on the O-rings of the field joints, as they were the most determinant for the accident.
- `temp`: temperature on the day of launch, measured in Celsius degrees.
- `pres.field`, `pres.nozzle`: leak-check pressure tests of the O-rings. These tests assured that the rings would seal the joint.
(ref:challengertabletitle) The `challenger` dataset.
```{r, challengertable, echo = FALSE, out.width = '90%', fig.align = 'center', fig.pos = 'h!', cache = TRUE}
challenger <- read.table(file = "datasets/challenger.txt", header = TRUE, sep = "\t")
knitr::kable(
challenger[, c(1, 2, 5:7)],
booktabs = TRUE,
longtable = TRUE,
caption = '(ref:challengertabletitle)'
)
```
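If you want to work with the dataset directly in the `R` console, a minimal import could look like the following sketch (the path `"challenger.txt"` is an assumption -- adapt it to wherever you saved the downloaded file):
```{r, eval = FALSE}
# Import the dataset: a tab-separated text file with a header row
challenger <- read.table(file = "challenger.txt", header = TRUE, sep = "\t")
# Quick inspection of the imported data frame
str(challenger)
head(challenger)
```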
Let's begin the analysis by replicating Figures \@ref(fig:rogerts)a and \@ref(fig:rogerts)b and checking that linear regression is not the right tool for answering Q1--Q3. For that, we make two scatterplots of `nfails.field` (number of total incidents in the field joints) versus `temp`, the first one excluding the launches without incidents (`subset = nfails.field > 0`) and the second one for all the data. Doing it through `R Commander` as we saw in Chapter \@ref(simplin), you should get something similar to:
```{r, echo = FALSE, message = FALSE, warning = FALSE, cache = TRUE}
library(car)
```
```{r, collapse = TRUE, cache = TRUE}
scatterplot(nfails.field ~ temp, reg.line = lm, smooth = FALSE, spread = FALSE,
boxplots = FALSE, data = challenger, subset = nfails.field > 0)
scatterplot(nfails.field ~ temp, reg.line = lm, smooth = FALSE, spread = FALSE,
boxplots = FALSE, data = challenger)
```
There is a fundamental problem in using linear regression for these data: **the response is not continuous**. As a consequence, there is no linearity and the errors around the mean are not normal (indeed, they are strongly non-normal). Let's check this with the corresponding diagnostic plots:
```{r, collapse = TRUE, cache = TRUE}
mod <- lm(nfails.field ~ temp, data = challenger)
par(mfrow = 1:2)
plot(mod, 1)
plot(mod, 2)
```
Although linear regression is not an adequate tool for these data, it is able to detect the obvious difference between the two plots:
1. **The trend for launches with incidents is flat, hence suggesting there is no dependence on the temperature** (Figure \@ref(fig:rogerts)a). This was one of the arguments behind NASA's decision to launch the rocket at a temperature of -0.6 degrees.
2. However, **the trend for *all* launches indicates a clear negative dependence between temperature and number of incidents!** (Figure \@ref(fig:rogerts)b). Think about it this way: the minimum temperature ever recorded for a launch without incidents was above 18 degrees, and the Challenger was launched at -0.6 degrees without clear knowledge of the effects of such low temperatures.
Instead of trying to predict the number of incidents, we will concentrate on modeling the *probability of expecting at least one incident given the temperature*, a simpler but also revealing approach. In other words, we aim to estimate the following curve:
$$
p(x)=\mathbb{P}(\text{incident}=1|\text{temperature}=x)
$$
from `fail.field` and `temp`. This probability cannot be properly modeled as a linear function like $\beta_0+\beta_1x$, since it will inevitably fall outside $[0,1]$ for some values of $x$ (yielding negative probabilities or probabilities larger than one). The technique that solves this problem is **logistic regression**. The idea behind it is quite simple: transform the linear model $\beta_0+\beta_1x$ -- which is aimed at a response in $\mathbb{R}$ -- so that it yields a value in $[0,1]$. This is achieved by the *logistic function*
\begin{align}
\text{logistic}(t)=\frac{e^t}{1+e^t}=\frac{1}{1+e^{-t}}.(\#eq:log)
\end{align}
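To get a feel for \@ref(eq:log), we can plot it with the plotting functions from Section \@ref(logreg-R-plot) (a quick sketch):
```{r, collapse = TRUE, cache = TRUE}
# The logistic function maps R into (0, 1), takes the value 1/2 at t = 0 and
# approaches 0 and 1 as t goes to -infinity and +infinity, respectively
t <- seq(-6, 6, l = 200)
plot(t, 1 / (1 + exp(-t)), type = "l", ylab = "logistic(t)")
abline(h = c(0, 0.5, 1), lty = 2) # Horizontal reference lines
```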
The logistic model in this case is
$$
\mathbb{P}(\text{incident}=1|\text{temperature}=x)=\text{logistic}\left(\beta_0+\beta_1x\right)=\frac{1}{1+e^{-(\beta_0+\beta_1x)}},
$$
with $\beta_0$ and $\beta_1$ unknown. Let's fit the model to the data by estimating $\beta_0$ and $\beta_1$.
In order to fit a logistic regression to the data, go to `'Statistics' -> 'Fit models' -> 'Generalized linear model...'`. A window like the one in Figure \@ref(fig:glm) will pop up, which you should fill in as indicated.
```{r, glm, echo = FALSE, out.width = '70%', fig.align = 'center', fig.pos = 'h!', fig.cap = 'Window for performing logistic regression.', cache = TRUE}
knitr::include_graphics("images/screenshots/glm.png")
```
A code like this will be generated:
```{r, collapse = TRUE, cache = TRUE}
nasa <- glm(fail.field ~ temp, family = "binomial", data = challenger)
summary(nasa)
exp(coef(nasa)) # Exponentiated coefficients ("odds ratios")
```
The summary of the logistic model is notably different from that of linear regression, as the methodology behind it is quite different. Nevertheless, we have tests for the significance of each coefficient. Here we obtain that `temp` is significantly different from zero, at least at level $\alpha=0.05$. Therefore we can conclude that **the temperature is indeed affecting the probability of an incident with the O-rings** (this answers Q1).
The precise interpretation of the coefficients will be given in the next section. For now, the coefficient of `temp`, $\hat\beta_1$, can be regarded as the "correlation between the temperature and the probability of having at least one incident". This correlation, as evidenced by the sign of $\hat\beta_1$, is negative. Let's plot the fitted logistic curve to see that the probability of an incident and the temperature are indeed negatively correlated:
```{r, collapse = TRUE, cache = TRUE}
# Plot data
plot(challenger$temp, challenger$fail.field, xlim = c(-1, 30), xlab = "Temperature",
ylab = "Incident probability")
# Draw the fitted logistic curve
x <- seq(-1, 30, l = 200)
y <- exp(-(nasa$coefficients[1] + nasa$coefficients[2] * x))
y <- 1 / (1 + y)
lines(x, y, col = 2, lwd = 2)
# The Challenger
points(-0.6, 1, pch = 16)
text(-0.6, 1, labels = "Challenger", pos = 4)
```
In view of this curve and the summary of the model, we can conclude that **lower temperatures increased the probability of an O-ring incident** (this answers Q2). Indeed, the confidence intervals for the coefficients show a significantly negative effect of the temperature at level $\alpha=0.05$:
```{r, collapse = TRUE, message = FALSE, cache = TRUE}
confint(nasa, level = 0.95)
```
Finally, **the probability of having at least one incident with the O-rings in the launch day was $0.9996$ according to the fitted logistic model** (Q3). This is easily obtained:
```{r, collapse = TRUE, cache = TRUE}
predict(nasa, newdata = data.frame(temp = -0.6), type = "response")
```
Be aware that `type = "response"` has a different meaning in logistic regression than in linear regression. As you can see, it does not return a CI for the prediction as in linear models. Instead, `type = "response"` means that the *probability* should be returned, rather than the value of the link function, which is returned with `type = "link"` (the default).
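A quick way to check the relation between the two scales is the following sketch (`plogis` is `R`'s logistic distribution function, so it implements \@ref(eq:log)):
```{r, collapse = TRUE, cache = TRUE}
# Prediction in the link (log-odds) scale...
eta <- predict(nasa, newdata = data.frame(temp = -0.6), type = "link")
eta
# ...transformed by the logistic function, gives the probability scale
plogis(eta) # Same as predict(..., type = "response")
```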
Recall that there is a serious problem of **extrapolation** in the prediction, which makes it less precise (or more variable). But this extrapolation, together with the evidence raised by a simple analysis like the one we just did, should have been a strong argument for postponing the launch.
To conclude this section, we refer to a funny and comprehensive exposition by Juan Cuesta (University of Cantabria) on the flawed statistical analysis that contributed to the Challenger disaster.
```{r, video4, echo = FALSE, fig.align = 'center', fig.pos = 'h!', fig.cap = 'The Challenger disaster and other elegant applications of statistics in complex problems.', screenshot.opts = list(delay = 15), dev = 'png', cache = TRUE}
knitr::include_url("https://www.youtube.com/embed/MOeDbQlF5Tw?start=552")
```
## Model formulation and estimation by maximum likelihood {#logreg-model}
As we saw in Section \@ref(multlin-model), the multiple linear model described the relation between the random variables $X_1,\ldots,X_k$ and $Y$ by assuming the linear relation
\begin{align*}
Y = \beta_0 + \beta_1 X_1 + \ldots + \beta_k X_k + \varepsilon.
\end{align*}
Since we assume $\mathbb{E}[\varepsilon|X_1=x_1,\ldots,X_k=x_k]=0$, the previous equation was equivalent to
\begin{align}
\mathbb{E}[Y|X_1=x_1,\ldots,X_k=x_k]=\beta_0+\beta_1x_1+\ldots+\beta_kx_k,(\#eq:eq-lm)
\end{align}
where $\mathbb{E}[Y|X_1=x_1,\ldots,X_k=x_k]$ is the mean of $Y$ for a particular value of the set of predictors. As remarked in Section \@ref(multlin-assumps), it was a *necessary condition that $Y$ was continuous in order to satisfy the normality of the errors*, hence the linear model assumptions. In other words, **the linear model is designed for a continuous response**.
The situation when $Y$ is *discrete* (naturally ordered values; e.g. number of failures, number of students) or *categorical* (non-ordered categories; e.g. territorial divisions, ethnic groups) requires special treatment. The simplest situation is when $Y$ is *binary* (or *dichotomous*): it can only take two values, codified for convenience as $1$ (success) and $0$ (failure). For example, in the Challenger case study we used `fail.field` as an *indicator* of whether "there was at least one incident with the O-rings" (`1` = yes, `0` = no). For binary variables there is no fundamental distinction between the treatment of discrete and categorical variables.
More formally, a binary variable is known as a *Bernoulli variable*, which is the simplest non-trivial random variable. We say that $Y\sim\mathrm{Ber}(p)$, $0\leq p\leq1$, if
\[
Y=\left\{\begin{array}{ll}1,&\text{with probability }p,\\0,&\text{with probability }1-p,\end{array}\right.
\]
or, equivalently, if $\mathbb{P}[Y=1]=p$ and $\mathbb{P}[Y=0]=1-p$, which can be written compactly as
\begin{align}
\mathbb{P}[Y=y]=p^y(1-p)^{1-y},\quad y=0,1.(\#eq:ber)
\end{align}
Recall that a *binomial variable with size $n$ and probability $p$*, $\mathrm{Bi}(n,p)$, is obtained by summing $n$ independent $\mathrm{Ber}(p)$ variables (so $\mathrm{Ber}(p)$ is the same as $\mathrm{Bi}(1,p)$). This is why we use `family = "binomial"` in `glm`: to indicate that the response is binomial.
```{block, type = 'rmdinsight'}
A Bernoulli variable $Y$ is completely determined by the probability $p$. **So are its mean and variance**:
- $\mathbb{E}[Y]=p\times1+(1-p)\times0=p$
- $\mathbb{V}\mathrm{ar}[Y]=p(1-p)$
In particular, recall that $\mathbb{P}[Y=1]=\mathbb{E}[Y]=p$.
This is something relatively uncommon (in a $\mathcal{N}(\mu,\sigma^2)$, $\mu$ determines the mean and $\sigma^2$ the variance) that has important consequences for the logistic model: we do not need a $\sigma^2$.
```
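A quick simulation serves as a sanity check for these formulas (a sketch; the choice $p=0.3$ is arbitrary):
```{r, collapse = TRUE, cache = TRUE}
# Simulate a large sample of Ber(0.3) and compare the empirical mean and
# variance with p and p * (1 - p)
set.seed(12345)
y <- rbinom(n = 1e5, size = 1, prob = 0.3)
mean(y) # Should be close to p = 0.3
var(y) # Should be close to p * (1 - p) = 0.21
```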
```{block, type = 'rmdexercise'}
Are these Bernoulli variables? If so, which is the value of $p$ and what could the codes $0$ and $1$ represent?
- The toss of a fair coin.
- A variable with mean $p$ and variance $p(1-p)$.
- The roll of a die.
- A binary variable with mean $0.5$ and variance $0.45$.
- The winner of an election with two candidates.
```
Assume then that $Y$ is a binary/Bernoulli variable and that $X_1,\ldots,X_k$ are predictors associated to $Y$ (no particular assumptions on them). The purpose in *logistic regression* is to estimate
\[
p(x_1,\ldots,x_k)=\mathbb{P}[Y=1|X_1=x_1,\ldots,X_k=x_k]=\mathbb{E}[Y|X_1=x_1,\ldots,X_k=x_k],
\]
that is, how the probability of $Y=1$ changes according to particular values, denoted by $x_1,\ldots,x_k$, of the random variables $X_1,\ldots,X_k$. $p(x_1,\ldots,x_k)=\mathbb{P}[Y=1|X_1=x_1,\ldots,X_k=x_k]$ stands for the *conditional probability of $Y=1$ given $X_1,\ldots,X_k$*. In view of \@ref(eq:eq-lm), a tempting possibility is to consider the model
\[
p(x_1,\ldots,x_k)=\beta_0+\beta_1x_1+\ldots+\beta_kx_k.
\]
However, such a model will inevitably run into serious problems: negative probabilities and probabilities larger than one. A solution is to consider a function that takes the value $z=\beta_0+\beta_1x_1+\ldots+\beta_kx_k$, in $\mathbb{R}$, and maps it back to $[0,1]$. There are several alternatives to do so, based on distribution functions $F:\mathbb{R}\longrightarrow[0,1]$ that deliver $y=F(z)\in[0,1]$ (see Figure \@ref(fig:logitprobit)). Different choices of $F$ give rise to different models:
- **Uniform**. Truncate $z$ to $0$ and $1$ when $z<0$ and $z>1$, respectively.
- **Logit**. Consider the *logistic distribution function*:
\[
F(z)=\mathrm{logistic}(z)=\frac{e^z}{1+e^z}=\frac{1}{1+e^{-z}}.
\]
- **Probit**. Consider the *normal* distribution function, that is, $F=\Phi$.
(ref:logitprobittitle) Different transformations mapping the response of a simple linear regression $z=\beta_0+\beta_1x$ to $[0,1]$.
```{r, logitprobit, echo = FALSE, results = 'hide', out.width = '70%', fig.show = 'hold', fig.asp = 1, fig.align = 'center', fig.pos = 'h!', fig.cap = '(ref:logitprobittitle)', cache = TRUE}
# Regression
x <- seq(-3, 3, l = 200)
z <- 2 * x
# Truncation
z1 <- z
z1[z1 > 1] <- 1
z1[z1 < 0] <- 0
# Logit
z2 <- 1 / (1 + exp(-z))
# Probit
z3 <- pnorm(z)
# Comparison
plot(x, z, type = "l", ylim = c(-0.1, 1.1), ylab = "Probability", lwd = 2)
lines(x, z1, col = "blue", lwd = 2)
lines(x, z2, col = "red", lwd = 2)
lines(x, z3, col = "green", lwd = 2)
legend("topleft", legend = c("Linear regression", "Uniform", "Logit", "Probit"), lwd = 2,
col = c("black", "blue", "red", "green"))
```
The **logistic** transformation is the most employed due to its **tractability, interpretability and smoothness**. Its inverse, $F^{-1}:[0,1]\longrightarrow\mathbb{R}$, known as the *logit function*, is
\[
\mathrm{logit}(p)=\mathrm{logistic}^{-1}(p)=\log\frac{p}{1-p}.
\]
This is a *link function*, that is, a function that maps a given space (in this case $[0,1]$) into $\mathbb{R}$. The term link function is employed in *generalized linear models*, which follow exactly the same philosophy as logistic regression -- mapping the domain of $Y$ to $\mathbb{R}$ in order to apply a linear model there. We will concentrate here exclusively on the logit as the link function. Therefore, the *logistic model* is
\begin{align}
p(x_1,\ldots,x_k)=\mathrm{logistic}(\beta_0+\beta_1x_1+\ldots+\beta_kx_k)=\frac{1}{1+e^{-(\beta_0+\beta_1x_1+\ldots+\beta_kx_k)}}.(\#eq:eq-log)
\end{align}
The linear form inside the exponent has a clear interpretation:
- If $\beta_0+\beta_1x_1+\ldots+\beta_kx_k=0$, then $p(x_1,\ldots,x_k)=\frac{1}{2}$ ($Y=1$ and $Y=0$ are equally likely).
- If $\beta_0+\beta_1x_1+\ldots+\beta_kx_k<0$, then $p(x_1,\ldots,x_k)<\frac{1}{2}$ ($Y=1$ less likely).
- If $\beta_0+\beta_1x_1+\ldots+\beta_kx_k>0$, then $p(x_1,\ldots,x_k)>\frac{1}{2}$ ($Y=1$ more likely).
To be more precise about the interpretation of the coefficients $\beta_0,\ldots,\beta_k$ we need to introduce the *odds*. The **odds is an equivalent way of expressing the distribution of probabilities in a binary variable**. Since $\mathbb{P}[Y=1]=p$ and $\mathbb{P}[Y=0]=1-p$, both the success and failure probabilities can be inferred from $p$. Instead of using $p$ to characterize the distribution of $Y$, we can use
\begin{align}
\mathrm{odds}(Y)=\frac{p}{1-p}=\frac{\mathbb{P}[Y=1]}{\mathbb{P}[Y=0]}.(\#eq:eq-odds)
\end{align}
The odds is the *ratio between the probability of success and the probability of failure*. It is extensively used in betting^[Recall that the result of a bet is binary: you either win or lose the bet.] due to its better interpretability. For example, if a horse $Y$ has a probability $p=2/3$ of winning a race ($Y=1$), then the odds of the horse is
\[
\text{odds}=\frac{p}{1-p}=\frac{2/3}{1/3}=2.
\]
This means that the horse has a *probability of winning that is twice as large as the probability of losing*. This is sometimes written as $2:1$ or $2 \times 1$ (read "two-to-one"). Conversely, if the odds of $Y$ is given, we can easily recover the probability of success $p$ using the inverse of \@ref(eq:eq-odds):
\[
p=\mathbb{P}[Y=1]=\frac{\text{odds}(Y)}{1+\text{odds}(Y)}.
\]
For example, if the odds of the horse were $5$, that would correspond to a probability of winning $p=5/6$.
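These back-and-forth transformations between $p$ and the odds are easy to check numerically (a quick sketch):
```{r, collapse = TRUE, cache = TRUE}
# From probability to odds
p <- 2/3
p / (1 - p) # Odds equal to 2
# From odds back to probability
odds <- 5
odds / (1 + odds) # p = 5/6
```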
```{block, type ='rmdinsight'}
Recall that the odds is a number in $[0,+\infty]$. The $0$ and $+\infty$ values are attained for $p=0$ and $p=1$, respectively. The log-odds (or logit) is a number in $[-\infty,+\infty]$.
```
We can rewrite \@ref(eq:eq-log) in terms of the odds \@ref(eq:eq-odds). If we do so, we have:
\begin{align}
\mathrm{odds}(Y|X_1=x_1,\ldots,X_k=x_k)&=\frac{p(x_1,\ldots,x_k)}{1-p(x_1,\ldots,x_k)}\nonumber\\
&=e^{\beta_0+\beta_1x_1+\ldots+\beta_kx_k}\nonumber\\
&=e^{\beta_0}e^{\beta_1x_1}\ldots e^{\beta_kx_k}(\#eq:eq-odds-log1)
\end{align}
or, taking logarithms, the *log-odds* (or logit)
\begin{align}
\log(\mathrm{odds}(Y|X_1=x_1,\ldots,X_k=x_k))=\beta_0+\beta_1x_1+\ldots+\beta_kx_k.(\#eq:eq-odds-log2)
\end{align}
The conditional log-odds \@ref(eq:eq-odds-log2) plays here the role of the conditional mean for multiple linear regression. Therefore, we have an analogous interpretation for the coefficients:
- $\beta_0$: is the log-odds when $X_1=\ldots=X_k=0$.
- $\beta_j$, $1\leq j\leq k$: is the **additive** increment of the log-odds for an increment of one unit in $X_j=x_j$, provided that the remaining variables $X_1,\ldots,X_{j-1},X_{j+1},\ldots,X_k$ *do not change*.
The log-odds is not so easy to interpret as the odds. For that reason, an equivalent way of interpreting the coefficients, this time based on \@ref(eq:eq-odds-log1), is:
- $e^{\beta_0}$: is the odds when $X_1=\ldots=X_k=0$.
- $e^{\beta_j}$, $1\leq j\leq k$: is the **multiplicative** increment of the odds for an increment of one unit in $X_j=x_j$, provided that the remaining variables $X_1,\ldots,X_{j-1},X_{j+1},\ldots,X_k$ *do not change*. If the increment in $X_j$ is of $r$ units, then the multiplicative increment in the odds is $(e^{\beta_j})^r$.
As a consequence of this last interpretation, we have:
```{block2, type = 'rmdinsight'}
If $\beta_j>0$ (respectively, $\beta_j<0$) then $e^{\beta_j}>1$ ($e^{\beta_j}<1$) in \@ref(eq:eq-odds-log1). Therefore, an increment of one unit in $X_j$, provided that the remaining variables $X_1,\ldots,X_{j-1},X_{j+1},\ldots,X_k$ do not change, results in an increment (decrement) of the odds, this is, in an increment (decrement) of $\mathbb{P}[Y=1]$.
```
```{block, type = 'rmdcaution'}
Since the relationship between $p(X_1,\ldots,X_k)$ and $X_1,\ldots,X_k$ is not linear, **$\beta_j$ does not correspond to the change in $p(X_1,\ldots,X_k)$ associated with a one-unit increase in $X_j$**.
```
Let's quickly visualize these concepts with the output of the Challenger case study:
```{r, collapse = TRUE, cache = TRUE}
nasa <- glm(fail.field ~ temp, family = "binomial", data = challenger)
summary(nasa)
exp(coef(nasa)) # Exponentiated coefficients ("odds ratios")
# Plot data
plot(challenger$temp, challenger$fail.field, xlim = c(-1, 30), xlab = "Temperature",
ylab = "Incident probability")
# Draw the fitted logistic curve
x <- seq(-1, 30, l = 200)
y <- exp(-(nasa$coefficients[1] + nasa$coefficients[2] * x))
y <- 1 / (1 + y)
lines(x, y, col = 2, lwd = 2)
# The Challenger
points(-0.6, 1, pch = 16)
text(-0.6, 1, labels = "Challenger", pos = 4)
```
The exponentials of the estimated coefficients are:
- $e^{\hat\beta_0}=1965.974$. This means that, *when the temperature is zero*, the fitted odds is $1965.974$, so the probability of having an incident ($Y=1$) is $1965.974$ times larger than the probability of not having an incident ($Y=0$). In other words, the probability of having an incident at temperature zero is $\frac{1965.974}{1965.974+1}=0.999$.
- $e^{\hat\beta_1}=0.659$. This means that each Celsius degree increment in the temperature multiplies the fitted odds by a factor of $0.659\approx\frac{2}{3}$, hence reducing it.
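Both quantities follow directly from `exp(coef(nasa))` and can be double-checked in `R` (a sketch, assuming `nasa` is the fit obtained above):
```{r, collapse = TRUE, cache = TRUE}
# Fitted odds at temp = 0 and the corresponding incident probability
oddsZero <- exp(coef(nasa))[1]
oddsZero / (1 + oddsZero) # Approximately 0.999
# Multiplicative factor in the odds per Celsius degree increment
exp(coef(nasa))[2]
```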
The estimation of $\boldsymbol{\beta}=(\beta_0,\beta_1,\ldots,\beta_k)$ from a sample $(\mathbf{X}_{1},Y_1),\ldots,(\mathbf{X}_{n},Y_n)$ is different than in linear regression. It is not based on minimizing the RSS but on the principle of *Maximum Likelihood Estimation* (MLE). MLE is based on the following *leitmotiv*: **which coefficients $\boldsymbol{\beta}$ make the sample most likely?** Or, in other words, **which coefficients make the model most probable, given the sample?** Since $Y_i\sim \mathrm{Ber}(p(\mathbf{X}_i))$, $i=1,\ldots,n$, the likelihood of $\boldsymbol{\beta}$ is
\begin{align}
\text{lik}(\boldsymbol{\beta})=\prod_{i=1}^np(\mathbf{X}_i)^{Y_i}(1-p(\mathbf{X}_i))^{1-Y_i}.(\#eq:eq-lik)
\end{align}
$\text{lik}(\boldsymbol{\beta})$ is the **probability of the data based on the model**. Therefore, it is a number between $0$ and $1$. Its detailed interpretation is the following:
- $\prod_{i=1}^n$ appears because the sample elements are assumed to be independent and we are computing the probability of observing the whole sample $(\mathbf{X}_{1},Y_1),\ldots,(\mathbf{X}_{n},Y_n)$. This probability is equal to the product of the *probabilities of observing each $(\mathbf{X}_{i},Y_i)$*.
- $p(\mathbf{X}_i)^{Y_i}(1-p(\mathbf{X}_i))^{1-Y_i}$ is the probability of observing $(\mathbf{X}_{i},Y_i)$, as given by \@ref(eq:ber). Remember that $p$ depends on $\boldsymbol{\beta}$ due to \@ref(eq:eq-log).
Usually, the log-likelihood is considered instead of the likelihood for stability reasons -- the estimates obtained are exactly the same and are
\[
\hat{\boldsymbol{\beta}}=\arg\max_{\boldsymbol{\beta}\in\mathbb{R}^{k+1}}\log \text{lik}(\boldsymbol{\beta}).
\]
Unfortunately, due to the non-linearity of the optimization problem, there are no explicit expressions for $\hat{\boldsymbol{\beta}}$. They have to be obtained numerically by means of an iterative procedure (the number of iterations required is printed in the output of `summary`). In low-sample situations with perfect classification, the iterative procedure may not converge.
Figure \@ref(fig:maximumlikelihood) shows how the log-likelihood changes with respect to the values for $(\beta_0,\beta_1)$ in three data patterns.
(ref:maximumlikelihoodtitle) The logistic regression fit and its dependence on $\beta_0$ (horizontal displacement) and $\beta_1$ (steepness of the curve). Recall the effect of the sign of $\beta_1$ in the curve: if positive, the logistic curve has an 's' form; if negative, the form is a reflected 's'. Application also available [here](https://ec2-35-177-34-200.eu-west-2.compute.amazonaws.com/log-maximum-likelihood/).
```{r, maximumlikelihood, echo = FALSE, fig.cap = '(ref:maximumlikelihoodtitle)', screenshot.alt = "images/screenshots/log-maximum-likelihood.png", dev = 'png', cache = TRUE, fig.align = 'center', fig.pos = 'h!', out.width = '90%'}
knitr::include_app('https://ec2-35-177-34-200.eu-west-2.compute.amazonaws.com/log-maximum-likelihood/', height = '900px')
```
The data of the illustration has been generated with the following code:
```{r, cache = TRUE}
# Data
set.seed(34567)
x <- rnorm(50, sd = 1.5)
y1 <- -0.5 + 3 * x
y2 <- 0.5 - 2 * x
y3 <- -2 + 5 * x
y1 <- rbinom(50, size = 1, prob = 1 / (1 + exp(-y1)))
y2 <- rbinom(50, size = 1, prob = 1 / (1 + exp(-y2)))
y3 <- rbinom(50, size = 1, prob = 1 / (1 + exp(-y3)))
# Data
dataMle <- data.frame(x = x, y1 = y1, y2 = y2, y3 = y3)
```
Let's check that the coefficients given by `glm` are indeed the ones that maximize the likelihood shown in the animation of Figure \@ref(fig:maximumlikelihood). We do so for `y1 ~ x`.
```{r, echo = TRUE, collapse = TRUE, cache = TRUE}
mod <- glm(y1 ~ x, family = "binomial", data = dataMle)
mod$coefficients
```
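The maximization can also be replicated numerically. The following sketch codes the log-likelihood corresponding to \@ref(eq:eq-lik) for `y1 ~ x` and maximizes it with `optim` (the starting values `c(0, 0)` are an arbitrary choice); the optimum essentially coincides with `mod$coefficients`.
```{r, collapse = TRUE, cache = TRUE}
# Minus log-likelihood of (beta0, beta1) for the sample (x, y1)
minusLogLik <- function(beta) {
  p <- 1 / (1 + exp(-(beta[1] + beta[2] * dataMle$x)))
  -sum(dataMle$y1 * log(p) + (1 - dataMle$y1) * log(1 - p))
}
# Numerical minimization of the minus log-likelihood
optim(par = c(0, 0), fn = minusLogLik)$par
# Compare with the coefficients returned by glm
mod$coefficients
```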
```{block2, type = 'rmdexercise'}
For the regressions `y2 ~ x` and `y3 ~ x`, do the following:
- Check that the true $\boldsymbol{\beta}$ is close to maximizing the likelihood computed in Figure \@ref(fig:maximumlikelihood).
- Plot the fitted logistic curve and compare it with the one in Figure \@ref(fig:maximumlikelihood).
```
```{block, type = 'rmdinsight'}
In linear regression we relied on least squares estimation, in other words, on the minimization of the RSS. *Why do we need MLE in logistic regression and not least squares?* The answer is twofold:
1. **MLE is asymptotically optimal** when estimating unknown parameters in a model. This means that, when the sample size $n$ is large, it is **guaranteed to perform better than any other estimation method**. Therefore, considering a least squares approach for logistic regression would result in suboptimal estimates.
2. In **multiple linear regression**, due to the *normality* assumption, **MLE and least squares estimation coincide**. So MLE is hidden behind the more intuitive least squares procedure. Indeed, the maximized likelihood $\text{lik}(\hat{\boldsymbol{\beta}})$ in the linear model and the RSS are intimately related.
```
```{block, type = 'rmdinsight'}
As in the linear model, **the inclusion of a new predictor changes the coefficient estimates of the logistic model**.
```
## Assumptions of the model {#logreg-assumps}
Some probabilistic assumptions are required for performing inference on the model parameters $\boldsymbol{\beta}$ from the sample $(\mathbf{X}_1, Y_1),\ldots,(\mathbf{X}_n, Y_n)$. These assumptions are somewhat simpler than the ones for linear regression.
```{r, logisticmodel, echo = FALSE, out.width = '70%', fig.align = 'center', fig.pos = 'h!', fig.cap = 'The key concepts of the logistic model.', cache = TRUE}
knitr::include_graphics("images/R/logisticmodel.png")
```
The assumptions of the logistic model are the following:
i. **Linearity in the logit**^[An equivalent way of stating this assumption is $p(\mathbf{x})=\mathrm{logistic}(\beta_0+\beta_1x_1+\ldots+\beta_kx_k)$.]: $\mathrm{logit}(p(\mathbf{x}))=\log\frac{
p(\mathbf{x})}{1-p(\mathbf{x})}=\beta_0+\beta_1x_1+\ldots+\beta_kx_k$.
ii. **Binariness**: $Y_1,\ldots,Y_n$ are binary variables.
iii. **Independence**: $Y_1,\ldots,Y_n$ are independent.
A good one-line summary of the logistic model is the following (independence is assumed)
\begin{align}
Y|(X_1=x_1,\ldots,X_k=x_k)&\sim\mathrm{Ber}\left(\mathrm{logistic}(\beta_0+\beta_1x_1+\ldots+\beta_kx_k)\right)\nonumber\\
&=\mathrm{Ber}\left(\frac{1}{1+e^{-(\beta_0+\beta_1x_1+\ldots+\beta_kx_k)}}\right).(\#eq:condber)
\end{align}
There are three important points from the linear model assumptions that are missing in the ones for the logistic model:
- *Why is homoscedasticity not required?* As seen in the previous section, Bernoulli variables are determined only by the probability of success, in this case $p(\mathbf{x})$. That probability also determines the variance, which therefore changes with $\mathbf{x}$, so **there is heteroskedasticity**. In the linear model, we have to control $\sigma^2$ explicitly due to the higher flexibility of the normal.
- *Where are the errors?* The errors played a fundamental role in the linear model assumptions, but they are not employed in logistic regression. The errors are not fundamental for building the linear model, but rather a helpful concept related to least squares. The linear model can be stated without errors as \@ref(eq:condnorm), which has a logistic analogue in \@ref(eq:condber).
- *Why is normality not present?* A normal distribution is not adequate to replace the Bernoulli distribution in \@ref(eq:condber), since the response $Y$ *has to be binary* and the normal or any other continuous distribution would yield illegal values for $Y$.
```{block, type = 'rmdinsight'}
Recall that:
- Nothing is said about the distribution of $X_1,\ldots,X_k$. They could be deterministic or random. They could be discrete or continuous.
- $X_1,\ldots,X_k$ are **not required to be independent** between them.
```
## Inference for model parameters {#logreg-inference}
The assumptions on which the logistic model is constructed allow us to specify the *asymptotic* distribution of the *random vector* $\hat{\boldsymbol{\beta}}$. Again, the distribution is derived conditionally on the sample predictors $\mathbf{X}_1,\ldots,\mathbf{X}_n$. In other words, we assume that the randomness of $Y$ comes only from $Y|(X_1=x_1,\ldots,X_k=x_k)\sim\mathrm{Ber}(p(\mathbf{x}))$ and not from the predictors. To denote this, we employ lowercase for the sample predictors, $\mathbf{x}_1,\ldots,\mathbf{x}_n$.
There is an important difference between the inference results for the linear model and for logistic regression:
- **In linear regression the inference is exact**. This is due to the nice properties of the normal, least squares estimation and linearity. As a consequence, the distributions of the coefficients are perfectly known assuming that the assumptions hold.
- **In logistic regression the inference is asymptotic**. This means that the distributions of the coefficients are unknown except for large sample sizes $n$, for which we have **approximations**. The reason is the greater complexity of the model due to its non-linearity. This is the usual situation for the majority of regression models.
### Distributions of the fitted coefficients {#logreg-inference-distribs}
The distribution of $\hat{\boldsymbol{\beta}}$ is given by the asymptotic theory of MLE:
\begin{align}
\hat{\boldsymbol{\beta}}\sim\mathcal{N}_{k+1}\left(\boldsymbol{\beta},I(\boldsymbol{\beta})^{-1}\right)
(\#eq:mle1)
\end{align}
where $\sim$ must be understood as *approximately distributed as [...] when $n\to\infty$* for the rest of this chapter. $I(\boldsymbol{\beta})$ is known as the *Fisher information matrix*, and receives that name because *it measures the information available in the sample for estimating $\boldsymbol{\beta}$*. Therefore, the *larger* the matrix is, the more precise the estimation of $\boldsymbol{\beta}$ is, because that results in smaller variances in \@ref(eq:mle1). The inverse of the Fisher information matrix is
\begin{align}
I(\boldsymbol{\beta})^{-1}=(\mathbf{X}^T\mathbf{V}\mathbf{X})^{-1},
(\#eq:mle2)
\end{align}
where $\mathbf{V}$ is a diagonal matrix containing the *different* variances for each $Y_i$ (remember that $p(\mathbf{x})=1/(1+e^{-(\beta_0+\beta_1x_1+\ldots+\beta_kx_k)})$):
\[
\mathbf{V}=\begin{pmatrix}
p(\mathbf{X}_1)(1-p(\mathbf{X}_1)) & & &\\
& p(\mathbf{X}_2)(1-p(\mathbf{X}_2)) & & \\
& & \ddots & \\
& & & p(\mathbf{X}_n)(1-p(\mathbf{X}_n))
\end{pmatrix}
\]
In the case of the multiple linear regression, $I(\boldsymbol{\beta})^{-1}=\sigma^2(\mathbf{X}^T\mathbf{X})^{-1}$ (see \@ref(eq:normp)), so the presence of $\mathbf{V}$ here is revealing the heteroskedasticity of the model.
The interpretation of \@ref(eq:mle1) and \@ref(eq:mle2) gives some useful insights on what concepts affect the quality of the estimation:
- **Bias**. The estimates are *asymptotically* unbiased.
- **Variance**. It depends on:
- *Sample size $n$*. Hidden inside $\mathbf{X}^T\mathbf{V}\mathbf{X}$. As $n$ grows, the precision of the estimators increases.
- *Weighted predictor sparsity $(\mathbf{X}^T\mathbf{V}\mathbf{X})^{-1}$*. The more *sparse* the predictor is (small $|(\mathbf{X}^T\mathbf{V}\mathbf{X})^{-1}|$), the more precise $\hat{\boldsymbol{\beta}}$ is.
```{block2, type = 'rmdinsight'}
**The precision of $\hat{\boldsymbol{\beta}}$ is affected by the value of $\boldsymbol{\beta}$**, which is hidden inside $\mathbf{V}$. This contrasts sharply with the linear model, where the precision of the least squares estimator was not affected by the value of the unknown coefficients (see \@ref(eq:normp)). The reason is partially due to the **heteroskedasticity** of logistic regression, which implies a dependence of the variance of $Y$ in the logistic curve, hence in $\boldsymbol{\beta}$.
```
(ref:lograndomcoefstitle) Illustration of the randomness of the fitted coefficients $(\hat\beta_0,\hat\beta_1)$ and the influence of $n$, $(\beta_0,\beta_1)$ and $s_x^2$. The sample predictors $x_1,\ldots,x_n$ are fixed and new responses $Y_1,\ldots,Y_n$ are generated each time from a logistic model $Y|X=x\sim\mathrm{Ber}(p(X))$. Application also available [here](https://ec2-35-177-34-200.eu-west-2.compute.amazonaws.com/log-random/).
```{r, lograndomcoefs, echo = FALSE, fig.cap = '(ref:lograndomcoefstitle)', screenshot.alt = "images/screenshots/log-random.png", dev = 'png', cache = TRUE, fig.align = 'center', fig.pos = 'h!', out.width = '90%'}
knitr::include_app('https://ec2-35-177-34-200.eu-west-2.compute.amazonaws.com/log-random/', height = '1000px')
```
Similarly to linear regression, the problem with \@ref(eq:mle1) and \@ref(eq:mle2) is that *$\mathbf{V}$ is unknown* in practice, because it depends on $\boldsymbol{\beta}$. Plugging the estimate $\hat{\boldsymbol{\beta}}$ into $\mathbf{V}$ in place of $\boldsymbol{\beta}$ results in $\hat{\mathbf{V}}$. Now we can use $\hat{\mathbf{V}}$ to get
\begin{align}
\frac{\hat\beta_j-\beta_j}{\hat{\mathrm{SE}}(\hat\beta_j)}\sim \mathcal{N}(0,1),\quad\hat{\mathrm{SE}}(\hat\beta_j)^2=v_j^2(\#eq:mle3)
\end{align}
where
\[
v_j\text{ is the }j\text{-th element of the diagonal of }(\mathbf{X}^T\hat{\mathbf{V}}\mathbf{X})^{-1}.
\]
The LHS of \@ref(eq:mle3) is the **Wald statistic** for $\beta_j$, $j=0,\ldots,k$. These statistics are employed for building confidence intervals and hypothesis tests.
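To see \@ref(eq:mle2) and \@ref(eq:mle3) in action, the following sketch recomputes the standard errors of the Challenger fit from the design matrix and the fitted probabilities, and compares them with the ones reported by `summary` (it assumes `nasa` is the model fitted before):
```{r, collapse = TRUE, cache = TRUE}
# Design matrix and fitted probabilities
X <- model.matrix(nasa)
pHat <- nasa$fitted.values
# Estimated inverse Fisher information (X^T V X)^{-1}, with V estimated
V <- diag(pHat * (1 - pHat))
covBeta <- solve(t(X) %*% V %*% X)
# Standard errors: square roots of the diagonal
sqrt(diag(covBeta))
# The same standard errors given by summary
summary(nasa)$coefficients[, "Std. Error"]
```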
### Confidence intervals for the coefficients {#logreg-inference-cis}
Thanks to \@ref(eq:mle3), we can have the $100(1-\alpha)\%$ CI for the coefficient $\beta_j$, $j=0,\ldots,k$:
\begin{align}
\left(\hat\beta_j\pm\hat{\mathrm{SE}}(\hat\beta_j)z_{\alpha/2}\right)(\#eq:ciplog)
\end{align}
where $z_{\alpha/2}$ is the *upper $\alpha/2$-quantile of the $\mathcal{N}(0,1)$*. If we are interested in the CI for $e^{\beta_j}$, we can simply take exponentials in the above CI. Hence, the $100(1-\alpha)\%$ CI for $e^{\beta_j}$, $j=0,\ldots,k$, is
\[
e^{\left(\hat\beta_j\pm\hat{\mathrm{SE}}(\hat\beta_j)z_{\alpha/2}\right)}.
\]
Of course, this CI is **not** the same as $\left(e^{\hat\beta_j}\pm e^{\hat{\mathrm{SE}}(\hat\beta_j)z_{\alpha/2}}\right)$, which is not a valid CI for $e^{\beta_j}$.
Let's see how we can compute the CIs. We return to the `challenger` dataset, so in case you do not have it loaded, you can download it [here](https://raw.githubusercontent.com/egarpor/SSS2-UC3M/master/datasets/challenger.txt). We analyze the CI for the coefficients of `fail.field ~ temp`.
```{r, collapse = TRUE, cache = TRUE}
# Fit model
nasa <- glm(fail.field ~ temp, family = "binomial", data = challenger)
# Confidence intervals at 95%
confint(nasa)
# Confidence intervals at other levels
confint(nasa, level = 0.90)
# Confidence intervals for the factors affecting the odds
exp(confint(nasa))
```
In this example, the 95% confidence interval for $\beta_0$ is $(1.3364, 17.7834)$ and for $\beta_1$ it is $(-0.9238, -0.1090)$. For $e^{\beta_0}$ and $e^{\beta_1}$, the CIs are $(3.8053, 5.2874\times10^7)$ and $(0.3070, 0.8967)$, respectively. Therefore, we can say with 95% confidence that:
- When `temp`=0, the probability of `fail.field`=1 is significantly larger than the probability of `fail.field`=0 (using the CI for $\beta_0$). Indeed, `fail.field`=1 is between $3.8053$ and $5.2874\times10^7$ times more likely than `fail.field`=0 (using the CI for $e^{\beta_0}$).
- `temp` has a significantly negative effect on the probability of `fail.field`=1 (using the CI for $\beta_1$). Indeed, each unit increase in `temp` multiplies the odds of `fail.field` by a factor between $0.3070$ and $0.8967$ (using the CI for $e^{\beta_1}$).
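The CI formula \@ref(eq:ciplog) can also be applied by hand, as in the following sketch. Note that it reproduces `confint.default` (based on the same asymptotic approximation) rather than `confint`, a distinction discussed in Section \@ref(logreg-inference-tests):
```{r, collapse = TRUE, cache = TRUE}
# 95% CIs from the formula: estimate +- z_{alpha/2} * SE
est <- summary(nasa)$coefficients[, "Estimate"]
se <- summary(nasa)$coefficients[, "Std. Error"]
z <- qnorm(0.975)
cbind(lwr = est - z * se, upr = est + z * se)
# Coincides with confint.default
confint.default(nasa)
```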
```{block2, type = 'rmdexercise'}
Compute and interpret the CIs for the exponentiated coefficients, at level $\alpha=0.05$, for the following regressions (`challenger` dataset):
- `fail.field ~ temp + pres.field`
- `fail.nozzle ~ temp + pres.nozzle`
- `fail.field ~ temp + pres.nozzle`
- `fail.nozzle ~ temp + pres.field`
The interpretation of the variables is given above Table \@ref(tab:challengertable).
```
### Testing on the coefficients {#logreg-inference-tests}
The distributions in \@ref(eq:mle3) also allow us to conduct formal hypothesis tests on the coefficients $\beta_j$, $j=0,\ldots,k$. For example, the test for significance:
\begin{align*}
H_0:\beta_j=0
\end{align*}
for $j=0,\ldots,k$. The test of $H_0:\beta_j=0$ with $1\leq j\leq k$ is especially interesting, since it allows us to answer whether *the variable $X_j$ has a significant effect on $\mathbb{P}[Y=1]$*. The statistic used for testing significance is the Wald statistic
\begin{align*}
\frac{\hat\beta_j-0}{\hat{\mathrm{SE}}(\hat\beta_j)},
\end{align*}
which is asymptotically distributed as a $\mathcal{N}(0,1)$ *under (the veracity of) the null hypothesis*. $H_0$ is tested *against* the *two-sided* alternative hypothesis $H_1:\beta_j\neq 0$.
The tests for significance are built into the `summary` function. However, a note of caution is required when applying the following rule of thumb:
> Is the CI for $\beta_j$ below (above) $0$ at level $\alpha$?
>
>- Yes $\rightarrow$ reject $H_0$ at level $\alpha$.
>- No $\rightarrow$ the criterion is not conclusive.
```{block, type = 'rmdcaution'}
The significances given in `summary` and the output of `confint` are *slightly* incoherent, so the previous rule of thumb **does not apply**. The reason is that `MASS`'s `confint` uses a more sophisticated method (profile likelihood) to build the CIs, rather than the asymptotic distribution behind the Wald statistic.
By changing `confint` to `R`'s default `confint.default`, the results will be completely equivalent to the significances in `summary`, and the rule of thumb will remain completely valid. For the contents of this course we prefer `confint.default` due to its better interpretability.
```
To illustrate this we consider the regression of `fail.field ~ temp + pres.field`:
```{r, collapse = TRUE, cache = TRUE}
# Significances with asymptotic approximation for the standard errors
nasa2 <- glm(fail.field ~ temp + pres.field, family = "binomial",
data = challenger)
summary(nasa2)
# CIs with asymptotic approximation - coherent with summary
confint.default(nasa2, level = 0.90)
confint.default(nasa2, level = 0.99)
# CIs with profile likelihood - incoherent with summary
confint(nasa2, level = 0.90) # intercept still significant
confint(nasa2, level = 0.99) # temp still significant
```
```{block, type = 'rmdexercise'}
For the previous exercise, check the differences between using `confint` and `confint.default` for computing the CIs.
```
## Prediction {#logreg-prediction}
*Prediction* in logistic regression focuses mainly on predicting the values of the logistic curve
\[
p(x_1,\ldots,x_k)=\mathbb{P}[Y=1|X_1=x_1,\ldots,X_k=x_k]=\frac{1}{1+e^{-(\beta_0+\beta_1x_1+\ldots+\beta_kx_k)}}
\]
by means of
\[
\hat p(x_1,\ldots,x_k)=\hat{\mathbb{P}}[Y=1|X_1=x_1,\ldots,X_k=x_k]=\frac{1}{1+e^{-(\hat\beta_0+\hat\beta_1x_1+\ldots+\hat\beta_kx_k)}}.
\]
From the perspective of the linear model, this is the same as predicting the **conditional mean** (not the conditional response) of the response, except that this conditional mean is now also a **conditional probability**. The prediction of the conditional response is not so interesting, since it follows immediately from $\hat p(x_1,\ldots,x_k)$:
\[
\hat{Y}|(X_1=x_1,\ldots,X_k=x_k)=\left\{\begin{array}{ll}1,&\text{with probability }\hat p(x_1,\ldots,x_k),\\0,&\text{with probability }1-\hat p(x_1,\ldots,x_k).\end{array}\right.
\]
As a consequence, we can predict $Y$ as $1$ if $\hat p(x_1,\ldots,x_k)>\frac{1}{2}$ and as $0$ if $\hat p(x_1,\ldots,x_k)<\frac{1}{2}$.
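For example, applying this rule to the fitted probabilities of the Challenger model gives a quick in-sample classification of the launches (a sketch, assuming `nasa` is the model fitted before):
```{r, collapse = TRUE, cache = TRUE}
# Predict an incident whenever the fitted probability exceeds 1/2 and
# cross-tabulate against the observed incidents
yHat <- ifelse(nasa$fitted.values > 0.5, 1, 0)
table(observed = challenger$fail.field, predicted = yHat)
```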
Let's focus then on how to make predictions and compute CIs in practice with `predict`. Similarly to the linear model, the objects required for `predict` are: first, the output of `glm`; second, a `data.frame` containing the locations $\mathbf{x}=(x_1,\ldots,x_k)$ where we want to predict $p(x_1,\ldots,x_k)$. However, there are **two differences** with respect to the use of `predict` for `lm`:
- The argument `type`. `type = "link"` gives the predictions in the log-odds, that is, it returns $\log\frac{\hat p(x_1,\ldots,x_k)}{1-\hat p(x_1,\ldots,x_k)}$. `type = "response"` gives the predictions in the probability space $[0,1]$, that is, it returns $\hat p(x_1,\ldots,x_k)$.
- There is no `interval` argument in `predict` for `glm`. That means there is no easy, built-in way of computing CIs for the prediction.
Since it is a bit cumbersome to compute the CIs by yourself, we can code the function `predictCIsLogistic` so that it computes them automatically for you; see below.
```{r, collapse = TRUE, cache = TRUE}
# Data for which we want a prediction
# Important! You have to name the column with the predictor name!
newdata <- data.frame(temp = -0.6)
# Prediction of the conditional log-odds - the default
predict(nasa, newdata = newdata, type = "link")
# Prediction of the conditional probability
predict(nasa, newdata = newdata, type = "response")
# Function for computing the predictions and CIs for the conditional probability
predictCIsLogistic <- function(object, newdata, level = 0.95) {
# Compute predictions in the log-odds
pred <- predict(object = object, newdata = newdata, se.fit = TRUE)
# CI in the log-odds
za <- qnorm(p = (1 - level) / 2)
lwr <- pred$fit + za * pred$se.fit
upr <- pred$fit - za * pred$se.fit
# Transform to probabilities
fit <- 1 / (1 + exp(-pred$fit))
lwr <- 1 / (1 + exp(-lwr))
upr <- 1 / (1 + exp(-upr))
# Return a matrix with column names "fit", "lwr" and "upr"
result <- cbind(fit, lwr, upr)
colnames(result) <- c("fit", "lwr", "upr")
return(result)
}
# Simple call
predictCIsLogistic(nasa, newdata = newdata)
# The CI is large because there is no data around temp = -0.6 and
# that makes the prediction more variable (and also because we only
# have 23 observations)
```
```{block, type = 'rmdexercise'}
For the `challenger` dataset, do the following:
- Regress `fail.nozzle` on `temp` and `pres.nozzle`.
- Compute the predicted probability of `fail.nozzle=1` for `temp`$=15$ and `pres.nozzle`$=200$. What is the predicted probability for `fail.nozzle=0`?
- Compute the confidence interval for the two predicted probabilities at level $95\%$.
```
Finally, Figure \@ref(fig:logcipred) gives an interactive visualization of the CIs for the conditional probability in simple logistic regression. Their interpretation is very similar to that of the CIs for the conditional mean in the simple linear model; see Section \@ref(simplin-prediction) and Figure \@ref(fig:cipred).
(ref:logcipredtitle) Illustration of the CIs for the conditional probability in the simple logistic regression. Application also available [here](https://ec2-35-177-34-200.eu-west-2.compute.amazonaws.com/log-ci-prediction/).
```{r, logcipred, echo = FALSE, fig.cap = '(ref:logcipredtitle)', screenshot.alt = "images/screenshots/log-ci-prediction.png", dev = 'png', cache = TRUE, fig.align = 'center', fig.pos = 'h!', out.width = '90%'}