# Logistic regression {#logreg}
As we saw in Chapters \@ref(simplin) and \@ref(multlin), linear regression assumes that the response variable $Y$ is *continuous*. In this chapter we will see how *logistic regression* can deal with a *discrete* response $Y$. The simplest case is with $Y$ being a *binary* response, that is, a variable encoding two categories. In general, we assume that we have $X_1,\ldots,X_k$ predictors for explaining $Y$ (*multiple* logistic regression) and cover the peculiarities for $k=1$ as particular cases.
## More `R` basics {#logreg-R}
In order to implement some of the contents of this chapter we need to cover more `R` basics, mostly related to flexible plotting that is not implemented directly in `R Commander`. The `R` functions we will see are also very useful for simplifying some `R Commander` approaches.
In the following sections, **type** the code -- do not just copy and paste it systematically -- in the `'R Script'` panel and send it to the output panel. Remember that you should get the same outputs (which are preceded by `## [1]`).
### Data frames revisited {#logreg-R-dataframes}
```{r, echo = TRUE, collapse = TRUE, error = TRUE, cache = TRUE}
# Let's begin importing the iris dataset
data(iris)
# names gives you the variables in the data frame
names(iris)
# The beginning of the data
head(iris)
# So we can access variables by $ or as in a matrix
iris$Sepal.Length[1:10]
iris[1:10, 1]
iris[3, 1]
# Information on the dimension of the data frame
dim(iris)
# str gives the structure of any object in R
str(iris)
# Recall the species variable: it is a categorical variable (or factor),
# not a numeric variable
iris$Species[1:10]
# Factors can only take certain values
levels(iris$Species)
# If a file contains a variable with character strings as observations (either
# encapsulated by quotation marks or not), the variable will become a factor
# when imported into R
```
```{block, type = 'rmdexercise'}
Do the following:
- Import [`auto.txt`](https://raw.githubusercontent.com/egarpor/SSS2-UC3M/master/datasets/auto.txt) into `R` as the data frame `auto`. Check how the character strings in the file give rise to factor variables.
- Get the dimensions of `auto` and show the beginning of the data.
- Retrieve the fifth observation of `horsepower` in two different ways.
- Compute the levels of `name`.
```
### Vector-related functions {#logreg-R-vector}
```{r, echo = TRUE, collapse = TRUE, error = TRUE, cache = TRUE}
# The function seq creates sequences of numbers equally separated
seq(0, 1, by = 0.1)
seq(0, 1, length.out = 5)
# You can shorten the latter argument
seq(0, 1, l = 5)
# Repeat a number
rep(0, 5)
# Reverse a vector
myVec <- c(1:5, -1:3)
rev(myVec)
# Another way
myVec[length(myVec):1]
# Count repetitions in your data
table(iris$Sepal.Length)
table(iris$Species)
```
```{block, type = 'rmdexercise'}
Do the following:
- Create the vector $x=(0.3, 0.6, 0.9, 1.2)$.
- Create a vector of length 100 ranging from $0$ to $1$ with entries equally separated.
- Compute the number of zeros and ones in `x <- c(0, 0, 1, 0, 1, 0, 0, 1, 0, 1, 0)`. Check that they are the same as in `rev(x)`.
- Compute the vector $(0.1, 1.1, 2.1, ..., 100.1)$ in four different ways using `seq` and `rev`. Do the same but using `:` instead of `seq`. (Hint: add `0.1`)
```
### Logical conditions and subsetting {#logreg-R-logic}
```{r, echo = TRUE, collapse = TRUE, error = TRUE, cache = TRUE}
# Relational operators: x < y, x > y, x <= y, x >= y, x == y, x!= y
# They return TRUE or FALSE
# Smaller than
0 < 1
# Greater than
1 > 1
# Greater or equal to
1 >= 1 # Remember: ">=" and not "=>"!
# Smaller or equal to
2 <= 1 # Remember: "<=" and not "=<"!
# Equal
1 == 1 # Tests equality. Remember: "==" and not "="!
# Unequal
1 != 0 # Tests inequality
# TRUE is encoded as 1 and FALSE as 0
TRUE + 1
FALSE + 1
# In a vector-like fashion
x <- 1:5
y <- c(0, 3, 1, 5, 2)
x < y
x == y
x != y
# Subsetting of vectors
x
x[x >= 2]
x[x < 3]
# Easy way of working with parts of the data
data <- data.frame(x = c(0, 1, 3, 3, 0), y = 1:5)
data
# Data such that x is zero
data0 <- data[data$x == 0, ]
data0
# Data such that x is larger than 2
data2 <- data[data$x > 2, ]
data2
# In an example
iris$Sepal.Width[iris$Sepal.Width > 3]
# Problem - what happened?
data[x > 2, ]
# In an example
summary(iris)
summary(iris[iris$Sepal.Width > 3, ])
# On a factor variable, only == and != make sense
summary(iris[iris$Species == "setosa", ])
# Subset argument in lm
lm(Sepal.Width ~ Petal.Length, data = iris, subset = Sepal.Width > 3)
lm(Sepal.Width ~ Petal.Length, data = iris, subset = iris$Sepal.Width > 3)
# Both iris$Sepal.Width and Sepal.Width in subset are fine: data = iris
# tells R to look for Sepal.Width in the iris dataset
# Same thing for the subset field in R Commander's menus
# AND operator &
TRUE & TRUE
TRUE & FALSE
FALSE & FALSE
# OR operator |
TRUE | TRUE
TRUE | FALSE
FALSE | FALSE
# Both operators are useful for checking for ranges of data
y
index1 <- (y <= 3) & (y > 0)
y[index1]
index2 <- (y < 2) | (y > 4)
y[index2]
# In an example
summary(iris[iris$Sepal.Width > 3 & iris$Sepal.Width < 3.5, ])
```
```{block, type = 'rmdexercise'}
Do the following for the `iris` dataset:
- Compute the subset corresponding to `Petal.Length` either smaller than `1.5` or larger than `2`. Save this dataset as `irisPetal`.
- Compute and summarize a linear regression of `Sepal.Width` on `Petal.Width + Petal.Length` for the dataset `irisPetal`. What is the $R^2$? (Solution: `0.101`)
- Check that the previous model is the same as regressing `Sepal.Width` on `Petal.Width + Petal.Length` for the dataset `iris` with the appropriate `subset` expression.
- Compute the variance of `Petal.Width` when `Petal.Width` is smaller than or equal to `1.5` and larger than `0.3`. (Solution: `0.1266541`)
```
### Plotting functions {#logreg-R-plot}
```{r, echo = TRUE, collapse = TRUE, error = TRUE, cache = TRUE}
# plot is the main function for plotting in R
# It has a different behavior depending on the kind of object that it receives
# For example, for a regression model, it produces diagnostic plots
mod <- lm(Sepal.Width ~ Sepal.Length, data = iris)
plot(mod, 1)
# How to plot some data
plot(iris$Sepal.Length, iris$Sepal.Width, main = "Sepal.Length vs Sepal.Width")
# Change the axis limits
plot(iris$Sepal.Length, iris$Sepal.Width, xlim = c(0, 10), ylim = c(0, 10))
# How to plot a curve (a parabola)
x <- seq(-1, 1, l = 50)
y <- x^2
plot(x, y)
plot(x, y, main = "A dotted parabola")
plot(x, y, main = "A parabola", type = "l")
plot(x, y, main = "A red and thick parabola", type = "l", col = "red", lwd = 3)
# Plotting a more complicated curve between -pi and pi
x <- seq(-pi, pi, l = 50)
y <- (2 + sin(10 * x)) * x^2
plot(x, y, type = "l") # Kind of rough...
# More detailed plot
x <- seq(-pi, pi, l = 500)
y <- (2 + sin(10 * x)) * x^2
plot(x, y, type = "l")
# Remember that we are joining points for creating a curve!
# For more options in the plot customization see
?plot
?par
# plot is a first level plotting function. That means that whenever it is called,
# it creates a new plot. If we want to add information to an existing plot, we
# have to use a second level plotting function such as points, lines or abline
plot(x, y) # Create a plot
lines(x, x^2, col = "red") # Add lines
points(x, y + 10, col = "blue") # Add points
abline(a = 5, b = 1, col = "orange", lwd = 2) # Add a straight line y = a + b * x
```
### Distributions {#logreg-R-distribs}
The operations on distributions described here are implemented in `R Commander` through the menu `'Distributions'`, but it is convenient for you to grasp how they work.
```{r, echo = TRUE, collapse = TRUE, error = TRUE, cache = TRUE}
# R allows to sample [r], compute density/probability mass [d],
# compute distribution function [p] and compute quantiles [q] for several
# continuous and discrete distributions. The format employed is [rdpq]name,
# where name stands for:
# - norm -> Normal
# - unif -> Uniform
# - exp -> Exponential
# - t -> Student's t
# - f -> Snedecor's F
# - chisq -> Chi squared
# - pois -> Poisson
# - binom -> Binomial
# More distributions:
?Distributions
# Sampling from a Normal - 10 random points from a N(0, 1)
rnorm(n = 10, mean = 0, sd = 1)
# If you want to have always the same result, set the seed of the random number
# generator
set.seed(45678)
rnorm(n = 10, mean = 0, sd = 1)
# Plotting the density of a N(0, 1) - the Gauss bell
x <- seq(-4, 4, l = 100)
y <- dnorm(x = x, mean = 0, sd = 1)
plot(x, y, type = "l")
# Plotting the distribution function of a N(0, 1)
x <- seq(-4, 4, l = 100)
y <- pnorm(q = x, mean = 0, sd = 1)
plot(x, y, type = "l")
# Computing the 95% quantile for a N(0, 1)
qnorm(p = 0.95, mean = 0, sd = 1)
# All distributions have the same syntax: rname(n,...), dname(x,...), pname(q,...)
# and qname(p,...), but the parameters in ... change. Look them up in ?Distributions
# For example, here is the same for the uniform distribution
# Sampling from a U(0, 1)
set.seed(45678)
runif(n = 10, min = 0, max = 1)
# Plotting the density of a U(0, 1)
x <- seq(-2, 2, l = 100)
y <- dunif(x = x, min = 0, max = 1)
plot(x, y, type = "l")
# Computing the 95% quantile for a U(0, 1)
qunif(p = 0.95, min = 0, max = 1)
# Sampling from a Bi(10, 0.5)
set.seed(45678)
samp <- rbinom(n = 200, size = 10, prob = 0.5)
table(samp) / 200
# Plotting the probability mass of a Bi(10, 0.5)
x <- 0:10
y <- dbinom(x = x, size = 10, prob = 0.5)
plot(x, y, type = "h") # Vertical bars
# Plotting the distribution function of a Bi(10, 0.5)
x <- 0:10
y <- pbinom(q = x, size = 10, prob = 0.5)
plot(x, y, type = "h")
```
```{block, type = 'rmdexercise'}
Do the following:
- Compute the 90%, 95% and 99% quantiles of a $F$ distribution with `df1 = 1` and `df2 = 5`. (Answer: `c(4.060420, 6.607891, 16.258177)`)
- Plot the distribution function of a $U(0,1)$. Does it make sense with its density function?
- Sample 100 points from a Poisson with `lambda = 5`.
- Sample 100 points from a $U(-1,1)$ and compute its mean.
- Plot the density of a $t$ distribution with `df = 1` (use a sequence spanning from `-4` to `4`). Add lines of different colors with the densities for `df = 5`, `df = 10`, `df = 50` and `df = 100`. Do you see any pattern?
```
### Defining functions {#logreg-R-funcs}
```{r, echo = TRUE, collapse = TRUE, error = TRUE, cache = TRUE}
# A function is a way of encapsulating a block of code so it can be reused easily
# They are useful for simplifying repetitive tasks and organize the analysis
# For example, in Section 3.7 we had to make use of simpleAnova for computing
# the simple ANOVA table in multiple regression.
# This is a silly function that takes x and y and returns their sum
add <- function(x, y) {
x + y
}
# Calling add - you need to run the definition of the function first!
add(1, 1)
add(x = 1, y = 2)
# A more complex function: computes a linear model and then its summary.
# Saves us a few keystrokes when computing a lm and a summary
lmSummary <- function(formula, data) {
model <- lm(formula = formula, data = data)
summary(model)
}
# Usage
lmSummary(Sepal.Length ~ Petal.Width, iris)
# Recall: there is no variable called model in the workspace.
# The function works on its own workspace!
model
# Add a line to a plot
addLine <- function(x, beta0, beta1) {
lines(x, beta0 + beta1 * x, lwd = 2, col = 2)
}
# Usage
plot(x, y)
addLine(x, beta0 = 0.1, beta1 = 0)
```
## Examples and applications {#logreg-examps}
### Case study: *The Challenger disaster* {#logreg-examps-challenger}
The *Challenger* disaster occurred on the 28th of January of 1986, when the NASA Space Shuttle orbiter *Challenger* broke apart and disintegrated 73 seconds into its flight, leading to the deaths of its seven crew members. The accident deeply shocked US society, partly due to the attention the mission had received because of the presence of Christa McAuliffe, who would have been the first astronaut-teacher. Because of this, NASA TV broadcast the launch live to US public schools, which resulted in millions of schoolchildren witnessing the accident. The accident had serious consequences for NASA's credibility and resulted in an interruption of 32 months in the shuttle program. The Presidential *Rogers Commission* (formed by astronaut Neil A. Armstrong and Nobel laureate Richard P. Feynman, among others) was created to investigate the disaster.
```{r, video3, echo = FALSE, fig.align = 'center', fig.pos = 'h!', fig.cap = 'Challenger launch and subsequent explosion, as broadcast live by NBC on 28/01/1986.', screenshot.opts = list(delay = 15), dev = 'png', cache = TRUE}
knitr::include_url("https://www.youtube.com/embed/fSTrmJtHLFU")
```
The Rogers Commission elaborated a report [@Roberts1986] with all the findings. The commission determined that the disintegration began with the **failure of an O-ring seal in the solid rocket motor, due to the unusually cold temperature (-0.6 Celsius degrees) during the launch**. This failure produced a breach of burning gas through the solid rocket motor that compromised the whole shuttle structure, resulting in its disintegration due to the extreme aerodynamic forces. The **problem with the O-rings was already known**: the night before the launch, there was a three-hour teleconference between motor engineers and NASA management, discussing the effect of the low temperature forecast for the launch on the O-ring performance. The conclusion, influenced by Figure \@ref(fig:rogerts)a, was:
> "Temperature data [are] not conclusive on predicting primary O-ring blowby."
(ref:rogertstitle) Number of incidents in the O-rings (field joints) versus temperature. Panel *a* includes only flights with incidents. Panel *b* contains all flights (with and without incidents).
```{r, rogerts, echo = FALSE, out.width = '70%', fig.align = 'center', fig.pos = 'h!', fig.cap = '(ref:rogertstitle)', fig.show = 'hold', cache = TRUE}
knitr::include_graphics("images/figures/challenger.png")
```
The Rogers Commission noted a major flaw in Figure \@ref(fig:rogerts)a: the **flights with zero incidents were excluded** from the plot because *it was felt* that **these flights did not contribute any information about the temperature effect** (Figure \@ref(fig:rogerts)b). The Rogers Commission concluded:
> "A careful analysis of the flight history of O-ring performance would have revealed the correlation of O-ring damage in low temperature".
The purpose of this case study, inspired by @Dalal1989, is to quantify the influence of the temperature on the probability of having at least one incident related to the O-rings. Specifically, we want to address the following questions:
- Q1. *Is the temperature associated with O-ring incidents?*
- Q2. *In which way did the temperature affect the probability of O-ring incidents?*
- Q3. *What was the predicted probability of an incident in an O-ring for the temperature of the launch day?*
To try to answer these questions we have the `challenger` dataset ([download](https://raw.githubusercontent.com/egarpor/SSS2-UC3M/master/datasets/challenger.txt)). The dataset, shown in Table \@ref(tab:challengertable), contains information regarding the state of the solid rocket boosters after launch^[After the shuttle exits the atmosphere, the solid rocket boosters separate and descend to land using a parachute, where they are carefully analyzed.] for 23 flights. Each row contains, among others, the following variables:
- `fail.field`, `fail.nozzle`: binary variables indicating whether there was an incident with the O-rings in the field joints or in the nozzles of the solid rocket boosters. `1` codifies an incident and `0` its absence. In the analysis, we focus on the O-rings of the field joints, as they were the most determinant for the accident.
- `temp`: temperature on the day of launch, measured in Celsius degrees.
- `pres.field`, `pres.nozzle`: leak-check pressure tests of the O-rings. These tests assured that the rings would seal the joint.
(ref:challengertabletitle) The `challenger` dataset.
```{r, challengertable, echo = FALSE, out.width = '90%', fig.align = 'center', fig.pos = 'h!', cache = TRUE}
challenger <- read.table(file = "datasets/challenger.txt", header = TRUE, sep = "\t")
knitr::kable(
challenger[, c(1, 2, 5:7)],
booktabs = TRUE,
longtable = TRUE,
caption = '(ref:challengertabletitle)'
)
```
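If you want to work with the dataset directly in the `R` console, a minimal import could look like the following sketch (the path `"challenger.txt"` is an assumption -- adapt it to wherever you saved the downloaded file):
```{r, eval = FALSE}
# Import the dataset: a tab-separated text file with a header row
challenger <- read.table(file = "challenger.txt", header = TRUE, sep = "\t")
# Quick inspection of the imported data frame
str(challenger)
head(challenger)
```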
Let's begin the analysis by replicating Figures \@ref(fig:rogerts)a and \@ref(fig:rogerts)b and checking that linear regression is not the right tool for answering Q1--Q3. For that, we make two scatterplots of `nfails.field` (number of total incidents in the field joints) versus `temp`, the first one excluding the launches without incidents (`subset = nfails.field > 0`) and the second one for all the data. Doing it through `R Commander` as we saw in Chapter \@ref(simplin), you should get something similar to:
```{r, echo = FALSE, message = FALSE, warning = FALSE, cache = TRUE}
library(car)
```
```{r, collapse = TRUE, cache = TRUE}
scatterplot(nfails.field ~ temp, reg.line = lm, smooth = FALSE, spread = FALSE,
boxplots = FALSE, data = challenger, subset = nfails.field > 0)
scatterplot(nfails.field ~ temp, reg.line = lm, smooth = FALSE, spread = FALSE,
boxplots = FALSE, data = challenger)
```
There is a fundamental problem in using linear regression for these data: **the response is not continuous**. As a consequence, there is no linearity and the errors around the mean are not normal (indeed, they are strongly non-normal). Let's check this with the corresponding diagnostic plots:
```{r, collapse = TRUE, cache = TRUE}
mod <- lm(nfails.field ~ temp, data = challenger)
par(mfrow = 1:2)
plot(mod, 1)
plot(mod, 2)
```
Although linear regression is not an adequate tool for these data, it is able to detect the obvious difference between the two plots:
1. **The trend for launches with incidents is flat, hence suggesting there is no dependence on the temperature** (Figure \@ref(fig:rogerts)a). This was one of the arguments behind NASA's decision to launch the rocket at a temperature of -0.6 degrees.
2. However, **the trend for *all* launches indicates a clear negative dependence between temperature and number of incidents!** (Figure \@ref(fig:rogerts)b). Think about it this way: the minimum temperature ever recorded for a launch without incidents was above 18 degrees, and the Challenger was launched at -0.6 degrees without clear knowledge of the effects of such low temperatures.
Instead of trying to predict the number of incidents, we will concentrate on modeling the *probability of expecting at least one incident given the temperature*, a simpler but also revealing approach. In other words, we aim to estimate the following curve:
$$
p(x)=\mathbb{P}(\text{incident}=1|\text{temperature}=x)
$$
from `fail.field` and `temp`. This probability cannot be properly modeled as a linear function like $\beta_0+\beta_1x$, since it will inevitably fall outside $[0,1]$ for some values of $x$ (yielding negative probabilities or probabilities larger than one). The technique that solves this problem is **logistic regression**. The idea behind it is quite simple: transform the linear model $\beta_0+\beta_1x$ -- which is aimed at a response in $\mathbb{R}$ -- so that it yields a value in $[0,1]$. This is achieved by the *logistic function*
\begin{align}
\text{logistic}(t)=\frac{e^t}{1+e^t}=\frac{1}{1+e^{-t}}.(\#eq:log)
\end{align}
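To get a feel for \@ref(eq:log), we can plot it with the plotting functions from Section \@ref(logreg-R-plot) (a quick sketch):
```{r, collapse = TRUE, cache = TRUE}
# The logistic function maps R into (0, 1), takes the value 1/2 at t = 0 and
# approaches 0 and 1 as t goes to -infinity and +infinity, respectively
t <- seq(-6, 6, l = 200)
plot(t, 1 / (1 + exp(-t)), type = "l", ylab = "logistic(t)")
abline(h = c(0, 0.5, 1), lty = 2) # Horizontal reference lines
```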
The logistic model in this case is
$$
\mathbb{P}(\text{incident}=1|\text{temperature}=x)=\text{logistic}\left(\beta_0+\beta_1x\right)=\frac{1}{1+e^{-(\beta_0+\beta_1x)}},
$$
with $\beta_0$ and $\beta_1$ unknown. Let's fit the model to the data by estimating $\beta_0$ and $\beta_1$.
In order to fit a logistic regression to the data, go to `'Statistics' -> 'Fit models' -> 'Generalized linear model...'`. A window like the one in Figure \@ref(fig:glm) will pop up, which you should fill in as indicated.
```{r, glm, echo = FALSE, out.width = '70%', fig.align = 'center', fig.pos = 'h!', fig.cap = 'Window for performing logistic regression.', cache = TRUE}
knitr::include_graphics("images/screenshots/glm.png")
```
A code like this will be generated:
```{r, collapse = TRUE, cache = TRUE}
nasa <- glm(fail.field ~ temp, family = "binomial", data = challenger)
summary(nasa)
exp(coef(nasa)) # Exponentiated coefficients ("odds ratios")
```
The summary of the logistic model is notably different from that of linear regression, as the methodology behind it is quite different. Nevertheless, we have tests for the significance of each coefficient. Here we obtain that `temp` is significantly different from zero, at least at level $\alpha=0.05$. Therefore we can conclude that **the temperature is indeed affecting the probability of an incident with the O-rings** (this answers Q1).
The precise interpretation of the coefficients will be given in the next section. For now, the coefficient of `temp`, $\hat\beta_1$, can be regarded as the "correlation between the temperature and the probability of having at least one incident". This correlation, as evidenced by the sign of $\hat\beta_1$, is negative. Let's plot the fitted logistic curve to see that the probability of an incident and the temperature are indeed negatively correlated:
```{r, collapse = TRUE, cache = TRUE}
# Plot data
plot(challenger$temp, challenger$fail.field, xlim = c(-1, 30), xlab = "Temperature",
ylab = "Incident probability")
# Draw the fitted logistic curve
x <- seq(-1, 30, l = 200)
y <- exp(-(nasa$coefficients[1] + nasa$coefficients[2] * x))
y <- 1 / (1 + y)
lines(x, y, col = 2, lwd = 2)
# The Challenger
points(-0.6, 1, pch = 16)
text(-0.6, 1, labels = "Challenger", pos = 4)
```
In view of this curve and the summary of the model, we can conclude that **lower temperatures increased the probability of an O-ring incident** (this answers Q2). Indeed, the confidence intervals for the coefficients show a significantly negative effect of the temperature at level $\alpha=0.05$:
```{r, collapse = TRUE, message = FALSE, cache = TRUE}
confint(nasa, level = 0.95)
```
Finally, **the probability of having at least one incident with the O-rings in the launch day was $0.9996$ according to the fitted logistic model** (Q3). This is easily obtained:
```{r, collapse = TRUE, cache = TRUE}
predict(nasa, newdata = data.frame(temp = -0.6), type = "response")
```
Be aware that `type = "response"` has a different meaning in logistic regression than in linear regression. As you can see, it does not return a CI for the prediction as in linear models. Instead, `type = "response"` means that the *probability* should be returned, rather than the value of the link function, which is returned with `type = "link"` (the default).
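A quick way to check the relation between the two scales is the following sketch (`plogis` is `R`'s logistic distribution function, so it implements \@ref(eq:log)):
```{r, collapse = TRUE, cache = TRUE}
# Prediction in the link (log-odds) scale...
eta <- predict(nasa, newdata = data.frame(temp = -0.6), type = "link")
eta
# ...transformed by the logistic function, gives the probability scale
plogis(eta) # Same as predict(..., type = "response")
```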
Recall that there is a serious problem of **extrapolation** in the prediction, which makes it less precise (or more variable). But this extrapolation, together with the evidence raised by a simple analysis like the one we just did, should have been a strong argument for postponing the launch.
To conclude this section, we refer to a funny and comprehensive exposition by Juan Cuesta (University of Cantabria) on the flawed statistical analysis that contributed to the Challenger disaster.
```{r, video4, echo = FALSE, fig.align = 'center', fig.pos = 'h!', fig.cap = 'The Challenger disaster and other elegant applications of statistics in complex problems.', screenshot.opts = list(delay = 15), dev = 'png', cache = TRUE}
knitr::include_url("https://www.youtube.com/embed/MOeDbQlF5Tw?start=552")
```
## Model formulation and estimation by maximum likelihood {#logreg-model}
As we saw in Section \@ref(multlin-model), the multiple linear model described the relation between the random variables $X_1,\ldots,X_k$ and $Y$ by assuming the linear relation
\begin{align*}
Y = \beta_0 + \beta_1 X_1 + \ldots + \beta_k X_k + \varepsilon.
\end{align*}
Since we assume $\mathbb{E}[\varepsilon|X_1=x_1,\ldots,X_k=x_k]=0$, the previous equation was equivalent to
\begin{align}
\mathbb{E}[Y|X_1=x_1,\ldots,X_k=x_k]=\beta_0+\beta_1x_1+\ldots+\beta_kx_k,(\#eq:eq-lm)
\end{align}
where $\mathbb{E}[Y|X_1=x_1,\ldots,X_k=x_k]$ is the mean of $Y$ for a particular value of the set of predictors. As remarked in Section \@ref(multlin-assumps), it was a *necessary condition that $Y$ was continuous in order to satisfy the normality of the errors*, hence the linear model assumptions. In other words, **the linear model is designed for a continuous response**.
The situation when $Y$ is *discrete* (naturally ordered values; e.g. number of failures, number of students) or *categorical* (non-ordered categories; e.g. territorial divisions, ethnic groups) requires special treatment. The simplest situation is when $Y$ is *binary* (or *dichotomous*): it can only take two values, codified for convenience as $1$ (success) and $0$ (failure). For example, in the Challenger case study we used `fail.field` as an *indicator* of whether "there was at least one incident with the O-rings" (`1` = yes, `0` = no). For binary variables there is no fundamental distinction between the treatment of discrete and categorical variables.
More formally, a binary variable is known as a *Bernoulli variable*, which is the simplest non-trivial random variable. We say that $Y\sim\mathrm{Ber}(p)$, $0\leq p\leq1$, if
\[
Y=\left\{\begin{array}{ll}1,&\text{with probability }p,\\0,&\text{with probability }1-p,\end{array}\right.
\]
or, equivalently, if $\mathbb{P}[Y=1]=p$ and $\mathbb{P}[Y=0]=1-p$, which can be written compactly as
\begin{align}
\mathbb{P}[Y=y]=p^y(1-p)^{1-y},\quad y=0,1.(\#eq:ber)
\end{align}
Recall that a *binomial variable with size $n$ and probability $p$*, $\mathrm{Bi}(n,p)$, is obtained by summing $n$ independent $\mathrm{Ber}(p)$ variables (so $\mathrm{Ber}(p)$ is the same as $\mathrm{Bi}(1,p)$). This is why we use `family = "binomial"` in `glm`: to indicate that the response is binomial.
```{block, type = 'rmdinsight'}
A Bernoulli variable $Y$ is completely determined by the probability $p$. **So are its mean and variance**:
- $\mathbb{E}[Y]=p\times1+(1-p)\times0=p$
- $\mathbb{V}\mathrm{ar}[Y]=p(1-p)$
In particular, recall that $\mathbb{P}[Y=1]=\mathbb{E}[Y]=p$.
This is something relatively uncommon (in a $\mathcal{N}(\mu,\sigma^2)$, $\mu$ determines the mean and $\sigma^2$ the variance) that has important consequences for the logistic model: we do not need a $\sigma^2$.
```
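A quick simulation serves as a sanity check for these formulas (a sketch; the choice $p=0.3$ is arbitrary):
```{r, collapse = TRUE, cache = TRUE}
# Simulate a large sample of Ber(0.3) and compare the empirical mean and
# variance with p and p * (1 - p)
set.seed(12345)
y <- rbinom(n = 1e5, size = 1, prob = 0.3)
mean(y) # Should be close to p = 0.3
var(y) # Should be close to p * (1 - p) = 0.21
```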
```{block, type = 'rmdexercise'}
Are these Bernoulli variables? If so, which is the value of $p$ and what could the codes $0$ and $1$ represent?
- The toss of a fair coin.
- A variable with mean $p$ and variance $p(1-p)$.
- The roll of a die.
- A binary variable with mean $0.5$ and variance $0.45$.
- The winner of an election with two candidates.
```
Assume then that $Y$ is a binary/Bernoulli variable and that $X_1,\ldots,X_k$ are predictors associated to $Y$ (no particular assumptions on them). The purpose in *logistic regression* is to estimate
\[
p(x_1,\ldots,x_k)=\mathbb{P}[Y=1|X_1=x_1,\ldots,X_k=x_k]=\mathbb{E}[Y|X_1=x_1,\ldots,X_k=x_k],
\]
that is, how the probability of $Y=1$ changes according to particular values, denoted by $x_1,\ldots,x_k$, of the random variables $X_1,\ldots,X_k$. $p(x_1,\ldots,x_k)=\mathbb{P}[Y=1|X_1=x_1,\ldots,X_k=x_k]$ stands for the *conditional probability of $Y=1$ given $X_1,\ldots,X_k$*. In view of \@ref(eq:eq-lm), a tempting possibility is to consider the model
\[
p(x_1,\ldots,x_k)=\beta_0+\beta_1x_1+\ldots+\beta_kx_k.
\]
However, such a model will inevitably run into serious problems: negative probabilities and probabilities larger than one. A solution is to consider a function that takes the value $z=\beta_0+\beta_1x_1+\ldots+\beta_kx_k$, in $\mathbb{R}$, and maps it back to $[0,1]$. There are several alternatives to do so, based on distribution functions $F:\mathbb{R}\longrightarrow[0,1]$ that deliver $y=F(z)\in[0,1]$ (see Figure \@ref(fig:logitprobit)). Different choices of $F$ give rise to different models:
- **Uniform**. Truncate $z$ to $0$ and $1$ when $z<0$ and $z>1$, respectively.
- **Logit**. Consider the *logistic distribution function*:
\[
F(z)=\mathrm{logistic}(z)=\frac{e^z}{1+e^z}=\frac{1}{1+e^{-z}}.
\]
- **Probit**. Consider the *normal* distribution function, that is, $F=\Phi$.
(ref:logitprobittitle) Different transformations mapping the response of a simple linear regression $z=\beta_0+\beta_1x$ to $[0,1]$.
```{r, logitprobit, echo = FALSE, results = 'hide', out.width = '70%', fig.show = 'hold', fig.asp = 1, fig.align = 'center', fig.pos = 'h!', fig.cap = '(ref:logitprobittitle)', cache = TRUE}
# Regression
x <- seq(-3, 3, l = 200)
z <- 2 * x
# Truncation
z1 <- z
z1[z1 > 1] <- 1
z1[z1 < 0] <- 0
# Logit
z2 <- 1 / (1 + exp(-z))
# Probit
z3 <- pnorm(z)
# Comparison
plot(x, z, type = "l", ylim = c(-0.1, 1.1), ylab = "Probability", lwd = 2)
lines(x, z1, col = "blue", lwd = 2)
lines(x, z2, col = "red", lwd = 2)
lines(x, z3, col = "green", lwd = 2)
legend("topleft", legend = c("Linear regression", "Uniform", "Logit", "Probit"), lwd = 2,
col = c("black", "blue", "red", "green"))
```
The **logistic** transformation is the most employed due to its **tractability, interpretability and smoothness**. Its inverse, $F^{-1}:[0,1]\longrightarrow\mathbb{R}$, known as the *logit function*, is
\[
\mathrm{logit}(p)=\mathrm{logistic}^{-1}(p)=\log\frac{p}{1-p}.
\]
This is a *link function*, that is, a function that maps a given space (in this case $[0,1]$) into $\mathbb{R}$. The term link function is employed in *generalized linear models*, which follow exactly the same philosophy as logistic regression -- mapping the domain of $Y$ to $\mathbb{R}$ in order to apply a linear model there. We will concentrate here exclusively on the logit as the link function. Therefore, the *logistic model* is
\begin{align}
p(x_1,\ldots,x_k)=\mathrm{logistic}(\beta_0+\beta_1x_1+\ldots+\beta_kx_k)=\frac{1}{1+e^{-(\beta_0+\beta_1x_1+\ldots+\beta_kx_k)}}.(\#eq:eq-log)
\end{align}
The linear form inside the exponent has a clear interpretation:
- If $\beta_0+\beta_1x_1+\ldots+\beta_kx_k=0$, then $p(x_1,\ldots,x_k)=\frac{1}{2}$ ($Y=1$ and $Y=0$ are equally likely).
- If $\beta_0+\beta_1x_1+\ldots+\beta_kx_k<0$, then $p(x_1,\ldots,x_k)<\frac{1}{2}$ ($Y=1$ less likely).
- If $\beta_0+\beta_1x_1+\ldots+\beta_kx_k>0$, then $p(x_1,\ldots,x_k)>\frac{1}{2}$ ($Y=1$ more likely).
To be more precise about the interpretation of the coefficients $\beta_0,\ldots,\beta_k$ we need to introduce the *odds*. The **odds is an equivalent way of expressing the distribution of probabilities in a binary variable**. Since $\mathbb{P}[Y=1]=p$ and $\mathbb{P}[Y=0]=1-p$, both the success and failure probabilities can be inferred from $p$. Instead of using $p$ to characterize the distribution of $Y$, we can use
\begin{align}
\mathrm{odds}(Y)=\frac{p}{1-p}=\frac{\mathbb{P}[Y=1]}{\mathbb{P}[Y=0]}.(\#eq:eq-odds)
\end{align}
The odds is the *ratio between the probability of success and the probability of failure*. It is extensively used in betting^[Recall that the result of a bet is binary: you either win or lose the bet.] due to its better interpretability. For example, if a horse $Y$ has a probability $p=2/3$ of winning a race ($Y=1$), then the odds of the horse is
\[
\text{odds}=\frac{p}{1-p}=\frac{2/3}{1/3}=2.
\]
This means that the horse has a *probability of winning that is twice as large as the probability of losing*. This is sometimes written as $2:1$ or $2 \times 1$ (read "two-to-one"). Conversely, if the odds of $Y$ is given, we can easily recover the probability of success $p$ using the inverse of \@ref(eq:eq-odds):
\[
p=\mathbb{P}[Y=1]=\frac{\text{odds}(Y)}{1+\text{odds}(Y)}.
\]
For example, if the odds of the horse were $5$, that would correspond to a probability of winning $p=5/6$.
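These back-and-forth transformations between $p$ and the odds are easy to check numerically (a quick sketch):
```{r, collapse = TRUE, cache = TRUE}
# From probability to odds
p <- 2/3
p / (1 - p) # Odds equal to 2
# From odds back to probability
odds <- 5
odds / (1 + odds) # p = 5/6
```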
```{block, type ='rmdinsight'}
Recall that the odds is a number in $[0,+\infty]$. The $0$ and $+\infty$ values are attained for $p=0$ and $p=1$, respectively. The log-odds (or logit) is a number in $[-\infty,+\infty]$.
```
We can rewrite \@ref(eq:eq-log) in terms of the odds \@ref(eq:eq-odds). If we do so, we have:
\begin{align}
\mathrm{odds}(Y|X_1=x_1,\ldots,X_k=x_k)&=\frac{p(x_1,\ldots,x_k)}{1-p(x_1,\ldots,x_k)}\nonumber\\
&=e^{\beta_0+\beta_1x_1+\ldots+\beta_kx_k}\nonumber\\
&=e^{\beta_0}e^{\beta_1x_1}\ldots e^{\beta_kx_k}(\#eq:eq-odds-log1)
\end{align}
or, taking logarithms, the *log-odds* (or logit)
\begin{align}
\log(\mathrm{odds}(Y|X_1=x_1,\ldots,X_k=x_k))=\beta_0+\beta_1x_1+\ldots+\beta_kx_k.(\#eq:eq-odds-log2)
\end{align}
The conditional log-odds \@ref(eq:eq-odds-log2) plays here the role of the conditional mean for multiple linear regression. Therefore, we have an analogous interpretation for the coefficients:
- $\beta_0$: is the log-odds when $X_1=\ldots=X_k=0$.
- $\beta_j$, $1\leq j\leq k$: is the **additive** increment of the log-odds for an increment of one unit in $X_j=x_j$, provided that the remaining variables $X_1,\ldots,X_{j-1},X_{j+1},\ldots,X_k$ *do not change*.
The log-odds is not so easy to interpret as the odds. For that reason, an equivalent way of interpreting the coefficients, this time based on \@ref(eq:eq-odds-log1), is:
- $e^{\beta_0}$: is the odds when $X_1=\ldots=X_k=0$.
- $e^{\beta_j}$, $1\leq j\leq k$: is the **multiplicative** increment of the odds for an increment of one unit in $X_j=x_j$, provided that the remaining variables $X_1,\ldots,X_{j-1},X_{j+1},\ldots,X_k$ *do not change*. If the increment in $X_j$ is of $r$ units, then the multiplicative increment in the odds is $(e^{\beta_j})^r$.
As a consequence of this last interpretation, we have:
```{block2, type = 'rmdinsight'}
If $\beta_j>0$ (respectively, $\beta_j<0$) then $e^{\beta_j}>1$ ($e^{\beta_j}<1$) in \@ref(eq:eq-odds-log1). Therefore, an increment of one unit in $X_j$, provided that the remaining variables $X_1,\ldots,X_{j-1},X_{j+1},\ldots,X_k$ do not change, results in an increment (decrement) of the odds, this is, in an increment (decrement) of $\mathbb{P}[Y=1]$.
```
```{block, type = 'rmdcaution'}
Since the relationship between $p(X_1,\ldots,X_k)$ and $X_1,\ldots,X_k$ is not linear, **$\beta_j$ does not correspond to the change in $p(X_1,\ldots,X_k)$ associated with a one-unit increase in $X_j$**.
```
Let's quickly visualize these concepts with the output of the Challenger case study:
```{r, collapse = TRUE, cache = TRUE}
nasa <- glm(fail.field ~ temp, family = "binomial", data = challenger)
summary(nasa)
exp(coef(nasa)) # Exponentiated coefficients ("odds ratios")
# Plot data
plot(challenger$temp, challenger$fail.field, xlim = c(-1, 30), xlab = "Temperature",
ylab = "Incident probability")
# Draw the fitted logistic curve
x <- seq(-1, 30, l = 200)
y <- exp(-(nasa$coefficients[1] + nasa$coefficients[2] * x))
y <- 1 / (1 + y)
lines(x, y, col = 2, lwd = 2)
# The Challenger
points(-0.6, 1, pch = 16)
text(-0.6, 1, labels = "Challenger", pos = 4)
```
The exponentials of the estimated coefficients are:
- $e^{\hat\beta_0}=1965.974$. This means that, *when the temperature is zero*, the fitted odds is $1965.974$, so the probability of having an incident ($Y=1$) is $1965.974$ times larger than the probability of not having an incident ($Y=0$). In other words, the probability of having an incident at temperature zero is $\frac{1965.974}{1965.974+1}=0.999$.
- $e^{\hat\beta_1}=0.659$. This means that each Celsius degree increment in the temperature multiplies the fitted odds by a factor of $0.659\approx\frac{2}{3}$, hence reducing it.
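Both quantities follow directly from `exp(coef(nasa))` and can be double-checked in `R` (a sketch, assuming `nasa` is the fit obtained above):
```{r, collapse = TRUE, cache = TRUE}
# Fitted odds at temp = 0 and the corresponding incident probability
oddsZero <- exp(coef(nasa))[1]
oddsZero / (1 + oddsZero) # Approximately 0.999
# Multiplicative factor in the odds per Celsius degree increment
exp(coef(nasa))[2]
```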
The estimation of $\boldsymbol{\beta}=(\beta_0,\beta_1,\ldots,\beta_k)$ from a sample $(\mathbf{X}_{1},Y_1),\ldots,(\mathbf{X}_{n},Y_n)$ is different than in linear regression. It is not based on minimizing the RSS but on the principle of *Maximum Likelihood Estimation* (MLE). MLE is based on the following *leitmotiv*: **which coefficients $\boldsymbol{\beta}$ make the sample most likely?** Or, in other words, **which coefficients make the model most probable, given the sample?** Since $Y_i\sim \mathrm{Ber}(p(\mathbf{X}_i))$, $i=1,\ldots,n$, the likelihood of $\boldsymbol{\beta}$ is
\begin{align}
\text{lik}(\boldsymbol{\beta})=\prod_{i=1}^np(\mathbf{X}_i)^{Y_i}(1-p(\mathbf{X}_i))^{1-Y_i}.(\#eq:eq-lik)
\end{align}
$\text{lik}(\boldsymbol{\beta})$ is the **probability of the data based on the model**. Therefore, it is a number between $0$ and $1$. Its detailed interpretation is the following:
- $\prod_{i=1}^n$ appears because the sample elements are assumed to be independent and we are computing the probability of observing the whole sample $(\mathbf{X}_{1},Y_1),\ldots,(\mathbf{X}_{n},Y_n)$. This probability is equal to the product of the *probabilities of observing each $(\mathbf{X}_{i},Y_i)$*.
- $p(\mathbf{X}_i)^{Y_i}(1-p(\mathbf{X}_i))^{1-Y_i}$ is the probability of observing $(\mathbf{X}_{i},Y_i)$, as given by \@ref(eq:ber). Remember that $p$ depends on $\boldsymbol{\beta}$ due to \@ref(eq:eq-log).
Usually, the log-likelihood is considered instead of the likelihood for stability reasons -- the estimates obtained are exactly the same and are
\[
\hat{\boldsymbol{\beta}}=\arg\max_{\boldsymbol{\beta}\in\mathbb{R}^{k+1}}\log \text{lik}(\boldsymbol{\beta}).
\]
Unfortunately, due to the non-linearity of the optimization problem, there are no explicit expressions for $\hat{\boldsymbol{\beta}}$. They have to be obtained numerically by means of an iterative procedure (the number of iterations required is printed in the output of `summary`). In low-sample situations with perfect classification, the iterative procedure may not converge.
Figure \@ref(fig:maximumlikelihood) shows how the log-likelihood changes with respect to the values for $(\beta_0,\beta_1)$ in three data patterns.
(ref:maximumlikelihoodtitle) The logistic regression fit and its dependence on $\beta_0$ (horizontal displacement) and $\beta_1$ (steepness of the curve). Recall the effect of the sign of $\beta_1$ in the curve: if positive, the logistic curve has an 's' form; if negative, the form is a reflected 's'. Application also available [here](https://ec2-35-177-34-200.eu-west-2.compute.amazonaws.com/log-maximum-likelihood/).
```{r, maximumlikelihood, echo = FALSE, fig.cap = '(ref:maximumlikelihoodtitle)', screenshot.alt = "images/screenshots/log-maximum-likelihood.png", dev = 'png', cache = TRUE, fig.align = 'center', fig.pos = 'h!', out.width = '90%'}
knitr::include_app('https://ec2-35-177-34-200.eu-west-2.compute.amazonaws.com/log-maximum-likelihood/', height = '900px')
```
The data of the illustration has been generated with the following code:
```{r, cache = TRUE}
# Data
set.seed(34567)
x <- rnorm(50, sd = 1.5)
y1 <- -0.5 + 3 * x
y2 <- 0.5 - 2 * x
y3 <- -2 + 5 * x
y1 <- rbinom(50, size = 1, prob = 1 / (1 + exp(-y1)))
y2 <- rbinom(50, size = 1, prob = 1 / (1 + exp(-y2)))
y3 <- rbinom(50, size = 1, prob = 1 / (1 + exp(-y3)))
# Data
dataMle <- data.frame(x = x, y1 = y1, y2 = y2, y3 = y3)
```
Let's check that the coefficients given by `glm` are indeed the ones that maximize the likelihood shown in the animation of Figure \@ref(fig:maximumlikelihood). We do so for `y1 ~ x`.
```{r, echo = TRUE, collapse = TRUE, cache = TRUE}
mod <- glm(y1 ~ x, family = "binomial", data = dataMle)
mod$coefficients
```
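The maximization can also be replicated numerically. The following sketch codes the log-likelihood corresponding to \@ref(eq:eq-lik) for `y1 ~ x` and maximizes it with `optim` (the starting values `c(0, 0)` are an arbitrary choice); the optimum essentially coincides with `mod$coefficients`.
```{r, collapse = TRUE, cache = TRUE}
# Minus log-likelihood of (beta0, beta1) for the sample (x, y1)
minusLogLik <- function(beta) {
  p <- 1 / (1 + exp(-(beta[1] + beta[2] * dataMle$x)))
  -sum(dataMle$y1 * log(p) + (1 - dataMle$y1) * log(1 - p))
}
# Numerical minimization of the minus log-likelihood
optim(par = c(0, 0), fn = minusLogLik)$par
# Compare with the coefficients returned by glm
mod$coefficients
```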
```{block2, type = 'rmdexercise'}
For the regressions `y2 ~ x` and `y3 ~ x`, do the following:
- Check that the true $\boldsymbol{\beta}$ is close to maximizing the likelihood computed in Figure \@ref(fig:maximumlikelihood).
- Plot the fitted logistic curve and compare it with the one in Figure \@ref(fig:maximumlikelihood).
```
```{block, type = 'rmdinsight'}
In linear regression we relied on least squares estimation, in other words, on the minimization of the RSS. *Why do we need MLE in logistic regression and not least squares?* The answer is twofold:
1. **MLE is asymptotically optimal** when estimating unknown parameters in a model. This means that, when the sample size $n$ is large, it is **guaranteed to perform better than any other estimation method**. Therefore, considering a least squares approach for logistic regression would result in suboptimal estimates.
2. In **multiple linear regression**, due to the *normality* assumption, **MLE and least squares estimation coincide**. So MLE is hidden behind the more intuitive least squares procedure. Indeed, the maximized likelihood $\text{lik}(\hat{\boldsymbol{\beta}})$ in the linear model and the RSS are intimately related.
```
```{block, type = 'rmdinsight'}
As in the linear model, **the inclusion of a new predictor changes the coefficient estimates of the logistic model**.
```
## Assumptions of the model {#logreg-assumps}
Some probabilistic assumptions are required for performing inference on the model parameters $\boldsymbol{\beta}$ from the sample $(\mathbf{X}_1, Y_1),\ldots,(\mathbf{X}_n, Y_n)$. These assumptions are somewhat simpler than the ones for linear regression.
```{r, logisticmodel, echo = FALSE, out.width = '70%', fig.align = 'center', fig.pos = 'h!', fig.cap = 'The key concepts of the logistic model.', cache = TRUE}
knitr::include_graphics("images/R/logisticmodel.png")
```
The assumptions of the logistic model are the following:
i. **Linearity in the logit**^[An equivalent way of stating this assumption is $p(\mathbf{x})=\mathrm{logistic}(\beta_0+\beta_1x_1+\ldots+\beta_kx_k)$.]: $\mathrm{logit}(p(\mathbf{x}))=\log\frac{
p(\mathbf{x})}{1-p(\mathbf{x})}=\beta_0+\beta_1x_1+\ldots+\beta_kx_k$.
ii. **Binariness**: $Y_1,\ldots,Y_n$ are binary variables.
iii. **Independence**: $Y_1,\ldots,Y_n$ are independent.
A good one-line summary of the logistic model is the following (independence is assumed)
\begin{align}
Y|(X_1=x_1,\ldots,X_k=x_k)&\sim\mathrm{Ber}\left(\mathrm{logistic}(\beta_0+\beta_1x_1+\ldots+\beta_kx_k)\right)\nonumber\\
&=\mathrm{Ber}\left(\frac{1}{1+e^{-(\beta_0+\beta_1x_1+\ldots+\beta_kx_k)}}\right).(\#eq:condber)
\end{align}
There are three important points from the linear model assumptions that are missing in the ones for the logistic model:
- *Why is homoscedasticity not required?* As seen in the previous section, Bernoulli variables are determined only by the probability of success, in this case $p(\mathbf{x})$. That probability also determines the variance, which therefore changes with $\mathbf{x}$, so **there is heteroskedasticity**. In the linear model, we have to control $\sigma^2$ explicitly due to the higher flexibility of the normal.
- *Where are the errors?* The errors played a fundamental role in the linear model assumptions, but they are not employed in logistic regression. The errors are not fundamental for building the linear model, but rather a helpful concept related to least squares. The linear model can be stated without errors as \@ref(eq:condnorm), which has a logistic analogue in \@ref(eq:condber).
- *Why is normality not present?* A normal distribution is not adequate to replace the Bernoulli distribution in \@ref(eq:condber), since the response $Y$ *has to be binary* and the normal or any other continuous distribution would yield illegal values for $Y$.
```{block, type = 'rmdinsight'}
Recall that:
- Nothing is said about the distribution of $X_1,\ldots,X_k$. They could be deterministic or random. They could be discrete or continuous.
- $X_1,\ldots,X_k$ are **not required to be independent** between them.
```
## Inference for model parameters {#logreg-inference}
The assumptions on which the logistic model is constructed allow us to specify the *asymptotic* distribution of the *random vector* $\hat{\boldsymbol{\beta}}$. Again, the distribution is derived conditionally on the sample predictors $\mathbf{X}_1,\ldots,\mathbf{X}_n$. In other words, we assume that the randomness of $Y$ comes only from $Y|(X_1=x_1,\ldots,X_k=x_k)\sim\mathrm{Ber}(p(\mathbf{x}))$ and not from the predictors. To denote this, we employ lowercase for the sample predictors, $\mathbf{x}_1,\ldots,\mathbf{x}_n$.
There is an important difference between the inference results for the linear model and for logistic regression:
- **In linear regression the inference is exact**. This is due to the nice properties of the normal, least squares estimation and linearity. As a consequence, the distributions of the coefficients are perfectly known assuming that the assumptions hold.
- **In logistic regression the inference is asymptotic**. This means that the distributions of the coefficients are unknown except for large sample sizes $n$, for which we have **approximations**. The reason is the greater complexity of the model due to its non-linearity. This is the usual situation for the majority of regression models.
### Distributions of the fitted coefficients {#logreg-inference-distribs}
The distribution of $\hat{\boldsymbol{\beta}}$ is given by the asymptotic theory of MLE:
\begin{align}
\hat{\boldsymbol{\beta}}\sim\mathcal{N}_{k+1}\left(\boldsymbol{\beta},I(\boldsymbol{\beta})^{-1}\right)
(\#eq:mle1)
\end{align}
where $\sim$ must be understood as *approximately distributed as [...] when $n\to\infty$* for the rest of this chapter. $I(\boldsymbol{\beta})$ is known as the *Fisher information matrix*, and receives that name because *it measures the information available in the sample for estimating $\boldsymbol{\beta}$*. Therefore, the *larger* the matrix is, the more precise the estimation of $\boldsymbol{\beta}$ is, because that results in smaller variances in \@ref(eq:mle1). The inverse of the Fisher information matrix is
\begin{align}
I(\boldsymbol{\beta})^{-1}=(\mathbf{X}^T\mathbf{V}\mathbf{X})^{-1},
(\#eq:mle2)
\end{align}
where $\mathbf{V}$ is a diagonal matrix containing the *different* variances for each $Y_i$ (remember that $p(\mathbf{x})=1/(1+e^{-(\beta_0+\beta_1x_1+\ldots+\beta_kx_k)})$):
\[
\mathbf{V}=\begin{pmatrix}
p(\mathbf{X}_1)(1-p(\mathbf{X}_1)) & & &\\
& p(\mathbf{X}_2)(1-p(\mathbf{X}_2)) & & \\
& & \ddots & \\
& & & p(\mathbf{X}_n)(1-p(\mathbf{X}_n))
\end{pmatrix}
\]
In the case of the multiple linear regression, $I(\boldsymbol{\beta})^{-1}=\sigma^2(\mathbf{X}^T\mathbf{X})^{-1}$ (see \@ref(eq:normp)), so the presence of $\mathbf{V}$ here is revealing the heteroskedasticity of the model.
The interpretation of \@ref(eq:mle1) and \@ref(eq:mle2) gives some useful insights on what concepts affect the quality of the estimation:
- **Bias**. The estimates are *asymptotically* unbiased.
- **Variance**. It depends on:
- *Sample size $n$*. Hidden inside $\mathbf{X}^T\mathbf{V}\mathbf{X}$. As $n$ grows, the precision of the estimators increases.
- *Weighted predictor sparsity $(\mathbf{X}^T\mathbf{V}\mathbf{X})^{-1}$*. The more *sparse* the predictor is (small $|(\mathbf{X}^T\mathbf{V}\mathbf{X})^{-1}|$), the more precise $\hat{\boldsymbol{\beta}}$ is.
```{block2, type = 'rmdinsight'}
**The precision of $\hat{\boldsymbol{\beta}}$ is affected by the value of $\boldsymbol{\beta}$**, which is hidden inside $\mathbf{V}$. This contrasts sharply with the linear model, where the precision of the least squares estimator was not affected by the value of the unknown coefficients (see \@ref(eq:normp)). The reason is partially due to the **heteroskedasticity** of logistic regression, which implies a dependence of the variance of $Y$ in the logistic curve, hence in $\boldsymbol{\beta}$.
```
(ref:lograndomcoefstitle) Illustration of the randomness of the fitted coefficients $(\hat\beta_0,\hat\beta_1)$ and the influence of $n$, $(\beta_0,\beta_1)$ and $s_x^2$. The sample predictors $x_1,\ldots,x_n$ are fixed and new responses $Y_1,\ldots,Y_n$ are generated each time from a logistic model $Y|X=x\sim\mathrm{Ber}(p(X))$. Application also available [here](https://ec2-35-177-34-200.eu-west-2.compute.amazonaws.com/log-random/).
```{r, lograndomcoefs, echo = FALSE, fig.cap = '(ref:lograndomcoefstitle)', screenshot.alt = "images/screenshots/log-random.png", dev = 'png', cache = TRUE, fig.align = 'center', fig.pos = 'h!', out.width = '90%'}
knitr::include_app('https://ec2-35-177-34-200.eu-west-2.compute.amazonaws.com/log-random/', height = '1000px')
```
Similarly to linear regression, the problem with \@ref(eq:mle1) and \@ref(eq:mle2) is that *$\mathbf{V}$ is unknown* in practice, because it depends on $\boldsymbol{\beta}$. Plugging the estimate $\hat{\boldsymbol{\beta}}$ into $\mathbf{V}$ in place of $\boldsymbol{\beta}$ results in $\hat{\mathbf{V}}$. Now we can use $\hat{\mathbf{V}}$ to get
\begin{align}
\frac{\hat\beta_j-\beta_j}{\hat{\mathrm{SE}}(\hat\beta_j)}\sim \mathcal{N}(0,1),\quad\hat{\mathrm{SE}}(\hat\beta_j)^2=v_j^2(\#eq:mle3)
\end{align}
where
\[
v_j\text{ is the }j\text{-th element of the diagonal of }(\mathbf{X}^T\hat{\mathbf{V}}\mathbf{X})^{-1}.
\]
The LHS of \@ref(eq:mle3) is the **Wald statistic** for $\beta_j$, $j=0,\ldots,k$. These statistics are employed for building confidence intervals and hypothesis tests.
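To see \@ref(eq:mle2) and \@ref(eq:mle3) in action, the following sketch recomputes the standard errors of the Challenger fit from the design matrix and the fitted probabilities, and compares them with the ones reported by `summary` (it assumes `nasa` is the model fitted before):
```{r, collapse = TRUE, cache = TRUE}
# Design matrix and fitted probabilities
X <- model.matrix(nasa)
pHat <- nasa$fitted.values
# Estimated inverse Fisher information (X^T V X)^{-1}, with V estimated
V <- diag(pHat * (1 - pHat))
covBeta <- solve(t(X) %*% V %*% X)
# Standard errors: square roots of the diagonal
sqrt(diag(covBeta))
# The same standard errors given by summary
summary(nasa)$coefficients[, "Std. Error"]
```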
### Confidence intervals for the coefficients {#logreg-inference-cis}
Thanks to \@ref(eq:mle3), we can have the $100(1-\alpha)\%$ CI for the coefficient $\beta_j$, $j=0,\ldots,k$:
\begin{align}
\left(\hat\beta_j\pm\hat{\mathrm{SE}}(\hat\beta_j)z_{\alpha/2}\right)(\#eq:ciplog)
\end{align}
where $z_{\alpha/2}$ is the *upper $\alpha/2$-quantile of the $\mathcal{N}(0,1)$*. If we are interested in the CI for $e^{\beta_j}$, we can simply take exponentials in the above CI. Hence, the $100(1-\alpha)\%$ CI for $e^{\beta_j}$, $j=0,\ldots,k$, is
\[
e^{\left(\hat\beta_j\pm\hat{\mathrm{SE}}(\hat\beta_j)z_{\alpha/2}\right)}.
\]
Of course, this CI is **not** the same as $\left(e^{\hat\beta_j}\pm e^{\hat{\mathrm{SE}}(\hat\beta_j)z_{\alpha/2}}\right)$, which is not a valid CI for $e^{\beta_j}$.
Let's see how we can compute the CIs. We return to the `challenger` dataset, so in case you do not have it loaded, you can download it [here](https://raw.githubusercontent.com/egarpor/SSS2-UC3M/master/datasets/challenger.txt). We analyze the CI for the coefficients of `fail.field ~ temp`.
```{r, collapse = TRUE, cache = TRUE}
# Fit model
nasa <- glm(fail.field ~ temp, family = "binomial", data = challenger)
# Confidence intervals at 95%
confint(nasa)
# Confidence intervals at other levels
confint(nasa, level = 0.90)
# Confidence intervals for the factors affecting the odds
exp(confint(nasa))
```
In this example, the 95% confidence interval for $\beta_0$ is $(1.3364, 17.7834)$ and for $\beta_1$ it is $(-0.9238, -0.1090)$. For $e^{\beta_0}$ and $e^{\beta_1}$, the CIs are $(3.8053, 5.2874\times10^7)$ and $(0.3070, 0.8967)$, respectively. Therefore, we can say with 95% confidence that:
- When `temp`=0, the probability of `fail.field`=1 is significantly larger than the probability of `fail.field`=0 (using the CI for $\beta_0$). Indeed, `fail.field`=1 is between $3.8053$ and $5.2874\times10^7$ times more likely than `fail.field`=0 (using the CI for $e^{\beta_0}$).
- `temp` has a significantly negative effect on the probability of `fail.field`=1 (using the CI for $\beta_1$). Indeed, each unit increase in `temp` multiplies the odds of `fail.field` by a factor between $0.3070$ and $0.8967$ (using the CI for $e^{\beta_1}$).
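The CI formula \@ref(eq:ciplog) can also be applied by hand, as in the following sketch. Note that it reproduces `confint.default` (based on the same asymptotic approximation) rather than `confint`, a distinction discussed in Section \@ref(logreg-inference-tests):
```{r, collapse = TRUE, cache = TRUE}
# 95% CIs from the formula: estimate +- z_{alpha/2} * SE
est <- summary(nasa)$coefficients[, "Estimate"]
se <- summary(nasa)$coefficients[, "Std. Error"]
z <- qnorm(0.975)
cbind(lwr = est - z * se, upr = est + z * se)
# Coincides with confint.default
confint.default(nasa)
```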
```{block2, type = 'rmdexercise'}
Compute and interpret the CIs for the exponentiated coefficients, at level $\alpha=0.05$, for the following regressions (`challenger` dataset):
- `fail.field ~ temp + pres.field`
- `fail.nozzle ~ temp + pres.nozzle`
- `fail.field ~ temp + pres.nozzle`
- `fail.nozzle ~ temp + pres.field`
The interpretation of the variables is given above Table \@ref(tab:challengertable).
```
### Testing on the coefficients {#logreg-inference-tests}
The distributions in \@ref(eq:mle3) also allow us to conduct formal hypothesis tests on the coefficients $\beta_j$, $j=0,\ldots,k$. For example, the test for significance:
\begin{align*}
H_0:\beta_j=0
\end{align*}
for $j=0,\ldots,k$. The test of $H_0:\beta_j=0$ with $1\leq j\leq k$ is especially interesting, since it allows us to answer whether *the variable $X_j$ has a significant effect on $\mathbb{P}[Y=1]$*. The statistic used for testing significance is the Wald statistic
\begin{align*}
\frac{\hat\beta_j-0}{\hat{\mathrm{SE}}(\hat\beta_j)},
\end{align*}
which is asymptotically distributed as a $\mathcal{N}(0,1)$ *under (the veracity of) the null hypothesis*. $H_0$ is tested *against* the *two-sided* alternative hypothesis $H_1:\beta_j\neq 0$.
The tests for significance are built into the `summary` function. However, a note of caution is required when applying the following rule of thumb:
> Is the CI for $\beta_j$ below (above) $0$ at level $\alpha$?
>
>- Yes $\rightarrow$ reject $H_0$ at level $\alpha$.
>- No $\rightarrow$ the criterion is not conclusive.
```{block, type = 'rmdcaution'}
The significances given in `summary` and the output of `confint` are *slightly* incoherent, so the previous rule of thumb **does not apply**. The reason is that `MASS`'s `confint` uses a more sophisticated method (profile likelihood) to build the CIs, rather than the asymptotic distribution behind the Wald statistic.
By changing `confint` to `R`'s default `confint.default`, the results will be completely equivalent to the significances in `summary`, and the rule of thumb will remain completely valid. For the contents of this course we prefer `confint.default` due to its better interpretability.
```
To illustrate this we consider the regression of `fail.field ~ temp + pres.field`:
```{r, collapse = TRUE, cache = TRUE}
# Significances with asymptotic approximation for the standard errors
nasa2 <- glm(fail.field ~ temp + pres.field, family = "binomial",
data = challenger)
summary(nasa2)
# CIs with asymptotic approximation - coherent with summary
confint.default(nasa2, level = 0.90)
confint.default(nasa2, level = 0.99)
# CIs with profile likelihood - incoherent with summary
confint(nasa2, level = 0.90) # intercept still significant
confint(nasa2, level = 0.99) # temp still significant
```
```{block, type = 'rmdexercise'}
For the previous exercise, check the differences between using `confint` and `confint.default` for computing the CIs.
```
## Prediction {#logreg-prediction}
*Prediction* in logistic regression focuses mainly on predicting the values of the logistic curve
\[
p(x_1,\ldots,x_k)=\mathbb{P}[Y=1|X_1=x_1,\ldots,X_k=x_k]=\frac{1}{1+e^{-(\beta_0+\beta_1x_1+\ldots+\beta_kx_k)}}
\]
by means of
\[
\hat p(x_1,\ldots,x_k)=\hat{\mathbb{P}}[Y=1|X_1=x_1,\ldots,X_k=x_k]=\frac{1}{1+e^{-(\hat\beta_0+\hat\beta_1x_1+\ldots+\hat\beta_kx_k)}}.
\]
From the perspective of the linear model, this is the same as predicting the **conditional mean** (not the conditional response) of the response, except that this conditional mean is now also a **conditional probability**. The prediction of the conditional response is not so interesting, since it follows immediately from $\hat p(x_1,\ldots,x_k)$:
\[
\hat{Y}|(X_1=x_1,\ldots,X_k=x_k)=\left\{\begin{array}{ll}1,&\text{with probability }\hat p(x_1,\ldots,x_k),\\0,&\text{with probability }1-\hat p(x_1,\ldots,x_k).\end{array}\right.
\]
As a consequence, we can predict $Y$ as $1$ if $\hat p(x_1,\ldots,x_k)>\frac{1}{2}$ and as $0$ if $\hat p(x_1,\ldots,x_k)<\frac{1}{2}$.
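For example, applying this rule to the fitted probabilities of the Challenger model gives a quick in-sample classification of the launches (a sketch, assuming `nasa` is the model fitted before):
```{r, collapse = TRUE, cache = TRUE}
# Predict an incident whenever the fitted probability exceeds 1/2 and
# cross-tabulate against the observed incidents
yHat <- ifelse(nasa$fitted.values > 0.5, 1, 0)
table(observed = challenger$fail.field, predicted = yHat)
```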
Let's focus then on how to make predictions and compute CIs in practice with `predict`. Similarly to the linear model, the objects required for `predict` are: first, the output of `glm`; second, a `data.frame` containing the locations $\mathbf{x}=(x_1,\ldots,x_k)$ where we want to predict $p(x_1,\ldots,x_k)$. However, there are **two differences** with respect to the use of `predict` for `lm`:
- The argument `type`. `type = "link"` gives the predictions in the log-odds, that is, it returns $\log\frac{\hat p(x_1,\ldots,x_k)}{1-\hat p(x_1,\ldots,x_k)}$. `type = "response"` gives the predictions in the probability space $[0,1]$, that is, it returns $\hat p(x_1,\ldots,x_k)$.
- There is no `interval` argument in `predict` for `glm`. That means there is no easy, built-in way of computing CIs for the prediction.
Since it is a bit cumbersome to compute the CIs by yourself, we can code the function `predictCIsLogistic` so that it computes them automatically for you; see below.
```{r, collapse = TRUE, cache = TRUE}
# Data for which we want a prediction
# Important! You have to name the column with the predictor name!
newdata <- data.frame(temp = -0.6)
# Prediction of the conditional log-odds - the default
predict(nasa, newdata = newdata, type = "link")
# Prediction of the conditional probability
predict(nasa, newdata = newdata, type = "response")
# Function for computing the predictions and CIs for the conditional probability
predictCIsLogistic <- function(object, newdata, level = 0.95) {
# Compute predictions in the log-odds
pred <- predict(object = object, newdata = newdata, se.fit = TRUE)
# CI in the log-odds
za <- qnorm(p = (1 - level) / 2)
lwr <- pred$fit + za * pred$se.fit
upr <- pred$fit - za * pred$se.fit
# Transform to probabilities
fit <- 1 / (1 + exp(-pred$fit))
lwr <- 1 / (1 + exp(-lwr))
upr <- 1 / (1 + exp(-upr))
# Return a matrix with column names "fit", "lwr" and "upr"
result <- cbind(fit, lwr, upr)
colnames(result) <- c("fit", "lwr", "upr")
return(result)
}
# Simple call
predictCIsLogistic(nasa, newdata = newdata)
# The CI is large because there is no data around temp = -0.6 and
# that makes the prediction more variable (and also because we only
# have 23 observations)
```
```{block, type = 'rmdexercise'}
For the `challenger` dataset, do the following:
- Regress `fail.nozzle` on `temp` and `pres.nozzle`.
- Compute the predicted probability of `fail.nozzle=1` for `temp`$=15$ and `pres.nozzle`$=200$. What is the predicted probability for `fail.nozzle=0`?
- Compute the confidence interval for the two predicted probabilities at level $95\%$.
```
Finally, Figure \@ref(fig:logcipred) gives an interactive visualization of the CIs for the conditional probability in simple logistic regression. Their interpretation is very similar to that of the CIs for the conditional mean in the simple linear model; see Section \@ref(simplin-prediction) and Figure \@ref(fig:cipred).
(ref:logcipredtitle) Illustration of the CIs for the conditional probability in the simple logistic regression. Application also available [here](https://ec2-35-177-34-200.eu-west-2.compute.amazonaws.com/log-ci-prediction/).
```{r, logcipred, echo = FALSE, fig.cap = '(ref:logcipredtitle)', screenshot.alt = "images/screenshots/log-ci-prediction.png", dev = 'png', cache = TRUE, fig.align = 'center', fig.pos = 'h!', out.width = '90%'}