```{r, include = FALSE}
knitr::opts_chunk$set(error = FALSE, warning = FALSE, message = FALSE, out.width = '100%', fig.width = 8, fig.height = 5, fig.align = 'center')
backgroundcolor <- "#fffafa"
library(ggplot2)
```
# Confidence Intervals{#confint}
When we report point estimates, we should acknowledge and quantify the uncertainty in these estimates. Confidence intervals provide a way to quantify the precision of an estimate. By reporting an estimate with a confidence interval, results are reported as a range of values that should capture the true value of the parameter a desired percentage of the time. For example, when we report an effect size estimate with a 95% confidence interval, the expectation is that the interval is computed such that, in the long run, 95% of such intervals around the estimate contain the true parameter value.
## Population vs. Sample
In statistics, we differentiate between the population and the sample. The population is everyone you are interested in, such as all people in the world, elderly who are depressed, or people who buy innovative products. Your sample is everyone you were able to measure from the population you are interested in. We similarly distinguish between a parameter and a statistic. A parameter is a characteristic of the population, while a statistic is a characteristic of a sample. Sometimes, you have data about your entire population. For example, we have measured the height of all the people who have ever walked on the moon. We can calculate the average height of these twelve individuals, and so we know the true parameter. We do not need inferential statistics. However, we do not know the average height of all people who have ever walked on the earth. Therefore, we need to estimate this parameter, using a statistic based on a sample. Although it is rare that a study includes the entire population, it is not impossible, as illustrated in Figure \@ref(fig:population).
```{r, population, fig.margin = FALSE, echo = FALSE, fig.cap="Example of a registry-based study in which the entire population was included in the study. From https://doi.org/10.1093/ije/dyab066"}
knitr::include_graphics("images/population.png")
```
When the entire population is measured there is no need to perform a hypothesis test. After all, there is no population to generalize to (although it is possible to argue we are still making an inference, even when the entire population is observed, because we have observed a *metaphorical population* from one of many possible worlds, see @spiegelhalter_art_2019). When data from the entire population have been collected, the population effect size is known, and there is no confidence interval to compute. If the total population size is known, but not measured completely, then the confidence interval width should shrink to zero the closer a study gets to measuring the entire population. This shrinkage is achieved by the finite population correction factor for the variance of the estimator [@kish_survey_1965]. The variance of a sample mean is $\sigma^2/n$; for finite populations, the standard error is multiplied by the finite population correction factor:
$$FPC = \sqrt{\frac{(N - n)}{(N-1)}}$$
where *N* is the size of the population, and *n* is the size of the sample. When *N* is much larger than *n*, the correction factor will be close to 1 (and therefore this correction is typically ignored when populations are very large, even when populations are finite), and will not have a noticeable effect on the variance. When the total population is measured the correction factor is 0, such that the variance becomes 0 as well. For example, when the total population consists of 100 top athletes, and data is collected from a sample of 35 athletes, the finite population correction is $\sqrt{(100 - 35)/(100-1)}$ = `r round(sqrt((100 - 35)/(100-1)),2)`. The `superb` R package can compute population corrected confidence intervals [@cousineau_superb_2019].
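As a minimal sketch of how this correction works in practice (reusing the top-athlete example above; the standard deviation of 15 is a hypothetical value, and the `superb` package provides a more complete implementation):

```{r, eval = FALSE}
# Finite population correction applied to the standard error of a mean.
# N and n follow the athlete example in the text; sigma is a hypothetical value.
N <- 100                         # total population size
n <- 35                          # sample size
sigma <- 15                      # assumed population standard deviation
se <- sigma / sqrt(n)            # standard error ignoring the finite population
fpc <- sqrt((N - n) / (N - 1))   # finite population correction factor
round(c(fpc = fpc, se = se, se_corrected = se * fpc), 2)
```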
## What is a Confidence Interval?
Confidence intervals are a statement about the percentage of confidence intervals that contain the true parameter value. This behavior of confidence intervals is nicely visualized on this website by Kristoffer Magnusson: <http://rpsychologist.com/d3/CI/>. In Figure \@ref(fig:cisim) we see blue dots that represent means from a sample, and that fall around a red vertical line, which represents the true value of the parameter in the population. Due to variation in the sample, the estimates do not all fall on the red line. The horizontal lines around the blue dots are the confidence intervals. By default, the visualization shows 95% confidence intervals. Most of the lines are black (which means the confidence interval overlaps with the red line indicating the true population value), but some are red (indicating they do not capture the true population value). In the long run, 95% of the horizontal bars will be black, and 5% will be red.
```{r, cisim, fig.margin = FALSE, echo = FALSE, fig.cap="Series of simulated point estimates and confidence intervals"}
knitr::include_graphics("images/cisim.png")
```
We can now see what is meant by the sentence “Confidence intervals are a statement about the percentage of confidence intervals that contain the true parameter value“. In the long run, for 95% of the samples, the red line (the population parameter) is contained within the 95% confidence interval around the sample mean, and in 5% of the confidence intervals this is not true. As we will see when we turn to the formula for confidence intervals, the width of a confidence interval depends on the sample size and the standard deviation. The larger the sample size, the smaller the confidence intervals.
## The relation between confidence intervals and *p*-values {#relatCIp}
There is a direct relationship between the CI around an effect size and the statistical significance of a null-hypothesis significance test. For example, if an effect is statistically significant (*p* \< 0.05) in a two-sided independent *t*-test with an alpha of .05, the 95% CI for the mean difference between the two groups will not include zero. Confidence intervals are sometimes said to be more informative than *p*-values, because they do not only provide information about whether an effect is statistically significant (i.e., when the confidence interval does not overlap with the value representing the null hypothesis), but also communicate the precision of the effect size estimate. This is true, but as mentioned in the chapter on [*p*-values](#pvalue) it is still recommended to add exact *p*-values, which facilitates the re-use of results for secondary analyses [@appelbaum_journal_2018], and allows other researchers to compare the *p*-value to an alpha level they would have preferred to use [@lehmann_testing_2005].
In order to maintain the direct relationship between a confidence interval and a *p*-value, it is necessary to adjust the confidence interval level whenever the alpha level is adjusted. For example, if an alpha level of 5% is corrected for three comparisons to 0.05/3 = 0.0167, the corresponding confidence interval would be a 1 - 0.0167 = 0.9833, or 98.33%, confidence interval. Similarly, if a *p*-value is computed for a one-sided *t*-test, there is only an upper or lower limit of the interval, and the other end of the interval ranges to −∞ or ∞.
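As a minimal sketch (with hypothetical data), the confidence level of a *t*-test in R can be set to match an adjusted alpha level via the `conf.level` argument:

```{r, eval = FALSE}
# Keep the CI level aligned with a Bonferroni-adjusted alpha level (hypothetical data).
set.seed(123)
x <- rnorm(n = 50, mean = 0.3, sd = 1)
y <- rnorm(n = 50, mean = 0.0, sd = 1)
alpha_adjusted <- 0.05 / 3                     # adjusted alpha for three comparisons
t.test(x, y, conf.level = 1 - alpha_adjusted)  # reports a 98.33% confidence interval
```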
To maintain a direct relationship between an *F*-test and its confidence interval, a 90% CI for effect sizes from an *F*-test should be provided. The reason for this is explained by [Karl Wuensch](http://core.ecu.edu/psyc/wuenschk/docs30/CI-Eta2-Alpha.doc). Where Cohen’s d can take both positive and negative values, r² or η² are squared, and can therefore only take positive values. This is related to the fact that *F*-tests (as commonly used in ANOVA) are one-sided. If you calculate a 95% CI, you can get situations where the confidence interval includes 0, but the test reveals a statistical difference with a *p* < .05 (for a more mathematical explanation, see @steiger_beyond_2004). This means that a 95% CI around Cohen's *d* in an independent *t*-test equals a 90% CI around η² for exactly the same test performed as an ANOVA. As a final detail, because eta-squared cannot be smaller than zero, the lower bound of the confidence interval cannot be smaller than 0. This means that a confidence interval for an effect that is not statistically different from 0 has to start at 0. You report such a CI as 90% CI [.00; .XX], where XX is the upper limit of the CI.
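A sketch of such a 90% confidence interval, computed on hypothetical data with the `effectsize` package (which is also used later in this chapter); the `ci` argument sets the interval level:

```{r, eval = FALSE}
# 90% CI around eta squared for a one-way ANOVA on hypothetical data.
set.seed(456)
dat <- data.frame(
  score = c(rnorm(n = 30, mean = 0), rnorm(n = 30, mean = 0.5)),
  group = rep(c("a", "b"), each = 30)
)
fit <- aov(score ~ group, data = dat)
effectsize::eta_squared(fit, ci = 0.90)  # 90% CI matches a two-sided test at alpha = .05
```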
Confidence intervals are often used in forest plots that communicate the results from a meta-analysis. In the plot below, we see 4 rows. Each row shows the effect size estimate from one study (in Hedges’ g). For example, study 1 yielded an effect size estimate of 0.44, with a confidence interval around the effect size from 0.08 to 0.8. The horizontal black line, similarly to the visualization we played around with before, is the width of the confidence interval. When it does not touch the effect size 0 (indicated by a black vertical line) the effect is statistically significant.
```{r, meta, echo=FALSE, fig.cap="Meta-analysis of 4 studies"}
set.seed(2238)
nSims <- 4 # number of simulated studies
pop.m1 <- 106
pop.sd1 <- 15
pop.m2 <- 100
pop.sd2 <- 15
metadata <- data.frame(yi = numeric(0), vi = numeric(0))
for (i in 1:nSims) { # for each simulated study
  n <- sample(30:80, 1)
  x <- rnorm(n = n, mean = pop.m1, sd = pop.sd1) # produce simulated participants
  y <- rnorm(n = n, mean = pop.m2, sd = pop.sd2) # produce simulated participants
  metadata[i,1] <- metafor::escalc(n1i = n, n2i = n, m1i = mean(x), m2i = mean(y), sd1i = sd(x), sd2i = sd(y), measure = "SMD")$yi
  metadata[i,2] <- metafor::escalc(n1i = n, n2i = n, m1i = mean(x), m2i = mean(y), sd1i = sd(x), sd2i = sd(y), measure = "SMD")$vi
}
result <- metafor::rma(yi, vi, data = metadata, method = "FE")
par(mar=c(4,0,4,0))
par(bg = backgroundcolor)
metafor::forest(result, top = 0)
title("Forest plot for a simulated meta-analysis")
```
We can see, based on the fact that the confidence intervals do not overlap with 0, that studies 1 and 3 were statistically significant. The diamond shape next to the FE model (Fixed Effect model) is the meta-analytic effect size. Instead of using a black horizontal line, the upper limit and lower limit of the confidence interval are indicated by the left and right points of the diamond, and the center of the diamond is the meta-analytic effect size estimate. A meta-analysis calculates the effect size by combining and weighing all studies. The confidence interval for a meta-analytic effect size estimate is always narrower than that for a single study, because of the combined sample size of all studies included in the meta-analysis.
In the preceding section, we focused on examining whether the confidence interval overlapped with 0. This is a confidence interval approach to a null-hypothesis significance test. Even though we are not computing a *p*-value, we can directly see from the confidence interval whether *p* < $\alpha$. The confidence interval approach to hypothesis testing makes it quite intuitive to think about performing tests against non-zero null hypotheses [@bauer_unifying_1996]. For example, we could test whether we can reject an effect of 0.5 by examining whether the 95% confidence interval does not overlap with 0.5. We can test whether an effect is *smaller* than 0.5 by examining whether the 95% confidence interval falls completely *below* 0.5. We will see that this leads to a logical extension of null-hypothesis testing where, instead of testing to reject an effect of 0, we can test whether we can reject other effects of interest in **range predictions** and [**equivalence tests**](#equivalencetest).
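As a small sketch (with hypothetical one-sample data), this amounts to either testing directly against the non-zero value, or checking whether that value falls inside the confidence interval:

```{r, eval = FALSE}
# A confidence interval approach to testing against a non-zero value (hypothetical data).
set.seed(789)
x <- rnorm(n = 100, mean = 0.2, sd = 1)
t.test(x, mu = 0.5)   # test the mean directly against 0.5
t.test(x)$conf.int    # or check whether the 95% CI excludes 0.5
```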
## The Standard Error and 95% Confidence Intervals
To calculate a confidence interval, we need the standard error. The standard error (SE) estimates the variability between sample means that would be obtained by repeatedly taking samples from the same population. It is easy to confuse it with the standard deviation, which is the degree to which individuals within the sample differ from the sample mean. Formally, statisticians distinguish between σ and $\widehat{\sigma}$, where the hat means the value is estimated from a sample, and the lack of a hat means it is the population value – but I’ll leave out the hat, even though I’ll mostly talk about estimated values based on a sample in the formulas below. Mathematically (where σ is the standard deviation),
$$
Standard \ Error \ (SE) = \sigma/\sqrt n
$$
The standard error of the sample will tend to zero with increasing sample size, because the estimate of the population mean will become more and more accurate. The standard deviation of the sample will become more and more similar to the population standard deviation as the sample size increases, but it will not become smaller. Where the standard deviation is a statistic that is descriptive of your sample, the standard error describes bounds on a random sampling process.
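This difference is easy to see in a small simulation (a sketch with hypothetical samples drawn from a normal distribution with a standard deviation of 15): the sample standard deviation stabilizes around the population value as *n* grows, while the standard error keeps shrinking.

```{r, eval = FALSE}
# The SD stabilizes around the population value (15) while the SE shrinks with n.
set.seed(321)
for (n in c(10, 100, 10000)) {
  x <- rnorm(n = n, mean = 100, sd = 15)
  cat("n =", n, " SD =", round(sd(x), 2), " SE =", round(sd(x) / sqrt(n), 2), "\n")
}
```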
The Standard Error is used to construct confidence intervals (CI) around sample estimates, such as the mean, or differences between means, or whatever statistics you might be interested in. To calculate a confidence interval around a mean (indicated by the Greek letter mu: μ), we use the *t* distribution with the corresponding degrees of freedom (*df* : in a one-sample *t*-test, the degrees of freedom are n-1):
$$
\mu \pm t_{df, 1-(\alpha/2)} SE
$$
With a 95% confidence interval, $\alpha$ = 0.05, and thus the critical *t*-value for the corresponding degrees of freedom at the $1-\alpha/2$, or 0.975, quantile is calculated. Remember that a *t*-distribution has slightly thicker tails than a Z-distribution. Where the 0.975 quantile for a Z-distribution is 1.96, the value for a *t*-distribution with, for example, df = 19 is 2.093. This value is multiplied by the standard error, and added to the mean (for the upper limit of the confidence interval) or subtracted from the mean (for the lower limit of the confidence interval).
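A minimal sketch of this computation (with a hypothetical sample of 20 scores), which gives the same interval that `t.test()` reports:

```{r, eval = FALSE}
# 95% CI around a mean, computed by hand for a hypothetical sample of 20 scores.
set.seed(654)
x <- rnorm(n = 20, mean = 100, sd = 15)
se <- sd(x) / sqrt(length(x))           # standard error
t_crit <- qt(0.975, df = length(x) - 1) # critical t-value for df = 19
mean(x) + c(-1, 1) * t_crit * se        # lower and upper limit of the 95% CI
t.test(x)$conf.int                      # the same interval from t.test()
```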
## Overlapping Confidence Intervals
Confidence intervals are often used in plots. In Figure \@ref(fig:cioverlap) below, three estimates are visualized (the dots), surrounded by three lines (the 95% confidence intervals). The left two dots (X and Y) represent the *means* of the independent groups X and Y on a scale from 0 to 8 (see the axis from 0 to 8 on the left side of the plot). The dotted lines between the two confidence intervals visualize the overlap between the confidence intervals around the means. The two confidence intervals around the means in columns X and Y are what is commonly shown in a figure in a scientific article. The third, slightly larger dot is the *mean difference* between X and Y, and the slightly thicker line visualizes the confidence interval of this mean difference. The difference score is expressed using the axis on the right (from -3 to 5). In the plot below, the mean of group X is 3, the mean of group Y is 5.6, and the difference is 2.6. The plot is based on 50 observations per group, and the confidence interval around the mean difference ranges from 0.49 to 4.68, which is quite wide.
```{r, cioverlap, echo=FALSE, fig.cap="Means and 95% confidence intervals of two independent groups and the mean difference and its 95% confidence interval."}
set.seed(629)
x <- rnorm(n = 50, mean = 3, sd = 5) # get sample group 1
y <- rnorm(n = 50, mean = 5, sd = 5) # get sample group 2
d <- data.frame(
  labels = c("X", "Y", "Difference"),
  mean = c(mean(x), mean(y), mean(y) - mean(x)),
  lower = c(t.test(x)[[4]][1], t.test(y)[[4]][1], t.test(y, x)[[4]][1]),
  upper = c(t.test(x)[[4]][2], t.test(y)[[4]][2], t.test(y, x)[[4]][2])
)
par(bg = backgroundcolor)
plot(NA, xlim = c(.5, 3.5), ylim = c(0, max(d$upper[1:2] + 1)), bty = "l",
xaxt = "n", xlab = "", ylab = "Mean")
points(d$mean[1:2], pch = 19)
segments(1, d$mean[1], 5, d$mean[1], lty = 2)
segments(2, d$mean[2], 5, d$mean[2], lty = 2)
axis(1, 1:3, d$labels)
segments(1:2, d$lower[1:2], 1:2, d$upper[1:2])
axis(4, seq((d$mean[1] - 3), (d$mean[1] + 5), by = 1), seq(-3, 5, by = 1))
points(3, d$mean[1] + d$mean[3], pch = 19, cex = 1.5)
segments(3, d$mean[1] + d$lower[3], 3, d$mean[1] + d$upper[3], lwd = 2)
mtext("Difference", side = 4, at = d$mean[1], line = 3)
segments(1:1, d$upper[1:1], 1:2, d$upper[1:1], lty = 3)
segments(1:1, d$lower[1:2], 1:2, d$lower[1:2], lty = 3)
text(3, 1, paste("P-value", round(t.test(x, y)$p.value, digits = 3)))
```
As mentioned earlier, when a 95% confidence interval does not contain 0, the effect is statistically different from 0. In Figure \@ref(fig:cioverlap) above, the mean difference and the 95% confidence interval around it are indicated by the 'Difference' label. As the 95% confidence interval does not contain 0, the *t*-test is significant at an alpha of 0.05. The *p*-value is indicated in the plot as 0.016. Even though the two means differ statistically from each other, the confidence intervals around each mean overlap. One might intuitively believe that an effect is only statistically significant if the confidence intervals around the individual means do not overlap, but this is not true. The significance test is related to the confidence interval around the mean difference.
## Prediction Intervals
Even though 95% of future confidence intervals will contain the true parameter, a 95% confidence interval will not contain 95% of future individual observations. Sometimes, researchers want to predict the interval within which a single value will fall. This is called the prediction interval. It is always much wider than a confidence interval. The reason is that individual observations can vary substantially, but means of future samples (which fall within a normal confidence interval 95% of the time) will vary much less.
In Figure \@ref(fig:predictioninterval) the orange background illustrates the 95% confidence interval around the mean, and the lighter yellow background illustrates the 95% prediction interval (PI).
```{r, predictioninterval, echo=FALSE, fig.cap="A comparison of a 95% confidence interval (gold) and 95% prediction interval (yellow)."}
set.seed(1009)
n <- 20 # set sample size
nsims <- 100000 # set number of simulations
x <- rnorm(n = n, mean = 100, sd = 15) # create sample from normal distribution
# 95% Confidence Interval
ciu <- mean(x) + qt(0.975, df = n - 1) * sd(x) * sqrt(1 / n)
cil <- mean(x) - qt(0.975, df = n - 1) * sd(x) * sqrt(1 / n)
# 95% Prediction Interval
piu <- mean(x) + qt(0.975, df = n - 1) * sd(x) * sqrt(1 + 1 / n)
pil <- mean(x) - qt(0.975, df = n - 1) * sd(x) * sqrt(1 + 1 / n)
ggplot(as.data.frame(x), aes(x)) + # plot data
geom_rect(aes(xmin = pil, xmax = piu, ymin = 0, ymax = Inf),
fill = "gold") + # draw yellow PI area
geom_rect(aes(xmin = cil, xmax = ciu, ymin = 0, ymax = Inf),
fill = "#E69F00") + # draw orange CI area
geom_histogram(colour = "black", fill = "grey", aes(y = ..density..),
bins = 20) + xlab("Score") + ylab("frequency") +
theme_bw(base_size = 20) +
theme(plot.background = element_rect(fill = backgroundcolor)) +
theme(panel.background = element_rect(fill = backgroundcolor)) +
theme(panel.grid.major.x = element_blank(), axis.text.y = element_blank(),
panel.grid.minor.x = element_blank()) +
geom_vline(xintercept = mean(x), linetype = "dashed", size = 1) +
coord_cartesian(xlim = c(50, 150)) +
scale_x_continuous(breaks = c(seq(50, 150, 10))) +
annotate("text", x = mean(x), y = 0.02, label = paste(
"Mean = ", round(mean(x)), "\n",
"SD = ", round(sd(x)), sep = ""), size = 6.5)
```
To calculate the prediction interval, we need a slightly different formula for the standard error than the one used for the confidence interval, namely:
$$
Standard \ Error \ (SE) = \sigma\sqrt{1+1/n}
$$
When we rewrite the formula used for the confidence interval as $\sigma\sqrt{1/n}$, we see that the difference between the confidence interval and the prediction interval lies in the "1+", which always leads to wider intervals. Prediction intervals are **wider**, because they are constructed so that they will contain **a single future value** 95% of the time, instead of the **mean**. The fact that prediction intervals are wide is a good reminder that it is difficult to predict what will happen for any single individual.
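A short sketch (with a hypothetical sample of 20 scores) that mirrors the computation used for the figure above and shows how much wider the prediction interval is:

```{r, eval = FALSE}
# 95% confidence interval versus 95% prediction interval for the same hypothetical sample.
set.seed(987)
x <- rnorm(n = 20, mean = 100, sd = 15)
t_crit <- qt(0.975, df = length(x) - 1)
mean(x) + c(-1, 1) * t_crit * sd(x) * sqrt(1 / length(x))      # confidence interval
mean(x) + c(-1, 1) * t_crit * sd(x) * sqrt(1 + 1 / length(x))  # much wider prediction interval
```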
## Capture Percentages
It can be difficult to understand why a 95% confidence interval does not provide us with the interval where 95% of future means will fall. The percentage of means that falls within a single confidence interval is called the **capture percentage**. A capture percentage is not something we would ever use to make inferences about data, but it is useful to learn about capture percentages to prevent misinterpreting confidence intervals. In Figure \@ref(fig:metaci) we see two randomly simulated studies with the same sample size from the same population. The true effect size in both studies is 0, and we see that the 95% confidence intervals for both studies contain the true population value of 0. However, the two confidence intervals cover quite different ranges of effect sizes, with the confidence interval in Study 1 ranging from -0.07 to 0.48, and the confidence interval in Study 2 ranging from -0.50 to 0.06. It cannot be true that in the future, 95% of the effect sizes we should expect will fall between -0.07 and 0.48, **and** 95% of the effect sizes we should expect will fall between -0.50 and 0.06.
```{r, metaci, echo=FALSE, fig.cap="Meta-analysis of 2 simulated studies from the same population."}
set.seed(15)
nSims <- 2 # number of simulated studies
pop.m1 <- 0
pop.sd1 <- 1
pop.m2 <- 0
pop.sd2 <- 1
n <- 100
metadata <- data.frame(yi = numeric(0), vi = numeric(0))
for (i in 1:nSims) { # for each simulated study
  x <- rnorm(n = n, mean = pop.m1, sd = pop.sd1) # produce simulated participants
  y <- rnorm(n = n, mean = pop.m2, sd = pop.sd2) # produce simulated participants
  metadata[i,1] <- metafor::escalc(n1i = n, n2i = n, m1i = mean(x), m2i = mean(y), sd1i = sd(x), sd2i = sd(y), measure = "SMD")$yi
  metadata[i,2] <- metafor::escalc(n1i = n, n2i = n, m1i = mean(x), m2i = mean(y), sd1i = sd(x), sd2i = sd(y), measure = "SMD")$vi
}
result <- metafor::rma(yi, vi, data = metadata, method = "FE")
par(mar=c(4,0,4,0))
par(bg = backgroundcolor)
metafor::forest(result, top = 0, psize = 2)
```
The only situation in which a 95% confidence interval happens to also be a 95% capture percentage is when the observed effect size in a sample happens to be exactly the same as the true population parameter. In Figure \@ref(fig:metaci) that means we would need to observe an effect of exactly 0. However, you can’t know whether your observed effect size happens to be exactly the same as the population effect size. When a sample estimate is not identical to the true population value (which is almost always the case), less than 95% of future effect sizes will fall within the CI from your current sample. As we have observed two studies with observed effect sizes a bit removed from the true effect size, we will find effect size estimates in future studies that fall outside the observed 95% confidence interval quite often. So, the percentage of future means that fall within a single confidence interval depends upon which single confidence interval you happened to observe. Based on simulation studies it is possible to show that, on average and in the long run, a 95% CI has an 83.4% capture percentage [@cumming_confidence_2006].
## Calculating Confidence Intervals around Standard Deviations
If we calculate a standard deviation (SD) from a sample, this value is an estimate of the true value in the population. In small samples, our estimate can be quite far off. But due to the law of large numbers, as our sample size increases, we will be measuring the standard deviation more accurately. Since the sample standard deviation is an estimate with uncertainty, we can calculate a 95% confidence interval around it.
Keeping the uncertainty in our estimate of the standard deviation in mind can be important. When researchers perform an a-priori power analysis based on an effect size of interest expressed on a raw scale, they need accurate estimates of the standard deviation when performing the power analysis. Sometimes researchers will use pilot data to get an estimate of the standard deviation. Since the estimate of the population standard deviation based on a pilot study has some uncertainty, the sample size from the a-priori power analysis contains uncertainty (see the 'Test Yourself' questions below). Use validated or existing measures for which accurate estimates of the standard deviation in your population of interest are available. And keep in mind that all estimates from a sample have uncertainty.
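A sketch of such a computation, using the same chi-squared approach as the exercise code further below, for a sample standard deviation of 1 based on 100 observations:

```{r, eval = FALSE}
# 95% CI around a sample SD of 1 with n = 100, based on the chi-squared distribution.
n <- 100
st_dev <- 1
c_l <- sqrt((n - 1) / qchisq(0.025, n - 1, lower.tail = FALSE))  # lower bound factor
c_u <- sqrt((n - 1) / qchisq(0.025, n - 1, lower.tail = TRUE))   # upper bound factor
c(lower = st_dev * c_l, upper = st_dev * c_u)                    # roughly 0.88 to 1.16
```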
## Computing Confidence Intervals around Effect Sizes
Cohen [-@cohen_earth_1994] reflected on the reason confidence intervals were rarely reported in 1994: "I suspect that the main reason they are not reported is that they are so embarrassingly large!" This might be true, but another reason might have been that statistical software rarely provided confidence intervals around effect sizes at the time Cohen wrote his article. It has become increasingly easy to report confidence intervals with the popularity of free software packages in R, even though these packages might not yet provide solutions for all statistical tests. The [Journal Article Reporting Standards](https://apastyle.apa.org/jars/quantitative) recommend reporting "effect-size estimates and confidence intervals on estimates that correspond to each inferential test conducted, when possible".
One easy solution for calculating effect sizes and confidence intervals is [MOTE](https://www.aggieerin.com/shiny-server/), made by Dr. Erin Buchanan and her lab. The website comes with a full collection of tutorials, comparisons with other software packages, and demonstration videos giving accessible overviews of how to compute effect sizes and confidence intervals for a wide range of tests based on summary statistics. This means that whichever software you use to perform statistical tests, you can enter sample sizes and means, standard deviations, or test statistics to compute effect sizes and their confidence intervals. For example, the video below gives an overview of how to compute a confidence interval around Cohen's *d* for an independent *t*-test.
<iframe width="560" height="315" src="https://www.youtube.com/embed/kH3UOoFh9Ng?start=9" frameborder="0" allow="accelerometer; autoplay; encrypted-media; gyroscope; picture-in-picture" allowfullscreen></iframe>
MOTE is also available as an R package [@buchanan_mote_2017]. Although many solutions exist to compute Cohen's *d*, MOTE sets itself apart by allowing researchers to compute effect sizes and confidence intervals for many additional effect sizes, such as (partial) omega squared for between subjects ANOVA ($\omega^{2}$ and $\omega^{2}_p$), generalized omega squared for ANOVA ($\omega^{2}_G$), epsilon squared for ANOVA ($\varepsilon^{2}$) and (partial) generalized eta squared for ANOVA ($\eta^{2}_G$).
```{r MOTE}
MOTE::d.ind.t(m1 = 1.7, m2 = 2.1, sd1 = 1.01, sd2 = 0.96, n1 = 77, n2 = 78, a = .05)$estimate
```
MBESS is another R package that has a range of options to compute effect sizes and their confidence intervals [@kelley_confidence_2007]. The code below reproduces the example for MOTE above.
```{r}
MBESS::smd(Mean.1 = 1.7, Mean.2 = 2.1, s.1 = 1.01, s.2 = 0.96, n.1 = 77, n.2 = 78)
```
If you feel comfortable analyzing your data in R, the `effectsize` package offers a complete set of convenient solutions to compute effect sizes and confidence intervals [@ben-shachar_effectsize_2020].
```{r}
set.seed(33)
x <- rnorm(n = 20, mean = 0, sd = 2.5) #create sample from normal distribution
y <- rnorm(n = 200, mean = 1.5, sd = 3.5) #create sample from normal distribution
effectsize::cohens_d(x,y)
```
I am personally impressed by the way the `effectsize` package incorporates the state of the art (although I might be a bit biased). For example, after our recommendation to use Welch's *t*-test by default instead of Student's *t*-test [@delacre_why_2017], and the recommendation, based on a recent simulation study, to report Hedges’ $g_s^*$ as the effect size for Welch's *t*-test [@delacre_why_2021], the `effectsize` package was the first to incorporate it.
```{r}
effectsize::cohens_d(x,y, pooled_sd = FALSE)
```
The free statistical software packages [jamovi](https://www.jamovi.org/) and [JASP](https://jasp-stats.org/) are strong alternatives to SPSS that (unlike SPSS) allow users to compute Cohen's *d* and the confidence interval around it for both independent and dependent *t*-tests.
For jamovi, the ESCI module allows users to compute effect sizes and confidence intervals, and is accompanied by educational material that focuses more on estimation and less on testing [@cumming_introduction_2016].
```{r escijamovi, echo = FALSE, fig.cap="Output from ESCI module in jamovi."}
knitr::include_graphics("images/escijamovi.png")
```
JASP offers a wide range of frequentist and Bayesian analyses, and in addition to Cohen's *d* also allows users to compute omega squared $\omega^{2}$, the less biased version of $\eta^{2}$ [@okada_is_2013; @albers_when_2018].
```{r jasp1, fig.margin = FALSE, echo = FALSE, fig.cap="JASP menu option allows you to select Cohen's d and a CI around it."}
knitr::include_graphics("images/jaspeffectci1.png")
```
```{r jasp2, fig.margin = FALSE, echo = FALSE, fig.cap="JASP output returns Cohen's d and the confidence interval around it."}
knitr::include_graphics("images/jaspeffectci2.png")
```
## Test Yourself
**Q1**: Go to the online app by Kristoffer Magnusson:
<http://rpsychologist.com/d3/CI/>. You might want more confidence intervals to contain the true population parameter than 95%. Drag the ‘Slide me’ button to the far right, and you will see the simulation for 99% confidence intervals. Which statement is true?
A) The confidence intervals are larger, and the sample means fall closer to the true mean.
B) The confidence intervals are smaller, and the sample means fall closer to the true mean.
C) The confidence intervals are larger, and the sample means fall as close to the true mean as for a 95% confidence interval.
D) The confidence intervals are smaller, and the sample means fall as close to the true mean as for a 95% confidence interval.
**Q2**: As we could see from the formulas for confidence intervals, sample means and their confidence intervals depend on the sample size. We can change the sample size in the online app (see the setting underneath the visualization). By default, the sample size is set to 5. Change the sample size to 50 (you can type it in). Which statement is true?
A) The larger the sample size, the larger the confidence intervals. The sample size does not influence how the sample means vary around the true population mean.
B) The larger the sample size, the smaller the confidence intervals. The sample size does not influence how the sample means vary around the true population mean.
C) The larger the sample size, the larger the confidence intervals, and the closer the sample means are to the true population mean.
D) The larger the sample size, the smaller the confidence intervals, and the closer the sample means are to the true population mean.
**Q3**: In the forest plot below, we see the effect size (indicated by the square) and the confidence interval of the effect size (indicated by the line around the effect). Which of the studies 1 to 4 in the forest plot below were statistically significant?
```{r, metaQ, echo=FALSE, fig.cap="Meta-analysis of 4 studies"}
set.seed(229)
nSims <- 4 # number of simulated studies
pop.m1 <- 106
pop.sd1 <- 15
pop.m2 <- 100
pop.sd2 <- 15
metadata <- data.frame(yi = numeric(0), vi = numeric(0))
for (i in 1:nSims) { # for each simulated study
  n <- sample(30:80, 1)
  x <- rnorm(n = n, mean = pop.m1, sd = pop.sd1) # produce simulated participants
  y <- rnorm(n = n, mean = pop.m2, sd = pop.sd2) # produce simulated participants
  metadata[i,1] <- metafor::escalc(n1i = n, n2i = n, m1i = mean(x), m2i = mean(y), sd1i = sd(x), sd2i = sd(y), measure = "SMD")$yi
  metadata[i,2] <- metafor::escalc(n1i = n, n2i = n, m1i = mean(x), m2i = mean(y), sd1i = sd(x), sd2i = sd(y), measure = "SMD")$vi
}
result <- metafor::rma(yi, vi, data = metadata, method = "FE")
par(mar=c(4,0,4,0))
par(bg = backgroundcolor)
metafor::forest(result, top = 0)
title("Forest plot for a simulated meta-analysis")
```
A) Studies 1, 2, 3, and 4
B) Only study 3
C) None of the four studies
D) Studies 1, 2 and 4
**Q4**: The black diamond in the bottom row is the fixed effect meta-analytic effect size estimate. Instead of using a black horizontal line, the upper limit and lower limit of the confidence interval are indicated by the left and right points of the diamond. The center of the diamond is the meta-analytic effect size estimate. A meta-analysis calculates the effect size by combining and weighing all studies. Which statement is true?
A) The confidence interval for a fixed effect meta-analytic effect size estimate is always wider than that for a single study, because of the additional variation between studies.
B) The confidence interval for a fixed effect meta-analytic effect size estimate is always more narrow than that for a single study, because of the combined sample size of all studies included in the meta-analysis.
C) The confidence interval for a fixed effect meta-analytic effect size estimate does not become wider or more narrow compared to the confidence interval of a single study, it just becomes closer to the true population parameter.
**Q5**: Let’s assume a researcher calculates a mean of 7.5, and a standard deviation of 6.3, in a sample of 20 people. The critical value for a *t*-distribution with df = 19 is 2.093. Calculate the upper limit of the confidence interval around the mean using the formula below. Is it:
$$
\mu \pm t_{df, 1-(\alpha/2)} SE
$$
A) 1.40
B) 2.95
C) 8.91
D) 10.45
Copy the code below into R and run it. It will generate plots like the one in Figure \@ref(fig:cioverlap). Run the entire script as often as you want (notice the variability in the *p*-values due to the relatively low power of the test!) to answer the following question. The *p*-value in the plot will tell you whether the difference is statistically significant, and what the *p*-value is. Run the simulation until you find a *p*-value close to *p* = 0.05.
```{r, eval = FALSE}
x <- rnorm(n = 50, mean = 3, sd = 5) # get sample group 1
y <- rnorm(n = 50, mean = 5, sd = 5) # get sample group 2
d <- data.frame(
  labels = c("X", "Y", "Difference"),
  mean = c(mean(x), mean(y), mean(y) - mean(x)),
  lower = c(t.test(x)[[4]][1], t.test(y)[[4]][1], t.test(y, x)[[4]][1]),
  upper = c(t.test(x)[[4]][2], t.test(y)[[4]][2], t.test(y, x)[[4]][2])
)
plot(NA, xlim = c(.5, 3.5), ylim = c(0, max(d$upper[1:2] + 1)), bty = "l",
xaxt = "n", xlab = "", ylab = "Mean")
points(d$mean[1:2], pch = 19)
segments(1, d$mean[1], 5, d$mean[1], lty = 2)
segments(2, d$mean[2], 5, d$mean[2], lty = 2)
axis(1, 1:3, d$labels)
segments(1:2, d$lower[1:2], 1:2, d$upper[1:2])
axis(4, seq((d$mean[1] - 3), (d$mean[1] + 5), by = 1), seq(-3, 5, by = 1))
points(3, d$mean[1] + d$mean[3], pch = 19, cex = 1.5)
segments(3, d$mean[1] + d$lower[3], 3, d$mean[1] + d$upper[3], lwd = 2)
mtext("Difference", side = 4, at = d$mean[1], line = 3)
segments(1:1, d$upper[1:1], 1:2, d$upper[1:1], lty = 3)
segments(1:1, d$lower[1:2], 1:2, d$lower[1:2], lty = 3)
text(3, 1, paste("P-value", round(t.test(x, y)$p.value, digits = 3)))
```
**Q6**: How much do two 95% confidence intervals around individual means from independent groups overlap when the mean difference between the two means is only just statistically significant (*p* ≈ 0.05 at an alpha of 0.05)?
A) When the 95% confidence interval around one mean does not contain the mean of the other group, the groups always differ significantly from each other.
B) When the 95% confidence interval around one mean does not overlap with the 95% confidence interval of the mean of the other group, the groups always differ significantly from each other.
C) When the two confidence intervals around each mean overlap a little bit (the upper bound of one CI overlaps with the lower quarter of the confidence interval around the other mean), the groups differ significantly from each other at approximately *p* = 0.05.
D) There is no relationship between the overlap of the 95% confidence intervals around two independent means, and the *p*-value for the difference between these groups.
Note that this visual overlap rule can only be used when the comparison is made between independent groups, not between dependent groups! The 95% confidence interval around effect sizes is therefore typically more easily interpretable in relation to the significance of a test.
Let’s experience this through simulation. The simulation in the R script below generates a large number of additional samples after the initial one that was plotted. The simulation returns the percentage of confidence intervals that contain the true mean (which should be 95% in the long run). The simulation also returns the percentage of means from future studies that fall within the 95% CI of the original study, or the capture percentage. This capture percentage differs from (and is often lower, but sometimes higher, than) the confidence level.
```{r, eval = FALSE}
library(ggplot2)
backgroundcolor <- "#fffafa" # background color used in the theme() calls below
n <- 20 # set sample size
nsims <- 100000 # set number of simulations
x <- rnorm(n = n, mean = 100, sd = 15) # create sample from normal distribution
# 95% Confidence Interval
ciu <- mean(x) + qt(0.975, df = n - 1) * sd(x) * sqrt(1 / n)
cil <- mean(x) - qt(0.975, df = n - 1) * sd(x) * sqrt(1 / n)
# 95% Prediction Interval
piu <- mean(x) + qt(0.975, df = n - 1) * sd(x) * sqrt(1 + 1 / n)
pil <- mean(x) - qt(0.975, df = n - 1) * sd(x) * sqrt(1 + 1 / n)
ggplot(as.data.frame(x), aes(x)) + # plot data
geom_rect(aes(xmin = pil, xmax = piu, ymin = 0, ymax = Inf),
fill = "gold") + # draw yellow PI area
geom_rect(aes(xmin = cil, xmax = ciu, ymin = 0, ymax = Inf),
fill = "#E69F00") + # draw orange CI area
geom_histogram(colour = "black", fill = "grey", aes(y = ..density..), bins = 20) +
xlab("Score") +
ylab("frequency") +
theme_bw(base_size = 20) +
theme(panel.grid.major.x = element_blank(), axis.text.y = element_blank(),
panel.grid.minor.x = element_blank()) +
theme(plot.background = element_rect(fill = backgroundcolor)) +
theme(panel.background = element_rect(fill = backgroundcolor)) +
geom_vline(xintercept = mean(x), linetype = "dashed", size = 1) +
coord_cartesian(xlim = c(50, 150)) +
scale_x_continuous(breaks = c(seq(50, 150, 10))) +
annotate("text", x = mean(x), y = 0.02, label = paste(
"Mean = ", round(mean(x)), "\n",
"SD = ", round(sd(x)), sep = ""), size = 6.5)
# Simulate Confidence Intervals
ciu_sim <- numeric(nsims)
cil_sim <- numeric(nsims)
mean_sim <- numeric(nsims)
for (i in 1:nsims) { # for each simulated experiment
  x <- rnorm(n = n, mean = 100, sd = 15) # create sample from normal distribution
  ciu_sim[i] <- mean(x) + qt(0.975, df = n - 1) * sd(x) * sqrt(1 / n)
  cil_sim[i] <- mean(x) - qt(0.975, df = n - 1) * sd(x) * sqrt(1 / n)
  mean_sim[i] <- mean(x) # store means of each sample
}
# Save only those simulations where the true value was inside the 95% CI
ciu_sim <- ciu_sim[ciu_sim < 100]
cil_sim <- cil_sim[cil_sim > 100]
# Calculate how many times the observed mean fell within the 95% CI of the original study
mean_sim <- mean_sim[mean_sim > cil & mean_sim < ciu]
cat((100 * (1 - (length(ciu_sim) / nsims + length(cil_sim) / nsims))),
"% of the 95% confidence intervals contained the true mean")
cat("The capture percentage for the plotted study, or the % of values within
the observed confidence interval from", cil, "to", ciu,
"is:", 100 * length(mean_sim) / nsims, "%")
```
**Q7**: Run the simulations multiple times. Look at the output you will get in the R console. For example: “95.077 % of the 95% confidence intervals contained the true mean” and “The capture percentage for the plotted study, or the % of values within the observed confidence interval from 88.17208 to 103.1506 is: 82.377 %”. While running the simulations multiple times, look at the confidence interval around the sample mean, and relate this to the capture percentage. Run the simulation until you have seen a range of means closer and further away from the true mean in the simulation (100). Which statement is true?
A) The farther the sample mean is from the true population mean, the lower the capture percentage.
B) The farther the sample mean is from the true population mean, the higher the capture percentage.
**Q8**: Simulations in R are randomly generated, but you can make a specific simulation reproducible by setting the seed of the random generation process. Copy-paste “set.seed(1000)” to the first line of the R script, and run the simulation. The sample mean should be 94. What is the capture percentage? (Don’t forget to remove the set.seed command if you want to generate more random simulations!).
A) 95%
B) 42.1%
C) 84.3%
D) 89.2%
Capture percentages are rarely directly used to make statistical inferences. The main reason we discuss them here is really to prevent the common misunderstanding that 95% of future means fall within a single confidence interval: Capture percentages clearly show that is not true. Prediction intervals are also rarely used in psychology, but are more common in data science.
**Q9**: If you run the first lines of the code below, you will see that with an alpha level of 0.05, 100 observations, and a true standard deviation of 1, the 95% CI is [0.88; 1.16]. Change the assumed population standard deviation from 1 to 2 (st_dev <- 2). Keep all other settings the same. What is the 95% CI around the standard deviation of 2 with 100 observations?
```{r, eval=FALSE}
alpha_level <- 0.05 #set alpha level
n <- 100 #set number of observations
st_dev <- 1 #set true standard deviation
effect <- 0.5 #set effect size (raw mean difference)
# calculate lower and upper critical values c_l and c_u
c_l <- sqrt((n - 1)/qchisq(alpha_level/2, n - 1, lower.tail = FALSE))
c_u <- sqrt((n - 1)/qchisq(alpha_level/2, n - 1, lower.tail = TRUE))
# calculate lower and upper confidence interval for sd
st_dev * c_l
st_dev * c_u
# d based on lower bound of the 95CI around the SD
effect/(st_dev * c_l)
# d based on upper bound of the 95CI around the SD
effect/(st_dev * c_u)
pwr::pwr.t.test(d = effect/(st_dev * c_l), power = 0.9, sig.level = 0.05)
pwr::pwr.t.test(d = effect/(st_dev * c_u), power = 0.9, sig.level = 0.05)
# Power analysis for true standard deviation for comparison
pwr::pwr.t.test(d = effect/st_dev, power = 0.9, sig.level = 0.05)
```
A) 95% CI [1.38; 3.65]
B) 95% CI [1.76; 2.32]
C) 95% CI [1.82; 2.22]
D) 95% CI [1.84; 2.20]
**Q10**: Change the assumed population standard deviation back from 2 to 1. Lower the sample size from 100 to 20 (n <- 20). This will inform us about the width of the confidence interval for a standard deviation when we run a pilot study with 20 observations. Keep all other settings the same. What is the 95% CI around the standard deviation of 1 with 20 observations?
A) 95% CI [0.91; 1.11]
B) 95% CI [0.82; 1.28]
C) 95% CI [0.76; 1.46]
D) 95% CI [1.52; 2.92]
**Q11**: If we want the 95% CI around the standard deviation of 1 to be at most 0.05 away from the assumed population standard deviation, how large should our number of observations be? Note that this means we want the 95% CI to fall within 0.95 and 1.05. But notice from the calculations above that the distribution of the sample standard deviations is not symmetrical. Standard deviations can’t be smaller than 0 (because they are the square root of the variance). So in practice the question is: What is the **smallest** number of observations for the upper 95% CI to be smaller than 1.05? Replace n with each of the values in the answer options.
A) n = 489
B) n = 498
C) n = 849
D) n = 948
Let’s explore what the consequences of an inaccurate estimate of the population standard deviation are on a-priori power analyses. Let’s imagine we want to perform an a-priori power analysis for a smallest effect size of interest of half a scale point (on a scale from 1-5) on a measure that has an (unknown) true population standard deviation of 1.2.
**Q12**: Change the number of observations to 50. Change the assumed population standard deviation to 1.2. Keep the effect at 0.5. The 95% confidence interval for the standard deviation based on a sample of 50 observations ranges from 1.002 to 1.495. To perform an a-priori power analysis we need to calculate Cohen’s d, which is the difference divided by the standard deviation. In our example, we want to at least observe a difference of 0.5. What is Cohen’s d (effect/SD) for the lower bound of the 95% confidence interval (where SD = 1.002) or the upper bound (where SD = 1.495)?
A) d = 0.33 and d = 0.50
B) d = 0.40 and d = 0.60
C) d = 0.43 and d = 0.57
D) d = 0.29 and d = 0.55
If we draw a sample of 50 observations we can happen to observe a value that, due to random variation, is much smaller or much larger than the true population value. We can examine the effect this has on the number of observations that we think will be required when we perform an a-priori power analysis.
**Q13**: An a-priori power analysis is performed that uses the estimate of Cohen’s d based on the lower 95% CI of the standard deviation. Which statement is true?
A) Because the lower bound of the 95% CI is **smaller** than the true population SD, Cohen’s d is **smaller**, and the a-priori power analysis will yield a sample size that is **smaller** than the sample size we really need.
B) Because the lower bound of the 95% CI is **smaller** than the true population SD, Cohen’s d is **larger**, and the a-priori power analysis will yield a sample size that is **larger** than the sample size we really need.
C) Because the lower bound of the 95% CI is **smaller** than the true population SD, Cohen’s d is **smaller**, and the a-priori power analysis will yield a sample size that is **larger** than the sample size we really need.
D) Because the lower bound of the 95% CI is **smaller** than the true population SD, Cohen’s d is **larger**, and the a-priori power analysis will yield a sample size that is **smaller** than the sample size we really need.
**Q14**: Let’s check if our answer on the previous question was correct. We still have an alpha level of 0.05, n = 50, a standard deviation of 1.2, and an effect of interest of 0.5. Run the power analyses using the `pwr` package. The first power analysis uses Cohen’s d based on the lower bound of the 95% confidence interval. The second power analysis uses the upper bound of the 95% confidence interval. (There is also a third power analysis based on the (in real-life situations unknown) true standard deviation, just for comparison). Which statement is true (note that the sample size for a power analysis is rounded up, as we can't collect a partial observation)?
A) The sample size per group is 68 when calculating the effect size based on the lower bound on the 95% CI around the standard deviation, and 86 when using the upper bound of the 95% CI around the standard deviation.
B) The sample size per group is 68 when calculating the effect size based on the lower bound on the 95% CI around the standard deviation, and 123 when using the upper bound of the 95% CI around the standard deviation.
C) The sample size per group is 86 when calculating the effect size based on the lower bound on the 95% CI around the standard deviation, and 123 when using the upper bound of the 95% CI around the standard deviation.
D) The sample size per group is 86 when calculating the effect size based on the lower bound on the 95% CI around the standard deviation, and 189 when using the upper bound of the 95% CI around the standard deviation.
### Open Questions
1. What is the definition of a confidence interval?
2. How is a confidence interval related to statistical significance?
3. What happens to a confidence interval when the sample size increases?
4. What is the difference between a confidence interval and a capture percentage?
5. What is a prediction interval?
6. If you have data from the entire population, do you need to calculate a confidence interval?
7. What are confidence intervals a statement about?
8. What does it mean to say that after you have collected the data, the confidence interval either contains the true parameter, or it doesn’t?
9. What is the difference between estimates from small vs. large samples?