# Inference for paired data
<!-- Please don't mess with the next few lines! -->
<style>h5{font-size:2em;color:#0000FF}h6{font-size:1.5em;color:#0000FF}div.answer{margin-left:5%;border:1px solid #0000FF;border-left-width:10px;padding:25px} div.summary{background-color:rgba(30,144,255,0.1);border:3px double #0000FF;padding:25px}</style>`r options(scipen=999)`<p style="color:#ffffff">`r intToUtf8(c(50,46,48))`</p>
<!-- Please don't mess with the previous few lines! -->
::: {.summary}
### Functions introduced in this chapter {-}
No new R functions are introduced here.
:::
## Introduction
In this chapter we will learn how to run inference for two paired numerical variables.
### Install new packages
There are no new packages used in this chapter.
### Download the R notebook file
Check the upper-right corner in RStudio to make sure you're in your `intro_stats` project. Then click on the following link to download this chapter as an R notebook file (`.Rmd`).
<a href = "https://vectorposse.github.io/intro_stats/chapter_downloads/20-inference_for_paired_data.Rmd" download>https://vectorposse.github.io/intro_stats/chapter_downloads/20-inference_for_paired_data.Rmd</a>
Once the file is downloaded, move it to your project folder in RStudio and open it there.
### Restart R and run all chunks
In RStudio, select "Restart R and Run All Chunks" from the "Run" menu.
## Load packages
We load the standard `tidyverse` and `infer` packages. The `openintro` package will give access to the `textbooks` data and the `hsb2` data.
```{r}
library(tidyverse)
library(infer)
library(openintro)
```
## Paired data
Sometimes data sets have two numerical variables that are related to each other. For example, a diet study might include a pre-weight and a post-weight. The research question is not about either of these variables directly, but rather the difference between the variables, for example how much weight was lost during the diet.
When this is the case, we run inference for paired data. The procedure involves calculating a new variable `d` that represents the difference of the two paired variables. The null hypothesis is almost always that there is no difference between the paired variables, and that translates into the statement that the average value of `d` is zero.
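As a minimal sketch of this idea, the chunk below builds a small, made-up diet data set and computes the difference variable `d` with `mutate`. The `toy_diet` tibble, its column names, and all of its values are hypothetical and do not come from any real study.

```{r}
library(tidyverse)

# Hypothetical weights (in pounds) for five people before and after a diet.
toy_diet <- tibble(
    pre_weight  = c(200, 185, 172, 210, 195),
    post_weight = c(192, 186, 165, 201, 190)
)

# Each row is one person, so the two columns are paired.
# d is the weight lost: positive values mean the person lost weight.
toy_diet_d <- toy_diet %>%
    mutate(d = pre_weight - post_weight)
toy_diet_d

# The null hypothesis says the mean of d is zero in the population.
toy_d_mean <- mean(toy_diet_d$d)
toy_d_mean
```

The inference that follows would then be carried out on the single column `d`, not on `pre_weight` or `post_weight` separately.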
## Research question
The `textbooks` data frame (from the `openintro` package) has data on the price of books at the UCLA bookstore versus Amazon.com. The question of interest here is whether the campus bookstore charges more than Amazon.
## Inference for paired data
The key idea is that we don't actually care about the book prices themselves. All we care about is if there is a difference between the prices for each book. These are not two independent variables because each row represents a single book. Therefore, the two measurements are "paired" and should be treated as a single numerical variable of interest, representing the difference between `ucla_new` and `amaz_new`.
Since we're only interested in analyzing the one numerical variable `d`, this process is nothing more than a one-sample t test. Therefore, there is really nothing new in this chapter.
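The claim that a paired test is just a one-sample t test on the differences can be checked numerically. The sketch below uses base R's `t.test`, which is not part of this chapter's `infer` workflow and is used here only as a cross-check. The vectors `toy_store` and `toy_online` are made-up prices, not values from `textbooks`.

```{r}
# Hypothetical paired prices (made-up numbers for illustration only).
toy_store  <- c(52, 41, 88, 30, 65, 47)
toy_online <- c(45, 43, 70, 28, 58, 40)

# Paired t test on the two columns.
paired_fit <- t.test(toy_store, toy_online,
                     paired = TRUE, alternative = "greater")

# One-sample t test on the differences, with null value 0.
one_sample_fit <- t.test(toy_store - toy_online,
                         mu = 0, alternative = "greater")

# Compare the t scores, degrees of freedom, and P-values side by side.
rbind(
    paired     = c(paired_fit$statistic, paired_fit$parameter,
                   p = paired_fit$p.value),
    one_sample = c(one_sample_fit$statistic, one_sample_fit$parameter,
                   p = one_sample_fit$p.value)
)
```

Both rows should agree, which is why no new machinery is needed for paired data.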
Let's go through the rubric.
## Exploratory data analysis
### Use data documentation (help files, code books, Google, etc.) to determine as much as possible about the data provenance and structure.
You should type `?textbooks` at the Console to read the help file. The data was collected by a person, David Diez. A quick Google search reveals that he is a statistician who graduated from UCLA. We presume he had access to accurate information about the prices of books at the UCLA bookstore and from Amazon.com at the time the data was collected.
Here is the data set:
```{r}
textbooks
```
```{r}
glimpse(textbooks)
```
The two paired variables are `ucla_new` and `amaz_new`.
### Prepare the data for analysis.
Generally, we will need to create a new variable `d` that represents the difference between the two paired variables of interest. This uses the `mutate` command that adds an extra column to our data frame. The order of subtraction usually does not matter, but we will want to keep track of that order so that we can interpret our test statistic correctly. In the case of a one-sided test (which this is), it is especially important to keep track of the order of subtraction. Since we suspect the bookstore will charge more than Amazon, let's subtract in that order. Our hunch is that it will be a positive number, on average.
```{r}
textbooks_d <- textbooks %>%
    mutate(d = ucla_new - amaz_new)
textbooks_d
```
If you look closely at the tibble above, you will see that there is a column already in our data called `diff`. It is the same as the column `d` we just created. So in this case, we didn't really need to create a new difference variable. However, since most data sets do not come pre-prepared with such a difference variable, it is good to know how to make one if needed.
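To see why it helps to keep track of the order of subtraction, here is a short sketch with made-up prices. The vectors `toy_ucla` and `toy_amaz` are hypothetical and are not taken from `textbooks`. Reversing the order flips the sign of every difference, and therefore the sign of the mean.

```{r}
# Hypothetical prices for six books at two sellers.
toy_ucla <- c(52, 41, 88, 30, 65, 47)
toy_amaz <- c(45, 43, 70, 28, 58, 40)

# Bookstore minus Amazon: positive values mean the bookstore charges more.
mean_ucla_minus_amaz <- mean(toy_ucla - toy_amaz)

# Amazon minus bookstore: the same information with the opposite sign.
mean_amaz_minus_ucla <- mean(toy_amaz - toy_ucla)

c(ucla_minus_amaz = mean_ucla_minus_amaz,
  amaz_minus_ucla = mean_amaz_minus_ucla)
```

With the second ordering, a one-sided alternative of "the bookstore charges more" would have to be written as a negative mean difference, so the direction of the test would flip as well.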
### Make tables or plots to explore the data visually.
Here are summary statistics, a histogram, and a QQ plot for `d`.
```{r}
summary(textbooks_d$d)
```
```{r}
ggplot(textbooks_d, aes(x = d)) +
    geom_histogram(binwidth = 10, boundary = 0)
```
```{r}
ggplot(textbooks_d, aes(sample = d)) +
    geom_qq() +
    geom_qq_line()
```
The data is somewhat skewed to the right with one observation that might be a bit of an outlier. If the sample size were much smaller, we might be concerned about this point. However, it's not much higher than other points in that right tail, and it doesn't appear that its inclusion or exclusion will change the overall conclusion much. If you are concerned that the point might alter the conclusion, run the hypothesis test twice, once with and once without the outlier present to see if the main conclusion changes.
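The "run it twice" check can be sketched as follows. Everything here is an assumption for illustration: the vector `toy_diffs` is made up (with one large value in the right tail), and `one_sided_t` is a hypothetical helper written with base R functions, not a function from `infer`.

```{r}
# Hypothetical differences with one large value in the right tail.
toy_diffs <- c(3, 5, 8, 2, 12, 7, 4, 9, 6, 10, 5, 45)

# Hypothetical helper: one-sided, one-sample t test of H0: mu = 0.
one_sided_t <- function(x) {
    t_stat <- mean(x) / (sd(x) / sqrt(length(x)))
    p_val  <- pt(t_stat, df = length(x) - 1, lower.tail = FALSE)
    c(t = t_stat, p = p_val)
}

# Run the test with every value, then again without the largest value.
with_outlier    <- one_sided_t(toy_diffs)
without_outlier <- one_sided_t(toy_diffs[toy_diffs != max(toy_diffs)])
rbind(with_outlier, without_outlier)
```

In this made-up example both versions give a P-value below 0.05, so the conclusion would not hinge on the extreme point. If the two runs disagreed, that would be worth reporting alongside the main result.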
## Hypotheses
### Identify the sample (or samples) and a reasonable population (or populations) of interest.
The sample consists of `r NROW(textbooks_d)` textbooks. The population is all textbooks that might be sold both at the UCLA bookstore and on Amazon.
### Express the null and alternative hypotheses as contextually meaningful full sentences.
$H_{0}:$ There is no difference in textbook prices between the UCLA bookstore and Amazon.
$H_{A}:$ Textbook prices at the UCLA bookstore are higher on average than on Amazon.
Commentary: Note we are performing a one-sided test. If we are conducting our own research with our own data, we can decide whether we want to run a two-sided or one-sided test. Remember that we only do the latter when we have a strong hypothesis in advance that the difference should be clearly in one direction and not the other. In this case, it's not up to us. We have to respect the research question as it was given to us: "The question of interest here is whether the campus bookstore charges more than Amazon."
##### Exercise 1 {-}
What would the research question say if we were supposed to run a two-sided test instead? In other words, write down a slightly different research question about textbook prices that would prompt us to run a two-sided test.
::: {.answer}
Please write up your answer here.
:::
### Express the null and alternative hypotheses in symbols (when possible).
$H_{0}: \mu_{d} = 0$
$H_{A}: \mu_{d} > 0$
Commentary: Since we're really just doing a one-sample t test, we could just call this parameter $\mu$, but the subscript $d$ is a good reminder that it's the mean of the difference variable we care about (as opposed to the mean price of all the books at the UCLA bookstore or the mean price of all the same books on Amazon).
## Model
### Identify the sampling distribution model.
We use a t model with 72 degrees of freedom.
##### Exercise 2 {-}
Explain how we got 72 degrees of freedom.
::: {.answer}
Please write up your answer here.
:::
### Check the relevant conditions to ensure that model assumptions are met.
* Random
    - We do not know exactly how David Diez obtained this sample, but the help file claims it is a random sample.
* 10%
    - We do not know how many total textbooks were available at the UCLA bookstore at the time the sample was taken, so we do not know if this condition is met. As long as there were at least 730 books, we are okay. We suspect that, based on the size of UCLA and the number of course offerings there, this is a reasonable assumption.
* Nearly normal
    - Although the sample distribution is skewed (with a possible mild outlier), the sample size is more than 30.
## Mechanics
### Compute the test statistic.
```{r}
d_mean <- textbooks_d %>%
    specify(response = d) %>%
    calculate(stat = "mean")
d_mean
```
```{r}
d_t <- textbooks_d %>%
    specify(response = d) %>%
    hypothesize(null = "point", mu = 0) %>%
    calculate(stat = "t")
d_t
```
### Report the test statistic in context (when possible).
The mean difference in textbook prices is `r d_mean %>% pull(1)`.
The value of t is `r d_t %>% pull(1)`. The mean difference in textbook prices is more than 7 standard errors above a difference of zero.
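The phrase "standard errors above zero" comes from the usual one-sample formula $t = (\bar{d} - 0)/(s_{d}/\sqrt{n})$. The sketch below applies that formula by hand with base R. The vector `toy_price_d` is made up and is not the `d` column from `textbooks_d`.

```{r}
# Hypothetical differences (made-up numbers for illustration only).
toy_price_d <- c(7, -2, 18, 2, 7, 7)

# Standard error of the mean difference: s_d / sqrt(n).
n_toy  <- length(toy_price_d)
se_toy <- sd(toy_price_d) / sqrt(n_toy)

# t score: how many standard errors the sample mean is from the null value 0.
t_by_hand <- (mean(toy_price_d) - 0) / se_toy
t_by_hand
```

The result should match the t score that a one-sample t test reports for the same numbers.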
### Plot the null distribution.
```{r}
price_test <- textbooks_d %>%
    specify(response = d) %>%
    assume("t")
price_test
```
```{r}
price_test %>%
    visualize() +
    shade_p_value(obs_stat = d_t, direction = "greater")
```
### Calculate the P-value.
```{r}
price_test_p <- price_test %>%
    get_p_value(obs_stat = d_t, direction = "greater")
price_test_p
```
### Interpret the P-value as a probability given the null.
$P < 0.001$. If there were no difference in textbook prices between the UCLA bookstore and Amazon, there would be only a `r 100 * price_test_p %>% pull(1)`% chance of seeing data at least as extreme as what we saw. (Note that the number is so small that it rounds to zero in the inline code above. That zero is technically incorrect. The P-value is never exactly zero. That's why we are also careful to state $P < 0.001$.)
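A one-sided P-value like this is the area under the t curve to the right of the observed t score. The sketch below computes such an area with base R's `pt` and then formats it as an inequality with `format.pval`. The t score of 7.5 is a hypothetical stand-in, not the value computed above; the 72 degrees of freedom match the model in this chapter.

```{r}
# Assumed values for illustration: 72 degrees of freedom (as in this chapter)
# and a hypothetical t score of 7.5 (not the value computed above).
toy_t  <- 7.5
toy_df <- 72

# One-sided P-value: area under the t curve to the right of toy_t.
toy_p <- pt(toy_t, df = toy_df, lower.tail = FALSE)
toy_p

# Report very small P-values as an inequality instead of a rounded zero.
toy_p_label <- format.pval(toy_p, eps = 0.001)
toy_p_label
```

The raw value is tiny but strictly positive, which is why reporting an inequality is more honest than printing a rounded zero.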
## Conclusion
### State the statistical conclusion.
We reject the null hypothesis.
### State (but do not overstate) a contextually meaningful conclusion.
We have sufficient evidence that UCLA prices are higher than Amazon prices.
Commentary: Note that because we performed a one-sided test, our conclusion is also one-sided in the hypothesized direction.
### Express reservations or uncertainty about the generalizability of the conclusion.
We can be confident about the validity of this data, and therefore the conclusion drawn. We should be careful to limit our conclusion to the UCLA bookstore (and not extrapolate the findings, say, to other campus bookstores). Depending on when this data was collected, we may not be able to say anything about current prices at the UCLA bookstore either.
### Identify the possibility of either a Type I or Type II error and state what making such an error means in the context of the hypotheses.
If we made a Type I error, that would mean there was actually no difference in textbook prices, but that we got an unusual sample that detected a difference.
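One way to make the idea of a Type I error concrete is a small simulation. All of the settings below are assumptions chosen for illustration: 30 differences per sample, drawn from a normal population whose true mean difference is zero, 2000 repetitions, and a 0.05 significance level. Because the null hypothesis is true by construction, every rejection in this simulation is a Type I error.

```{r}
set.seed(1)

# Assumed settings: none of these come from the textbook data.
n_sims  <- 2000
n_pairs <- 30
alpha   <- 0.05

# Simulate samples of differences when the true mean difference is zero,
# and record the one-sided P-value from each sample.
sim_p_values <- replicate(n_sims, {
    sim_d <- rnorm(n_pairs, mean = 0, sd = 10)
    t.test(sim_d, mu = 0, alternative = "greater")$p.value
})

# Proportion of samples that (wrongly) reject the null hypothesis.
type_1_rate <- mean(sim_p_values < alpha)
type_1_rate
```

The proportion should land near the chosen significance level. Those are the "unusual samples" that appear to detect a difference when none exists.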
## Confidence interval
### Check the relevant conditions to ensure that model assumptions are met.
All necessary conditions have already been checked.
### Calculate and graph the confidence interval.
```{r}
price_ci <- price_test %>%
    get_confidence_interval(point_estimate = d_mean, level = 0.95)
price_ci
```
```{r}
price_test %>%
    visualize() +
    shade_confidence_interval(endpoints = price_ci)
```
### State (but do not overstate) a contextually meaningful interpretation.
We are 95% confident that the true mean difference in textbook prices between the UCLA bookstore and Amazon is captured in the interval (`r price_ci$lower_ci`, `r price_ci$upper_ci`). This was obtained by subtracting the Amazon price from the UCLA bookstore price. (In other words, since all differences in the confidence interval are positive, all plausible differences indicate that the UCLA prices are higher than the Amazon prices.)
Commentary: Don't forget that any time we find a number that represents a difference, we have to be clear in the conclusion about the direction of subtraction. Otherwise, we have no idea how to interpret positive and negative values.
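For readers who want to see the arithmetic behind a t-based interval, the sketch below computes one by hand as $\bar{d} \pm t^{*} \cdot s_{d}/\sqrt{n}$. It uses base R functions and a made-up vector `toy_ci_d`, so the numbers are not the ones reported above. It is meant only as an illustration of the formula, assuming the standard one-sample t interval.

```{r}
# Hypothetical differences (made-up numbers for illustration only).
toy_ci_d <- c(7, -2, 18, 2, 7, 7)

# Standard error of the mean difference.
n_ci  <- length(toy_ci_d)
se_ci <- sd(toy_ci_d) / sqrt(n_ci)

# Critical value for 95% confidence with n - 1 degrees of freedom.
t_star <- qt(0.975, df = n_ci - 1)

# Point estimate plus or minus the margin of error.
ci_by_hand <- mean(toy_ci_d) + c(-1, 1) * t_star * se_ci
ci_by_hand
```

As with the interval above, the endpoints only make sense once the order of subtraction used to build the differences is stated.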
### If running a two-sided test, explain how the confidence interval reinforces the conclusion of the hypothesis test.
The confidence interval does not contain zero, which means that zero is not a plausible value for the difference in textbook prices.
### When comparing two groups, comment on the effect size and the practical significance of the result.
To think about the practical significance, imagine that you were a student at UCLA and that every textbook you needed was (on average) $10 to $15 more expensive in the bookstore than purchasing on Amazon. Multiplied across the number of textbooks you need, that could amount to a significant increase in expenses. In other words, that dollar figure is not likely a trivial amount of money for many students who require multiple textbooks each semester.
## Your turn
The `hsb2` data set contains data from a random sample of 200 high school seniors from the "High School and Beyond" survey conducted by the National Center for Education Statistics. It contains, among other things, students' scores on standardized tests in math, reading, writing, science, and social studies. We want to know if students do better on the math test or on the reading test.
Run inference to determine if there is a difference between math scores and reading scores.
The rubric outline is reproduced below. You may refer to the worked example above and modify it accordingly. Remember to strip out all the commentary. That is just exposition for your benefit in understanding the steps, but is not meant to form part of the formal inference process.
Another word of warning: the copy/paste process is not a substitute for your brain. You will often need to modify more than just the names of the data frames and variables to adapt the worked examples to your own work. Do not blindly copy and paste code without understanding what it does. And you should **never** copy and paste text. All the sentences and paragraphs you write are expressions of your own analysis. They must reflect your own understanding of the inferential process.
**Also, so that your answers here don't mess up the code chunks above, use new variable names everywhere.**
##### Exploratory data analysis {-}
###### Use data documentation (help files, code books, Google, etc.) to determine as much as possible about the data provenance and structure. {-}
::: {.answer}
Please write up your answer here.
```{r}
# Add code here to print the data
```
```{r}
# Add code here to glimpse the variables
```
:::
###### Prepare the data for analysis. [Not always necessary.] {-}
::: {.answer}
```{r}
# Add code here to prepare the data for analysis.
```
:::
###### Make tables or plots to explore the data visually. {-}
::: {.answer}
```{r}
# Add code here to make tables or plots.
```
:::
##### Hypotheses {-}
###### Identify the sample (or samples) and a reasonable population (or populations) of interest. {-}
::: {.answer}
Please write up your answer here.
:::
###### Express the null and alternative hypotheses as contextually meaningful full sentences. {-}
::: {.answer}
$H_{0}:$ Null hypothesis goes here.
$H_{A}:$ Alternative hypothesis goes here.
:::
###### Express the null and alternative hypotheses in symbols (when possible). {-}
::: {.answer}
$H_{0}: math$
$H_{A}: math$
:::
##### Model {-}
###### Identify the sampling distribution model. {-}
::: {.answer}
Please write up your answer here.
:::
###### Check the relevant conditions to ensure that model assumptions are met. {-}
::: {.answer}
Please write up your answer here. (Some conditions may require R code as well.)
:::
##### Mechanics {-}
###### Compute the test statistic. {-}
::: {.answer}
```{r}
# Add code here to compute the test statistic.
```
:::
###### Report the test statistic in context (when possible). {-}
::: {.answer}
Please write up your answer here.
:::
###### Plot the null distribution. {-}
::: {.answer}
```{r}
# IF CONDUCTING A SIMULATION...
set.seed(1)
# Add code here to simulate the null distribution.
```
```{r}
# Add code here to plot the null distribution.
```
:::
###### Calculate the P-value. {-}
::: {.answer}
```{r}
# Add code here to calculate the P-value.
```
:::
###### Interpret the P-value as a probability given the null. {-}
::: {.answer}
Please write up your answer here.
:::
##### Conclusion {-}
###### State the statistical conclusion. {-}
::: {.answer}
Please write up your answer here.
:::
###### State (but do not overstate) a contextually meaningful conclusion. {-}
::: {.answer}
Please write up your answer here.
:::
###### Express reservations or uncertainty about the generalizability of the conclusion. {-}
::: {.answer}
Please write up your answer here.
:::
###### Identify the possibility of either a Type I or Type II error and state what making such an error means in the context of the hypotheses. {-}
::: {.answer}
Please write up your answer here.
:::
##### Confidence interval {-}
###### Check the relevant conditions to ensure that model assumptions are met. {-}
::: {.answer}
Please write up your answer here. (Some conditions may require R code as well.)
:::
###### Calculate and graph the confidence interval. {-}
::: {.answer}
```{r}
# Add code here to calculate the confidence interval.
```
```{r}
# Add code here to graph the confidence interval.
```
:::
###### State (but do not overstate) a contextually meaningful interpretation. {-}
::: {.answer}
Please write up your answer here.
:::
###### If running a two-sided test, explain how the confidence interval reinforces the conclusion of the hypothesis test. [Not always applicable.] {-}
::: {.answer}
Please write up your answer here.
:::
###### When comparing two groups, comment on the effect size and the practical significance of the result. [Not always applicable.] {-}
::: {.answer}
Please write up your answer here.
:::
## Conclusion
Paired data occurs whenever we have two numerical measurements that are related to each other, whether because they come from the same observational units or from closely related ones. When our data is structured as pairs of measurements in this way, we can subtract the two columns and obtain a difference. That difference variable is the object of our study, and now that it is represented as a single numerical variable, we can apply the one-sample t test from the last chapter.
### Preparing and submitting your assignment
1. From the "Run" menu, select "Restart R and Run All Chunks".
2. Deal with any code errors that crop up. Repeat steps 1--2 until there are no more code errors.
3. Spell check your document by clicking the icon with "ABC" and a check mark.
4. Hit the "Preview" button one last time to generate the final draft of the `.nb.html` file.
5. Proofread the HTML file carefully. If there are errors, go back and fix them, then repeat steps 1--5.
If you have completed this chapter as part of a statistics course, follow the directions you receive from your professor to submit your assignment.