-
Notifications
You must be signed in to change notification settings - Fork 7
/
Copy path17-chi_square_goodness_of_fit-web.Rmd
718 lines (438 loc) · 27 KB
/
17-chi_square_goodness_of_fit-web.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289
290
291
292
293
294
295
296
297
298
299
300
301
302
303
304
305
306
307
308
309
310
311
312
313
314
315
316
317
318
319
320
321
322
323
324
325
326
327
328
329
330
331
332
333
334
335
336
337
338
339
340
341
342
343
344
345
346
347
348
349
350
351
352
353
354
355
356
357
358
359
360
361
362
363
364
365
366
367
368
369
370
371
372
373
374
375
376
377
378
379
380
381
382
383
384
385
386
387
388
389
390
391
392
393
394
395
396
397
398
399
400
401
402
403
404
405
406
407
408
409
410
411
412
413
414
415
416
417
418
419
420
421
422
423
424
425
426
427
428
429
430
431
432
433
434
435
436
437
438
439
440
441
442
443
444
445
446
447
448
449
450
451
452
453
454
455
456
457
458
459
460
461
462
463
464
465
466
467
468
469
470
471
472
473
474
475
476
477
478
479
480
481
482
483
484
485
486
487
488
489
490
491
492
493
494
495
496
497
498
499
500
501
502
503
504
505
506
507
508
509
510
511
512
513
514
515
516
517
518
519
520
521
522
523
524
525
526
527
528
529
530
531
532
533
534
535
536
537
538
539
540
541
542
543
544
545
546
547
548
549
550
551
552
553
554
555
556
557
558
559
560
561
562
563
564
565
566
567
568
569
570
571
572
573
574
575
576
577
578
579
580
581
582
583
584
585
586
587
588
589
590
591
592
593
594
595
596
597
598
599
600
601
602
603
604
605
606
607
608
609
610
611
612
613
614
615
616
617
618
619
620
621
622
623
624
625
626
627
628
629
630
631
632
633
634
635
636
637
638
639
640
641
642
643
644
645
646
647
648
649
650
651
652
653
654
655
656
657
658
659
660
661
662
663
664
665
666
667
668
669
670
671
672
673
674
675
676
677
678
679
680
681
682
683
684
685
686
687
688
689
690
691
692
693
694
695
696
697
698
699
700
701
702
703
704
705
706
707
708
709
710
711
712
713
714
715
716
# Chi-square goodness-of-fit test
<!-- Please don't mess with the next few lines! -->
<style>h5{font-size:2em;color:#0000FF}h6{font-size:1.5em;color:#0000FF}div.answer{margin-left:5%;border:1px solid #0000FF;border-left-width:10px;padding:25px} div.summary{background-color:rgba(30,144,255,0.1);border:3px double #0000FF;padding:25px}</style>`r options(scipen=999)`<p style="color:#ffffff">`r intToUtf8(c(50,46,48))`</p>
<!-- Please don't mess with the previous few lines! -->
::: {.summary}
### Functions introduced in this chapter: {-}
`chisq.test`
:::
## Introduction
In this assignment we will learn how to run the chi-square goodness-of-fit test. A chi-square goodness-of-fit test is similar to a test for a single proportion except, instead of two categories (success/failure), we now try to understand the distribution among three or more categories.
### Install new packages
There are no new packages used in this chapter.
### Download the R notebook file
Check the upper-right corner in RStudio to make sure you're in your `intro_stats` project. Then click on the following link to download this chapter as an R notebook file (`.Rmd`).
<a href = "https://vectorposse.github.io/intro_stats/chapter_downloads/17-chi_square_goodness_of_fit.Rmd" download>https://vectorposse.github.io/intro_stats/chapter_downloads/17-chi_square_goodness_of_fit.Rmd</a>
Once the file is downloaded, move it to your project folder in RStudio and open it there.
### Restart R and run all chunks
In RStudio, select "Restart R and Run All Chunks" from the "Run" menu.
## Load packages
We load the standard `tidyverse`, `janitor`, and `infer` packages and the `openintro` package for the `hsb2` data.
```{r}
library(tidyverse)
library(janitor)
library(infer)
library(openintro)
```
## Research question
We use a classic data set `mtcars` from a 1974 Motor Trend magazine to examine the distribution of the number of engine cylinders (with values 4, 6, or 8). We'll assume that this data set is representative of all cars from 1974.
In recent years, 4-cylinder vehicles and 6-cylinder vehicles have comprised about 38% of the market each, with nearly all the rest (24%) being 8-cylinder cars. (This ignores a very small number of cars manufactured with 3- or 5-cylinder engines.) Were car engines in 1974 manufactured according to the same distribution?
Here is the structure of the data:
```{r}
glimpse(mtcars)
```
Note that the variable of interest `cyl` is not coded as a factor variable. Let's convert `cyl` to a factor variable first and add it to a new data frame called `mtcars2`. (Since the levels are already called 4, 6, and 8, we do not need to specify `levels` or `labels`.) Be sure to remember to use `mtcars2` from here on out, and not the original `mtcars`.
```{r}
mtcars2 <- mtcars %>%
mutate(cyl_fct = factor(cyl))
mtcars2
```
```{r}
glimpse(mtcars2)
```
## Chi-squared
When we have three or more categories in a categorical variable, it is natural to ask how the observed counts in each category compare to the counts that we expect to see under the assumption of some null hypothesis. In other words, we're assuming that there is some "true" distribution to which we are going to compare our data. Sometimes, this null comes from substantive expert knowledge. (For example, we will be comparing the 1974 distribution to a known distribution from recent years.) Sometimes we're interested to see if our data deviates from a null distribution that predicts an equal number of observations in each category.
First of all, what is the actual distribution of cylinders in our data? Here's a frequency table.
```{r}
tabyl(mtcars2, cyl_fct) %>%
adorn_totals() %>%
adorn_pct_formatting()
```
The counts of our frequency table are the "observed" values, usually denoted by the letter $O$ (uppercase "O", which is a little unfortunate, because it also looks like a zero).
What are the expected counts? Well, since there are 32 cars, we need to multiply 32 by the percentages listed in the research question. For 4-cylinder and 6-cylinder cars, if the distribution of engines in 1974 were the same as today, there would be $32 * 0.38$ or about 12.2 cars we would expect to see in our sample that have 4-cylinder engines, and the same for 6-cylinder cars. For 8-cylinder cars, we expect $32 * 0.24$ or about 7.7 cars in our sample to have 8-cylinder engines. These "expected" counts are usually denoted by the letter $E$.
Why aren't the expected counts whole numbers? In any given data set, of course, we will see a whole number of cars with 4, 6, or 8 cylinders. However, since we're looking only at expected counts, they are the average over lots of possible sets of 32 cars under the assumption of the null. We don't need for these averages to be whole numbers.
How should the deviation between the data and the null distribution be measured? We could simply look at the difference between the observed counts and the expected counts $O - E$. However, there will be some positive values (cells where we have more than the expected number of cars) and some negative values (cells where we have fewer than the expected number of cars). These will all cancel out.
If this sounds vaguely familiar, it is because we encountered the same problem with the formula for the standard deviation. The differences $y - \bar{y}$ had the same issue. Do you recall the solution in that case? It was to square these values, making them all positive.
So instead of $O - E$, we will consider $(O - E)^{2}$. Finally, to make sure that cells with large expected values don't dominate, we divide by $E$:
$$
\frac{(O - E)^{2}}{E}.
$$
This puts each cell on equal footing. Now that we have a reasonable measure of the deviation between observed and expected counts for each cell, we define $\chi^{2}$ ("chi-squared", pronounced "kye-squared"---rhymes with "die-scared", or if that's too dark, how about "pie-shared"^[Rhyming is fun!]) as the sum of all these fractions, one for each cell:
$$
\chi^{2} = \sum \frac{(O - E)^{2}}{E}.
$$
A $\chi^{2}$ value of zero would indicate perfect agreement between observed and expected values. As the $\chi^{2}$ value gets larger and larger, this indicates more and more deviation between observed and expected values.
As an example, for our data, we calculate chi-squared as follows:
$$
\chi^{2} = \frac{(11 - 12.2)^{2}}{12.2} + \frac{(7 - 12.2)^{2}}{12.2} + \frac{(14 - 7.7)^{2}}{7.7} \approx 7.5.
$$
Or we could just do it in R with the `infer` package. To do so, we have to state explicitly the proportions that correspond to the null hypothesis. In this case, since the order of entries in the frequency table is 4-cylinder, 6-cylinder, then 8-cylinder, we need to give `infer` a vector of entries `c("4" = 0.38, "6" = 0.38, "8" = 0.24)` that represents the 38%, 38%, and 24% expected for 4, 6, and 8 cylinders respectively.
```{r}
obs_chisq <- mtcars2 %>%
specify(response = cyl_fct) %>%
hypothesize(null = "point",
p = c("4" = 0.38,
"6" = 0.38,
"8" = 0.24)) %>%
calculate(stat = "chisq")
obs_chisq
```
## The chi-square distribution
We know that even if the true distribution were 38%, 38%, 24%, we would not see exactly 12.2, 12.2, 7.7 in a sample of 32 cars. (In fact, the "true" distribution is physically impossible because these are not whole numbers!) So what kinds of numbers could we get?
Let's do a quick simulation to find out.
Under the assumption of the null, there should be a 38%, 38%, and 24% chance of seeing 4, 6, or 8 cylinders, respectively. To get a sense of the extent of sampling variability, we could use the `sample` command to see what happens in a sample of size 32 taken from a population where the true percentages are 38%, 38%, and 24%.
```{r}
set.seed(99999)
sample1 <- sample(c(4, 6, 8), size = 32, replace = TRUE,
prob = c(0.38, 0.38, 0.24))
sample1
```
```{r}
sample1 %>%
table()
```
```{r}
sample2 <- sample(c(4, 6, 8), size = 32, replace = TRUE,
prob = c(0.38, 0.38, 0.24))
sample2
```
```{r}
sample2 %>%
table()
```
```{r}
sample3 <- sample(c(4, 6, 8), size = 32, replace = TRUE,
prob = c(0.38, 0.38, 0.24))
sample3
```
```{r}
sample3 %>%
table()
```
We can calculate the chi-squared value for each of these samples to get a sense of the possibilities. The `chisq.test` command from base R is a little unusual because it requires a frequency table (generated from the `table` command) as input. We will never use the `chisq.test` command directly because we will always use `infer` to do this work. But just to see some examples:
```{r}
sample1 %>%
table() %>%
chisq.test()
```
```{r}
sample2 %>%
table() %>%
chisq.test()
```
```{r}
sample3 %>%
table() %>%
chisq.test()
```
##### Exercise 1 {-}
Look more carefully at the three random samples above. Why does sample 1 have a chi-squared closer to 0 while samples 2 and 3 have a chi-squared values that are a little larger? (Hint: look at the counts of 4s, 6s, and 8s in those samples. How do those counts compare to the expected number of 4s, 6s, and 8s?)
::: {.answer}
Please write up your answer here.
:::
*****
The `infer` pipeline below (the `generate` command specifically) takes the values "4", "6", or "8" and grabs them at random according to the probabilities specified until it has 32 values. In other words, it will randomly select "4" about 38% of the time, "6" about 38% of the time, and "8" about 24% of the time, until it gets a list of 32 total cars. Then it will calculate the chi-squared value for that simulated set of 32 cars. But because randomness is involved, the simulated samples are subject to sampling variability and the chi-square values obtained will differ from each other. This is exactly what we did above with the `sample` command and the `chisq` command, but the benefit now is that we get 1000 random samples very quickly.
```{r}
set.seed(99999)
cyl_test_sim <- mtcars2 %>%
specify(response = cyl_fct) %>%
hypothesize(null = "point",
p = c("4" = 0.38,
"6" = 0.38,
"8" = 0.24)) %>%
generate(reps = 1000, type = "draw") %>%
calculate(stat = "chisq")
cyl_test_sim
```
The "stat" column above contains 1000 random values of $\chi^{2}$. Let's graph these values and include the chi-squared value for our actual data in the same graph.
```{r}
cyl_test_sim %>%
visualize() +
shade_p_value(obs_chisq, direction = "greater")
```
A few things are apparent:
1. The values are all positive. (The leftmost bar is sitting at 0, but it represents values greater than zero.) This makes sense when you remember that each piece of the $\chi^{2}$ calculation was positive. This is different from our earlier simulations that looked like normal models. (Z scores can be positive or negative, but not $\chi^{2}$.)
2. This is a severely right-skewed graph. Although most values are near zero, the occasional unusual sample can have a large value of $\chi^{2}$.
3. You can see that our sample (the red line) is pretty far to the right. It is an unusual value given the assumption of the null hypothesis. In fact, we can count the proportion of sampled values that are to the right of the red line:
```{r}
cyl_test_sim %>%
get_p_value(obs_chisq, direction = "greater")
```
This is the simulated P-value. Keep this number in mind when we calculate the P-value using a sampling distribution model below.
## Chi-square as a sampling distribution model
Just like there was a mathematical model for our simulated data before (the normal model back then), there is also a mathematical model for this type of simulated data. It's called (not surprisingly) the *chi-square distribution*.
There is one new idea, though. Although all normal models have the same bell shape, there are many different chi-square models. This is because the number of cells can change the sampling distribution. Our engine cylinder example has three cells (corresponding to the categories "4", "6", and "8"). But what if there were 10 categories? The shape of the chi-square model would be different.
The terminology used by statisticians to distinguish these models is *degrees of freedom*, abbreviated $df$. The reason for this name and the mathematics behind it are somewhat technical. Suffice it to say for now that if there are $c$ cells, you use $c - 1$ degrees of freedom. For our car example, there are 3 cylinder categories, so $df = 2$.
Look at the graph below that shows the theoretical chi-square models for varying degrees of freedom.
```{r}
# Don't worry about the syntax here.
# You won't need to know how to do this on your own.
ggplot(data.frame(x = c(0, 20)), aes(x)) +
stat_function(fun = dchisq, args = list(df = 2),
aes(color = "2")) +
stat_function(fun = dchisq, args = list(df = 5),
aes(color = "5" )) +
stat_function(fun = dchisq, args = list(df = 10),
aes(color = "10")) +
scale_color_manual(name = "df",
values = c("2" = "red",
"5" = "blue",
"10" = "green"),
breaks = c("2", "5", "10"))
```
The red curve (corresponding to $df = 2$) looks a lot like our simulation above. But as the degrees of freedom increase, the mode shifts further to the right.
## Chi-square goodness-of-fit test
The formal inferential procedure for examining whether data from a categorical variable fits a proposed distribution in the population is called a *chi-square goodness-of-fit test*.
We can use the chi-square model as the sampling distribution as long as the sample size is large enough. This is checked by calculating that the expected cell counts (not the observed cell counts!) are at least 5 in each cell.
The following `infer` pipeline will run a hypothesis test using the theoretical chi-squared distribution with 2 degrees of freedom.
```{r}
cyl_test <- mtcars2 %>%
specify(response = cyl_fct) %>%
assume(distribution = "chisq")
cyl_test
```
Here is the theoretical distribution:
```{r}
cyl_test %>%
visualize()
```
And here it is will our test statistic (the chi-squared value for our observed data) marked:
```{r}
cyl_test %>%
visualize() +
shade_p_value(obs_chisq, direction = "greater")
```
Finally, here is the P-value associated with the shaded area to the right of the test statistic:
```{r}
cyl_test %>%
get_p_value(obs_chisq, direction = "greater")
```
Note that this P-value is quite similar to the P-value derived from the simulation earlier.
We'll walk through the engine cylinder example from top to bottom using the rubric. Most of this is just repeating work we've already done, but showing this work in the context of the rubric will help you as you take over in the "Your Turn" section later.
## Exploratory data analysis
### Use data documentation (help files, code books, Google, etc.) to determine as much as possible about the data provenance and structure.
Type `?mtcars` at the Console to read the help file. *Motor Trend* is a reputable publication and, therefore, we do not doubt the accuracy of the data. It's not clear, however, why these specific 32 cars were chosen and if they reflect a representative sample of cars on the road in 1974.
```{r}
mtcars
```
```{r}
glimpse(mtcars)
```
### Prepare the data for analysis.
```{r}
# Although we've already done this above,
# we include it here again for completeness.
mtcars2 <- mtcars %>%
mutate(cyl_fct = factor(cyl))
mtcars2
```
```{r}
glimpse(mtcars2)
```
### Make tables or plots to explore the data visually.
```{r}
tabyl(mtcars2, cyl_fct) %>%
adorn_totals() %>%
adorn_pct_formatting()
```
## Hypotheses
### Identify the sample (or samples) and a reasonable population (or populations) of interest.
The sample is a set of `r NROW(mtcars2)` cars from a 1974 Motor Trends magazine. The population is all cars from 1974.
### Express the null and alternative hypotheses as contextually meaningful full sentences.
$H_{0}:$ In 1974, the proportion of cars with 4, 6, and 8 cylinders was 38%, 38%, and 24%, respectively.
$H_{A}:$ In 1974, the proportion of cars with 4, 6, and 8 cylinders was not 38%, 38%, and 24%.
### Express the null and alternative hypotheses in symbols (when possible).
$H_{0}: p_{4} = 0.38, p_{6} = 0.38, p_{8} = 0.24$
There is no easy way to express the alternate hypothesis in symbols because any deviation in any of the categories can lead to rejection of the null. You can't just say $p_{4} \neq 0.38, p_{6} \neq 0.38, p_{8} \neq 0.24$ because one of these categories might have the correct proportion with the other two different and that would still be consistent with the alternative hypothesis.
So the only requirement here is to express the null in symbols.
## Model
### Identify the sampling distribution model.
We use a $\chi^{2}$ model with 2 degrees of freedom.
Commentary: Unlike the normal model, there are infinitely many different $\chi^{2}$ models, so you have to specify the degrees of freedom when you identify it as the sampling distribution model.
### Check the relevant conditions to ensure that model assumptions are met.
* Random
- We do not know how Motor Trends magazine sampled these 32 cars, so we're not sure if this list is random or representative of all cars from 1974. We should be cautious in our conclusions.
* 10%
- As long as there are at least 320 different car models, we are okay. This sounds like a lot, so this condition might not quite be met. Again, we need to be careful. (Also note that the population is not all automobiles manufactured in 1974. It is all *types* of automobile manufactured in 1974. There's a big difference.)
* Expected cell counts
- This condition says that under the null, we should see at least 5 cars in each category. The expected counts are $32(0.38) = 12.2$, $32(0.38) = 12.2$, and $32(0.24) = 7.7$. So this condition is met.
Commentary: The expected counts condition is necessary for using the theoretical chi-squared distribution. If we were using simulation instead, we would not need this condition.
## Mechanics
### Compute the test statistic.
```{r}
obs_chisq <- mtcars2 %>%
specify(response = cyl_fct) %>%
hypothesize(null = "point",
p = c("4" = 0.38,
"6" = 0.38,
"8" = 0.24)) %>%
calculate(stat = "chisq")
obs_chisq
```
### Report the test statistic in context (when possible).
The value of $\chi^{2}$ is `r obs_chisq %>% pull(1)`.
Commentary: The $\chi^{2}$ test statistic is, of course, the same value we computed manually by hand earlier. Also, the formula for $\chi^{2}$ is a complicated function of observed and expected values, making it difficult to say anything about this number in the context of cars and engine cylinders. So even though the requirement is to "report the test statistic in context," there's not much one can say here other than just to report the test statistic.
### Plot the null distribution.
```{r}
cyl_test <- mtcars2 %>%
specify(response = cyl_fct) %>%
assume(distribution = "chisq")
cyl_test
```
```{r}
cyl_test %>%
visualize() +
shade_p_value(obs_chisq, direction = "greater")
```
Commentary: We will use the theoretical distribution
### Calculate the P-value.
```{r}
cyl_test_p <- cyl_test %>%
get_p_value(obs_chisq, direction = "greater")
cyl_test_p
```
### Interpret the P-value as a probability given the null.
The P-value is `r cyl_test_p`. If the true distribution of cars in 1974 were 38% 4-cylinder, 38% 6-cylinder, and 24% 8-cylinder, there would be a `r 100 * cyl_test_p`% chance of seeing data at least as extreme as what we saw.
## Conclusion
### State the statistical conclusion.
We reject the null.
### State (but do not overstate) a contextually meaningful conclusion.
There is sufficient evidence that in 1974, the distribution of cars was not 38% 4-cylinder, 38% 6-cylinder, and 24% 8-cylinder.
### Express reservations or uncertainty about the generalizability of the conclusion.
As long as we restrict our attention to cars in 1974, we are pretty safe, although we are still uncertain if the sample we had was representative of all cars in 1974.
### Identify the possibility of either a Type I or Type II error and state what making such an error means in the context of the hypotheses.
If we made a Type I error, that would mean the true distribution of cars in 1974 was 38% 4-cylinder, 38% 6-cylinder, and 24% 8-cylinder, but our sample showed otherwise.
## Confidence interval
There is no confidence interval for a chi-square test. Since our test is not about measuring some parameter of interest (like $p$ or $p_{1} - p_{2}$), there is no interval to produce.
## Your turn
Use the `hsb2` data and determine if the proportion of high school students who attend general programs, academic programs, and vocational programs is 15%, 60%, and 25% respectively.
The rubric outline is reproduced below. You may refer to the worked example above and modify it accordingly. Remember to strip out all the commentary. That is just exposition for your benefit in understanding the steps, but is not meant to form part of the formal inference process.
Another word of warning: the copy/paste process is not a substitute for your brain. You will often need to modify more than just the names of the data frames and variables to adapt the worked examples to your own work. Do not blindly copy and paste code without understanding what it does. And you should **never** copy and paste text. All the sentences and paragraphs you write are expressions of your own analysis. They must reflect your own understanding of the inferential process.
**Also, so that your answers here don't mess up the code chunks above, use new variable names everywhere.**
##### Exploratory data analysis {-}
###### Use data documentation (help files, code books, Google, etc.) to determine as much as possible about the data provenance and structure. {-}
::: {.answer}
Please write up your answer here
```{r}
# Add code here to print the data
```
```{r}
# Add code here to glimpse the variables
```
:::
###### Prepare the data for analysis. [Not always necessary.] {-}
::: {.answer}
```{r}
# Add code here to prepare the data for analysis.
```
:::
###### Make tables or plots to explore the data visually. {-}
::: {.answer}
```{r}
# Add code here to make tables or plots.
```
:::
##### Hypotheses {-}
###### Identify the sample (or samples) and a reasonable population (or populations) of interest. {-}
::: {.answer}
Please write up your answer here.
:::
###### Express the null and alternative hypotheses as contextually meaningful full sentences. {-}
::: {.answer}
$H_{0}:$ Null hypothesis goes here.
$H_{A}:$ Alternative hypothesis goes here.
:::
###### Express the null and alternative hypotheses in symbols (when possible). {-}
::: {.answer}
$H_{0}: math$
$H_{A}: math$
:::
##### Model {-}
###### Identify the sampling distribution model. {-}
::: {.answer}
Please write up your answer here.
:::
###### Check the relevant conditions to ensure that model assumptions are met. {-}
::: {.answer}
Please write up your answer here. (Some conditions may require R code as well.)
:::
##### Mechanics {-}
###### Compute the test statistic. {-}
::: {.answer}
```{r}
# Add code here to compute the test statistic.
```
:::
###### Report the test statistic in context (when possible). {-}
::: {.answer}
Please write up your answer here.
:::
###### Plot the null distribution. {-}
::: {.answer}
```{r}
# Add code here to plot the null distribution.
```
:::
###### Calculate the P-value. {-}
::: {.answer}
```{r}
# Add code here to calculate the P-value.
```
:::
###### Interpret the P-value as a probability given the null. {-}
::: {.answer}
Please write up your answer here.
:::
##### Conclusion {-}
###### State the statistical conclusion. {-}
::: {.answer}
Please write up your answer here.
:::
###### State (but do not overstate) a contextually meaningful conclusion. {-}
::: {.answer}
Please write up your answer here.
:::
###### Express reservations or uncertainty about the generalizability of the conclusion. {-}
::: {.answer}
Please write up your answer here.
:::
###### Identify the possibility of either a Type I or Type II error and state what making such an error means in the context of the hypotheses. {-}
::: {.answer}
Please write up your answer here.
:::
## Bonus section: residuals
The chi-square test can tell us if there is some difference from the expected distribution of counts across the categories, but it doesn't tell us which category has a higher or lower count than expected. For that, we'll need to turn to another tool: *residuals*.
For technical reasons, the `infer` package doesn't provide residuals, so we'll have to turn to slightly different tools. Here's how this works; we'll return to the example of distribution of cars across the different categories of number of cylinders.
The function we'll use is called `chisq.test`. It requires us to give it input in the form of a table of counts, together with the proportions we wish to compare to:
```{r}
table(mtcars2$cyl_fct) %>%
chisq.test(p = c(.38, .38, .24)) -> cyl_chisq.test
cyl_chisq.test
```
Notice that the chi-squared value 7.5011 and the p-value 0.0235 are the same as those we calculated using `infer` tools above.
Here's how to obtain the table of residuals:
```{r}
cyl_chisq.test$residuals
```
What do these numbers mean in the real world? Not much. (Essentially, they are the values that were squared to become the individual cell contributions to the overall chi-squared score of the table.)
What we'll do with them is look for the most positive and most negative values.
- We see that the 8-cylinder column has the most positive value: this means that the number of 8-cylinder cars in 1974 was substantially *higher* than we expected.
- We see that the 6-cylinder column has the most negative value: this means that the number of 6-cylinder cars in 1974 was substantially *lower* than we expected.
### Your turn
Determine which of the high school program types is the most substantially overrepresented and the most substantially underrepresented, according to our hypothesized distribution.
::: {.answer}
```{r}
# Add code here to produce the chisq.test result.
# Add code here to examine the residuals.
```
Please write your answer here.
:::
## Conclusion
When a categorical variable has three or more categories, we can run a chi-square goodness-of-fit test to determine if the distribution of counts across those categories matches some pre-specified null hypothesis. The key new mathematical tool we need is the chi-square distribution, a way of measuring the deviation between observed counts and expected counts according to the null.
### Preparing and submitting your assignment
1. From the "Run" menu, select "Restart R and Run All Chunks".
2. Deal with any code errors that crop up. Repeat steps 1–-2 until there are no more code errors.
3. Spell check your document by clicking the icon with "ABC" and a check mark.
4. Hit the "Preview" button one last time to generate the final draft of the `.nb.html` file.
5. Proofread the HTML file carefully. If there are errors, go back and fix them, then repeat steps 1--5 again.
If you have completed this chapter as part of a statistics course, follow the directions you receive from your professor to submit your assignment.