---
title: "Introduction to Statistics"
author: "Nicola Rennie"
format:
  revealjs:
    theme: [custom.scss]
    auto-stretch: false
filters:
  - webr
---
# Descriptive statistics {background-color="#1A936F"}
Descriptive statistics provide a summary that quantitatively describes a sample of data.
```{r}
#| label: setup
#| echo: false
#| eval: true
#| message: false
library(tidyverse)
library(emojifont)
library(showtext)
library(reactable)
library(kableExtra)
font_add_google("Ubuntu", "Ubuntu")
showtext_auto()
set.seed(1234)
population_df = tibble(ID = 1:200,
                       x = rep(1:20, times = 10),
                       y = rep(1:10, each = 20),
                       Value = rpois(200, 250))
sample_size = 10
sample_ids = sample(1:200, size = sample_size, replace = FALSE)
sample_df = filter(population_df, ID %in% sample_ids)
```
## Population
**Population** refers to the entire group of individuals that we want to draw conclusions about.
```{r}
#| label: pop-people
#| eval: true
#| echo: false
#| fig-align: center
#| fig-height: 4.16
ggplot() +
  geom_text(data = population_df,
            mapping = aes(x = x,
                          y = y,
                          label = fontawesome('fa-user'),
                          colour = Value),
            family = 'fontawesome-webfont', size = 20) +
  scale_colour_gradient(low = "#baded3", high = "#12664d") +
  labs(title = "Population: 200 people") +
  theme_void() +
  theme(legend.position = "none",
        legend.title = element_blank(),
        plot.margin = margin(10, 10, 10, 10),
        plot.title = element_text(face = "bold",
                                  hjust = 0.5,
                                  family = "Ubuntu",
                                  size = 36,
                                  margin = margin(b = 10)))
```
## Sample
**Sample** refers to the (usually smaller) group of individuals for which we have actually collected data.
```{r}
#| label: samp-people
#| eval: true
#| echo: false
#| fig-align: center
#| fig-height: 4.16
ggplot() +
  geom_text(data = population_df,
            mapping = aes(x = x,
                          y = y,
                          label = fontawesome('fa-user')),
            family = 'fontawesome-webfont', size = 20, colour = "grey") +
  geom_text(data = sample_df,
            mapping = aes(x = x,
                          y = y,
                          label = fontawesome('fa-user'),
                          colour = Value),
            family = 'fontawesome-webfont', size = 20) +
  scale_colour_gradient(low = "#baded3", high = "#12664d") +
  labs(title = glue::glue("Sample: {sample_size} people")) +
  theme_void() +
  theme(legend.position = "none",
        legend.title = element_blank(),
        plot.margin = margin(10, 10, 10, 10),
        plot.title = element_text(face = "bold",
                                  hjust = 0.5,
                                  family = "Ubuntu",
                                  size = 36,
                                  margin = margin(b = 10)))
```
## Generate sample data {.scrollable}
For the examples that follow, let's create a population of data in R...
```{webr-r}
# Generate population data
set.seed(1234)
population = rpois(200, 250)
print("Population generated!")
```
## Generate sample data {.scrollable}
... and draw a sample from it:
```{webr-r}
# Pick a sample
set.seed(1234)
sample_size = 10
sample_data = sample(population, size = sample_size, replace = FALSE)
print("You've created a sample of data!")
```
::: {.fragment}
What do the values look like?
```{webr-r}
sample_data
```
:::
## Mean
The mean, often simply called the *average*, is defined as *the sum of all values divided by the number of values*. It's a measure of central tendency that tells us what's happening near the middle of the data.
::::{style='text-align: center;'}
$\bar{x} = \frac{1}{n} \sum_{i=1}^{n} x_{i}$
::::
::: {.fragment}
In R, we use the `mean()` function:
```{webr-r}
# Calculate mean
mean(sample_data)
```
:::
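::: {.fragment}
As a rough check of the formula, the sketch below computes the mean by hand and compares it with `mean()`. The vector `x` holds made-up values for illustration; it is not the `sample_data` drawn earlier.

```r
# Hypothetical example values (an assumption, not the slides' sample)
x = c(240, 252, 247, 261, 250)

# Sum of all values divided by the number of values
manual_mean = sum(x) / length(x)
manual_mean  # 250

# The built-in function should agree
all.equal(manual_mean, mean(x))
```
:::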
## Median
The median of a dataset is the middle value when the data is arranged in ascending order, or the average of the two middle values if the dataset has an even number of observations.
::: {.fragment}
In R, we use the `median()` function:
```{webr-r}
# Calculate median
median(sample_data)
```
:::
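::: {.fragment}
To see the odd and even cases side by side, here is a minimal sketch that finds the median by sorting and indexing. The vectors `odd_values` and `even_values` are made-up illustrations, not part of the sample above.

```r
# Hypothetical example values (assumptions for illustration)
odd_values = c(7, 3, 9, 1, 5)
even_values = c(7, 3, 9, 1, 5, 11)

# Odd number of observations: take the single middle value
sorted_odd = sort(odd_values)
median_odd = sorted_odd[(length(sorted_odd) + 1) / 2]
median_odd  # 5

# Even number of observations: average the two middle values
sorted_even = sort(even_values)
n = length(sorted_even)
median_even = mean(sorted_even[c(n / 2, n / 2 + 1)])
median_even  # 6
```

Both results should match what `median()` returns for the same vectors.
:::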
## Mode
The mode is the value that appears most frequently in a dataset.
::: {.fragment}
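In R, the built-in `mode()` function returns the storage mode of an object (e.g. `"numeric"`), not the statistical mode. Instead, we count how many times each value occurs and take the most frequent one. Note that the result is returned as a character string, and if several values tie only one of them is returned: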
```{webr-r}
# Count, sort and extract first element
names(sort(table(sample_data), decreasing = TRUE)[1])
```
:::
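::: {.fragment}
When several values tie for the highest count, one option is to return all of them. The sketch below is one possible approach using made-up data in `values`; it is an illustration rather than a standard R function.

```r
# Hypothetical example values where 3 and 5 both appear twice
values = c(3, 5, 5, 2, 3, 8)

# Count how often each distinct value occurs
counts = table(values)

# Keep every value whose count equals the highest count
modes = as.numeric(names(counts)[counts == max(counts)])
modes  # 3 5
```

Converting with `as.numeric()` turns the table names back into numbers.
:::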
## Range
The range is the difference between the maximum and minimum values in a dataset.
::: {.fragment}
In R, we can use the `max()` and `min()` functions and subtract the values:
```{webr-r}
# Subtract the minimum from the maximum
max(sample_data) - min(sample_data)
```
Note that the `range()` function returns the minimum and maximum, not a single value:
```{webr-r}
# Calculate range
range(sample_data)
```
:::
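::: {.fragment}
Because `range()` returns the minimum followed by the maximum, wrapping it in `diff()` is another way to get the range as a single number. A small sketch with made-up values in `x`:

```r
# Hypothetical example values (an assumption, not the slides' sample)
x = c(240, 252, 247, 261, 250)

# range(x) gives c(min, max); diff() subtracts the first from the second
range_width = diff(range(x))
range_width  # 21

# Same result as subtracting explicitly
range_width == max(x) - min(x)
```
:::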
## Sample variance
The sample variance measures how spread out the data is. A lower variance indicates that values tend to be close to the mean, and a higher variance indicates that the values are spread out over a wider range.
::::{style='text-align: center;'}
$s^2 = \frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}$
::::
::: {.fragment}
In R, we use the `var()` function:
```{webr-r}
# Calculate variance
var(sample_data)
```
:::
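::: {.fragment}
The formula can be followed step by step in R. The sketch below uses made-up values in `x` and divides by `n - 1`, which is also what `var()` does.

```r
# Hypothetical example values (an assumption for illustration)
x = c(2, 4, 4, 6, 9)
n = length(x)

# Step 1: the sample mean
x_bar = mean(x)  # 5

# Step 2: squared differences from the mean
squared_diffs = (x - x_bar)^2  # 9 1 1 1 16

# Step 3: sum them and divide by n - 1
manual_var = sum(squared_diffs) / (n - 1)
manual_var  # 7

# The built-in function should agree
all.equal(manual_var, var(x))
```
:::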
## Sample standard deviation
The sample standard deviation is the square root of the sample variance. It also measures how spread out the data is, but in the same units as the data itself.
::::{style='text-align: center;'}
$s = \sqrt{\frac{\sum_{i=1}^{n} (x_i - \bar{x})^2}{n-1}}$
::::
::: {.fragment}
In R, we use the `sd()` function:
```{webr-r}
# Calculate standard deviation
sd(sample_data)
```
:::
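::: {.fragment}
To confirm the relationship with the variance, this short sketch takes the square root of `var()` and compares it with `sd()`. The values in `x` are made up for illustration.

```r
# Hypothetical example values (an assumption for illustration)
x = c(2, 4, 4, 6, 9)

# Square root of the sample variance
manual_sd = sqrt(var(x))

# The built-in function should agree
all.equal(manual_sd, sd(x))
```
:::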
## Descriptive statistics {.smaller}
Descriptive statistics provide a summary that quantitatively describes a sample of data.
* Mean: The sum of the values divided by the number of values.
* Median: The middle value of the data when it's sorted.
* Mode: The value that appears most frequently.
* Range: The difference between the maximum and minimum values.
* Variance: The sum of the squared differences from the mean, divided by $n - 1$.
* Standard deviation: The square root of the variance.
## Exercise
In R:
* Load the `ames` housing dataset using `data(ames, package = "modeldata")`.
* Calculate the mean, median, mode, range, variance, and standard deviation of house prices (the `Sale_Price` column).
> Remember: you can extract a column in R using `dataset$column_name`.
## Exercise solutions
```{r}
#| echo: true
# load data
data(ames, package = "modeldata")
# summary statistics
mean(ames$Sale_Price)
median(ames$Sale_Price)
names(sort(table(ames$Sale_Price), decreasing = TRUE)[1])
max(ames$Sale_Price) - min(ames$Sale_Price)
var(ames$Sale_Price)
sd(ames$Sale_Price)
```
# Questions? {background-color="#1A936F"}