index.qmd

---
title: "Introduction to Statistics"
author: "Nicola Rennie"
format:
  revealjs:
    theme: [custom.scss]
    auto-stretch: false
filters:
- webr
---

# Descriptive statistics {background-color="#1A936F"}

Descriptive statistics provide a summary that quantitatively describes a sample of data.

```{r}
#| label: setup
#| echo: false
#| eval: true
#| message: false
library(tidyverse)
library(emojifont)
library(showtext)
library(reactable)
library(kableExtra)
font_add_google("Ubuntu", "Ubuntu")
showtext_auto()
set.seed(1234)
population_df = tibble(ID = 1:200, 
                       x = rep(1:20, times = 10),
                       y = rep(1:10, each = 20),
                       Value = rpois(200, 250))
sample_size = 10
sample_ids = sample(1:200, size = sample_size, replace = FALSE)
sample_df =  filter(population_df, ID %in% sample_ids)
```

## Population

**Population** refers to the entire group of individuals that we want to draw conclusions about. 

```{r}
#| label: pop-people
#| eval: true
#| echo: false
#| fig-align: center
#| fig-height: 4.16
ggplot() +
  geom_text(data = population_df,
            mapping = aes(x = x,
                          y = y,
                          label = fontawesome('fa-user'),
                          colour = Value),
            family='fontawesome-webfont', size = 20) +
  scale_colour_gradient(low = "#baded3", high = "#12664d") +
  labs(title = "Population: 200 people") +
  theme_void() +
  theme(legend.position = "none", 
        legend.title = element_blank(),
        plot.margin = margin(10, 10, 10, 10), 
        plot.title = element_text(face = "bold",
                                  hjust = 0.5,
                                  family = "Ubuntu",
                                  size = 36,
                                  margin = margin(b = 10)))

```

## Sample

**Sample** refers to the (usually smaller) group of people for which we have collected data on.

```{r}
#| label: samp-people
#| eval: true
#| echo: false
#| fig-align: center
#| fig-height: 4.16
ggplot() +
  geom_text(data = population_df,
            mapping = aes(x = x,
                          y = y,
                          label = fontawesome('fa-user')),
            family='fontawesome-webfont', size = 20, colour = "grey") +
  geom_text(data = sample_df,
            mapping = aes(x = x,
                          y = y,
                          label = fontawesome('fa-user'),
                          colour = Value),
            family='fontawesome-webfont', size = 20) +
  scale_colour_gradient(low = "#baded3", high = "#12664d") +
  labs(title = glue::glue("Sample: {sample_size} people")) +
  theme_void() +
  theme(legend.position = "none", 
        legend.title = element_blank(),
        plot.margin = margin(10, 10, 10, 10), 
        plot.title = element_text(face = "bold",
                                  hjust = 0.5,
                                  family = "Ubuntu",
                                  size = 36,
                                  margin = margin(b = 10)))

```

## Generate sample data {.scrollable}

For the examples later, let's create a population of data in R...:

```{webr-r}
# Generate population data
set.seed(1234)
population = rpois(200, 250)
print("Population generated!")
```

## Generate sample data {.scrollable}

... and draw a sample from it:

```{webr-r}
# Pick a sample
set.seed(1234)
sample_size = 10
sample_data = sample(population, size = sample_size, replace = FALSE)
print("You've created a sample of data!")
```

::: {.fragment}

What do the values look like?

```{webr-r}
sample_data
```

:::

## Mean

The mean, often simply called the *average*, is defined as *the sum of all values divided by the number of values*. It's a measure of central tendency that tells us what's happening near the middle of the data.

::::{style='text-align: center;'}

$\bar{x} = \frac{1}{n} \sum_{i=i}^{n} x_{i}$

::::

::: {.fragment}

In R, we use the `mean()` function:

```{webr-r}
# Calculate mean
mean(sample_data)
```

:::

## Median

The median of a dataset is the middle value when the data is arranged in ascending order, or the average of the two middle values if the dataset has an even number of observations.

::: {.fragment}

In R, we use the `median()` function:

```{webr-r}
# Calculate median
median(sample_data)
```

:::

## Mode

The mode statistic represents the value that appears most frequently in a dataset.

::: {.fragment}

In R, there is no `mode()` function. Instead, we count how many of each value there are and choose the one with the highest number:

```{webr-r}
# Count, sort and extract first element
names(sort(table(sample_data), decreasing = TRUE)[1])
```

:::

## Range

The range is the difference between the maximum and minimum values in a dataset.

::: {.fragment}

In R, we can use the `max()` and `min()` function and subtract the values:

```{webr-r}
# Subtract max and min values
max(sample_data) - min(sample_data)
```

Note that the `range()` function returns the minimum and maximum, not a single value:

```{webr-r}
# Calculate range
range(sample_data)
```

:::

## Sample variance

The sample variance tells us about how spread out the data is. A lower variance indicates that values tend to be close to the mean, and a higher variance indicates that the values are spread out over a wider range.

::::{style='text-align: center;'}

$s^2 = \frac{\Sigma_{i= 1}^{N} (x_i - \bar{x})^2}{n-1}$

::::

::: {.fragment}

In R, we use the `var()` function:

```{webr-r}
# Calculate variance
var(sample_data)
```

:::

## Sample standard deviation

The sample standard deviation is the square root of the variance. It also tells us about how spread out the data is. 

::::{style='text-align: center;'}

$s = \sqrt{\frac{\Sigma_{i= 1}^{N} (x_i - \bar{x})^2}{n-1}}$

::::

::: {.fragment}

In R, we use the `sd()` function:

```{webr-r}
# Calculate standard deviation
sd(sample_data)
```

:::

## Descriptive statistics {.smaller}

Descriptive statistics provide a summary that quantitatively describes a sample of data.

* Mean: The sum of the values divided by the number of values.
* Median: The middle value of the data when it's sorted.
* Mode: The value that appears most frequently.
* Range: The difference between the maximum and minimum values.
* Variance: The average of the squared differences from the mean.
* Standard deviation: The square root of the variance.

## Exercise

In R:

* Load the `ames` housing data set using `data(ames, package = "modeldata")`
* Calculate the mean, median, mode, range, variance, and standard deviation of house prices (the `Sale_Price` column).

> Remember: you can extract a column in R using `dataset$column_name`.

## Exercise solutions

```{r}
#| echo: true
# load data
data(ames, package = "modeldata")

# summary statistics
mean(ames$Sale_Price)
median(ames$Sale_Price)
names(sort(table(ames$Sale_Price), decreasing = TRUE)[1])
max(ames$Sale_Price) - min(ames$Sale_Price)
var(ames$Sale_Price)
sd(ames$Sale_Price)
```

# Questions? {background-color="#1A936F"}