---
title: "Practical_3"
author: "Pietro Franceschi"
date: "03/10/2019"
output: html_document
---

## Loads

```{r}
## libraries
library(tidyverse)
library(readr)
```


```{r}
## Olympic games data
athl <- read_csv("data/athlete_events.csv")
```

How many specialities do we have?

```{r}
## number of specialities
length(unique(athl$Event))

## number of records (one row per athlete per event), not the number of athletes
nrow(athl)
```

```{r}
## number of distinct athlete names
length(unique(athl$Name))
```
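
Names are not guaranteed to be unique identifiers, since two athletes can share a name. The sketch below uses a made-up toy table, and its `ID` column is an assumption about how athletes might be keyed. It shows how the number of records, of distinct names and of distinct athletes can differ, using `n_distinct()` from dplyr.

```{r}
library(dplyr)

## toy data (made up for illustration): two different athletes share a name
toy <- tibble(
  ID    = c(1, 1, 2, 3),
  Name  = c("A. Rossi", "A. Rossi", "A. Rossi", "B. Bianchi"),
  Event = c("100 metres", "200 metres", "Long Jump", "100 metres")
)

n_rows     <- nrow(toy)             # number of records
n_names    <- n_distinct(toy$Name)  # same as length(unique(toy$Name))
n_athletes <- n_distinct(toy$ID)    # distinct athlete identifiers

c(rows = n_rows, names = n_names, athletes = n_athletes)
```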


In athletics, can we see a change in the age of the athletes over time?

```{r}
## let's get the list of the sports
unique(athl$Sport)
```


```{r}
athl %>% 
  filter(Sport == "Athletics") %>% 
  ggplot() + 
  geom_jitter(aes(x = Year, y = Age, col = Sex), width = 0.2, size = 0.1) +
  scale_color_brewer(palette = "Set1", name = "Gender") + 
  facet_wrap(~Sex, ncol = 1) + 
  theme_light()
  
```

This already gives an interesting result: the age of the participants appears to be rising, mainly among female athletes.


```{r}
athl %>% 
  filter(Sport == "Athletics") %>% 
  filter(Medal %in% c("Gold","Silver","Bronze")) %>% 
  ggplot() + 
  geom_jitter(aes(x = Year, y = Age, col = Sex), width = 0.2, size = 0.1) +
  geom_smooth(aes(x = Year, y = Age), method = lm) + 
  scale_color_brewer(palette = "Set1", name = "Gender") + 
  facet_wrap(~Sex, ncol = 1) + 
  theme_light()
```

## Grouping and Summarising

As an easy starting point, we use the iris dataset.

```{r}
## get the data
data(iris)
```

Now we group it ...

```{r}
iris %>% 
  group_by(Species)
```

Apparently nothing has changed ... the values are the same, but the table now carries the grouping information.
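
The grouping is stored as metadata on the table rather than in its values. A minimal sketch of how to inspect it with dplyr's helper functions (the object name `grouped` is only illustrative):

```{r}
library(dplyr)

grouped <- iris %>% 
  group_by(Species)

is_grouped_df(grouped)   # TRUE: the table knows it is grouped
group_vars(grouped)      # name of the grouping variable(s)
n_groups(grouped)        # number of groups

## the values themselves are untouched
identical(grouped$Sepal.Length, iris$Sepal.Length)
```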

```{r}
## This computes the mean per group "on the fly"!

mytable <- iris %>% 
  group_by(Species) %>% 
  summarise(sep = mean(Sepal.Length))

## remember: if we do not assign the result of the pipe to an object, the output is only printed!
```


```{r}
## This computes the mean and the sample count per group "on the fly"!
iris %>% 
  group_by(Species) %>% 
  summarise(sep = mean(Sepal.Length), 
            nsamp = length(Sepal.Length))
```

Multiple columns can be summarised by using `summarise_at`

```{r}
iris %>% 
  group_by(Species) %>% 
  summarise_at(vars(`Sepal.Length`:`Petal.Width`),
               .funs = list(sep = mean, 
                            nsamp = length))
```
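
In dplyr 1.0.0 and later, `summarise_at` is superseded by `across()` used inside `summarise`. A minimal sketch of the same idea, assuming a recent dplyr is installed (the object name `iris_summary` is only illustrative):

```{r}
library(dplyr)

iris_summary <- iris %>% 
  group_by(Species) %>% 
  summarise(across(Sepal.Length:Petal.Width,
                   list(mean = mean, nsamp = length)))

## by default the output columns are named <column>_<function name>
names(iris_summary)
```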

This type of syntax can be challenging ... the alternative is to rely on `pivot_longer` ;-)


```{r}
iris %>% 
  pivot_longer(-Species, names_to = "parameter", values_to = "value") %>% 
  group_by(parameter,Species) %>% 
  summarise(mean = mean(value), 
            sd = sd(value))
  
```


Or, even more intuitively, keep adding columns within `summarise` ...
```{r}
iris %>% 
  group_by(Species) %>% 
  summarise(mean.sl=mean(Sepal.Length), mean.sw=mean(Sepal.Width))
```


Rank the athletes by the number of gold medals won in a single edition of the Summer Games (keeping only those with more than three)
```{r}
athl %>% 
  filter(Season == "Summer") %>% 
  filter(Medal %in% c("Gold")) %>% 
  group_by(Year,Name,Sport) %>% 
  summarise(medals = length(Medal)) %>% 
  ungroup() %>% 
  filter(medals > 3) %>% 
  arrange(desc(medals))
```
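
The `group_by` + `summarise` + `ungroup` sequence used for counting can also be written with dplyr's `count()`, which returns one row per combination together with its frequency. A minimal sketch on made-up toy data (the table `golds` and its values are purely illustrative):

```{r}
library(dplyr)

## toy data (made up): one row per gold medal
golds <- tibble(
  Year  = c(2000, 2000, 2000, 2004, 2004),
  Name  = c("X", "X", "Y", "X", "Y"),
  Sport = c("Swimming", "Swimming", "Swimming", "Swimming", "Swimming")
)

## count the rows for each Year/Name/Sport combination, then sort
ranking <- golds %>% 
  count(Year, Name, Sport, name = "medals") %>% 
  arrange(desc(medals))

ranking
```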

Plotting the average age of the male athletes in Athletics over the years ...

```{r}
athl %>% 
  filter(Sport == "Athletics") %>% 
  filter(Sex == "M") %>% 
  group_by(Year) %>% 
  summarise(avg_age = mean(Age, na.rm = TRUE)) %>% 
  ungroup() %>% 
  ggplot() + 
  geom_line(aes(x = Year, y = avg_age), col = "steelblue") + 
  geom_point(aes(x = Year, y = avg_age), col = "orange", alpha = 0.5, size = 3) + 
  ylab("Average Athlete Age") + 
  theme_light()
```


## Code Hints for Assignment #9

```{r}
## If you calculate the mean of a vector containing NAs, you will get NA by default
## to exclude the NAs you have to add an argument to the mean() function
mean(c(1,4,5,6,NA), na.rm = TRUE)
```


```{r}
## The list of events is quite long ... to get the "right" name for an event you need to search it
eventlist <- unique(athl$Event)

length(eventlist)
```

```{r}
## the function grep("string", vector) returns the indices of the elements of the vector
## that contain "string"

id400 <- grep("400", eventlist)
id400


## prints the event names which contain "400"
eventlist[id400]

```
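
The two steps above (find the indices, then subset) can be combined with the `value = TRUE` argument of `grep`. A small sketch on a made-up vector of event names (the names in `events` are illustrative, not taken from the data):

```{r}
## toy vector of event names (made up for illustration)
events <- c("Athletics Men's 400 metres",
            "Athletics Men's 4 x 400 metres Relay",
            "Swimming Women's 400 metres Freestyle",
            "Athletics Men's 100 metres")

## value = TRUE returns the matching strings instead of their positions
grep("400", events, value = TRUE)

## a more specific pattern narrows the search;
## fixed = TRUE matches the text literally instead of as a regular expression
hits <- grep("Athletics Men's 400", events, value = TRUE, fixed = TRUE)
hits
```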


```{r}
focusevent <- c("Athletics Men's 400 metres", "Athletics Men's 200 metres")
```
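
One plausible way to use such a vector is to keep only the events of interest with `%in%` and then summarise per event. The sketch below runs on a made-up stand-in for `athl` (the table `toy_athl` and its values are hypothetical); the actual assignment may ask for something different.

```{r}
library(dplyr)

focusevent <- c("Athletics Men's 400 metres", "Athletics Men's 200 metres")

## toy stand-in for athl (made-up values)
toy_athl <- tibble(
  Event = c("Athletics Men's 400 metres", "Athletics Men's 200 metres",
            "Athletics Men's 100 metres", "Athletics Men's 400 metres"),
  Age   = c(24, 26, 22, NA)
)

## keep the events of interest and compute the average age, ignoring NAs
focus_age <- toy_athl %>% 
  filter(Event %in% focusevent) %>% 
  group_by(Event) %>% 
  summarise(avg_age = mean(Age, na.rm = TRUE))

focus_age
```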