---
title: "Tidyverse Extend"
author: "Cassandra Coste"
date: "4/28/2021"
output: html_document
---
### Tage N Singh TIDYVERSE Create
This vignette briefly explores the tidyverse using its core packages.

The tidyverse is a collection of R packages for working with data. All tidyverse packages share an underlying design philosophy, grammar, and data structures. The core packages are:

- ggplot2, which implements the grammar of graphics. You can use it to visualize your data.
- dplyr is a grammar of data manipulation. You can use it to solve the most common data manipulation challenges.
- tidyr helps you create tidy data: data where each variable is a column, each observation is a row, and each value is a cell.
- readr is a fast and friendly way to read rectangular data.
- purrr enhances R’s functional programming (FP) toolkit by providing a complete and consistent set of tools for working with functions and vectors.
- tibble is a modern reimagining of the data frame.
- stringr provides a cohesive set of functions designed to make working with strings as easy as possible.
- forcats provides a suite of useful tools that solve common problems with factors.
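
As a minimal sketch of how several of these packages combine in one pipeline, the chunk below reshapes and cleans a small made-up table. The `sales` tibble, its column names, and its values are hypothetical and are not taken from the avocado dataset used later.

```{r}
library(tibble)
library(tidyr)
library(dplyr)
library(stringr)

# Hypothetical wide-format data: one price column per year
sales <- tibble(
  city   = c("albany", "boston"),
  `2019` = c(1.2, 1.5),
  `2020` = c(1.4, 1.6)
)

sales_long <- sales %>%
  # tidyr: reshape so that each city-year observation is its own row
  pivot_longer(cols = c(`2019`, `2020`), names_to = "year", values_to = "price") %>%
  # dplyr + stringr: tidy the city labels and store year as an integer
  mutate(city = str_to_title(city), year = as.integer(year))

sales_long
```

Each step is handled by a different package, but the pipe and the shared tibble structure let them chain together without any conversion in between.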
Now for the package
-------------------
```{r setup, message=FALSE, warning=FALSE}
# Load the core tidyverse packages with a single call
library(tidyverse)
```
```{r tidyverse usage-1}
# For this exercise we use a dataset from kaggle.com that contains
# information about avocado sales in various cities in the USA.
# ---------- Using the readr function read_csv() -------
avacado <- data.frame(read_csv(file = "https://raw.githubusercontent.com/tagensingh/SPS-DATA607-TIDYVERSE/main/avocado.csv"))
# ---------- Using as_tibble() to look at a snapshot of the data frame -------
as_tibble(avacado)
```
```{r tidyverse usage-2}
# ---------- Using the arrange() function from dplyr to sort the data frame by date -------
avacado_date <- arrange(avacado, Date)
as_tibble(avacado_date)
# ---------- Using the ggplot() function from ggplot2 to chart the density of avocado prices -------
# Histogram overlaid with kernel density curve
ggplot(avacado, aes(x = AveragePrice)) +
  geom_histogram(aes(y = after_stat(density)), # Density instead of count on the y-axis
                 binwidth = .1,
                 colour = "black", fill = "white") +
  geom_density(alpha = .1, fill = "#FF6666") + # Overlay with transparent density plot
  ggtitle("Avocado Pricing Density")
```
As shown above, the tidyverse is one of the most versatile collections of packages in the R ecosystem.
### Cassandra Coste TIDYVERSE Extend
I am going to extend Tage's work by exploring the `group_by()` and `summarise()` functions from dplyr on the same data, and then visualizing the results.
```{r}
avocado_region <- avacado %>% group_by(region)
```
You might notice that the group_by function won't make the data set look any different, but when we run calculations on the data, the group_by function will change the way the data is processed.
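
To make this concrete, here is a minimal sketch on a small made-up tibble (the `toy` data and its column names are hypothetical, not part of the avocado dataset). It checks that grouping leaves the rows and columns unchanged and only attaches grouping metadata.

```{r}
library(dplyr)
library(tibble)

# Hypothetical toy data, not the avocado dataset
toy <- tibble(
  region = c("East", "East", "West", "West"),
  price  = c(1.0, 2.0, 3.0, 5.0)
)

toy_grouped <- toy %>% group_by(region)

# Same number of rows and columns before and after grouping
identical(dim(toy), dim(toy_grouped))

# The only difference is the grouping metadata attached to the tibble
group_vars(toy_grouped)
n_groups(toy_grouped)
```

Printing a grouped tibble also shows a `Groups:` line in its header, which is the quickest way to tell whether a data set is currently grouped.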
Now, when we run `summarise()` to calculate the mean, the calculation is done separately for each region. This is helpful because you often want to compare groups to draw meaning from a dataset.
```{r}
avocado_region <- avocado_region %>% summarise(regional_avg = mean(AveragePrice))
```
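
The same pattern extends to several summaries at once. The sketch below uses a small made-up tibble (the `prices` data and the summary column names other than `regional_avg` are hypothetical) to contrast an ungrouped summary with a grouped one, and to add a standard deviation and a row count per region.

```{r}
library(dplyr)
library(tibble)

# Hypothetical toy data, not the avocado dataset
prices <- tibble(
  region = c("East", "East", "West", "West", "West"),
  price  = c(1.0, 2.0, 3.0, 5.0, 4.0)
)

# Without grouping, summarise() collapses everything to a single row
overall <- prices %>% summarise(avg = mean(price))

# With grouping, the same kind of call returns one row per region
region_summary <- prices %>%
  group_by(region) %>%
  summarise(
    regional_avg = mean(price),  # mean price per region
    regional_sd  = sd(price),    # spread of prices per region
    n_obs        = n(),          # number of rows per region
    .groups      = "drop"        # return an ungrouped tibble
  ) %>%
  arrange(desc(regional_avg))    # most expensive region first

overall
region_summary
```

Reporting the row count next to each mean is a cheap safeguard: an average based on only a few observations is easy to over-interpret when comparing regions.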
Finally, we can use ggplot2 to compare average avocado prices across regions.
```{r, fig.width=8,fig.height=12, message=FALSE, warning=FALSE}
# Bars are sorted by regional average; fill is a fixed colour, so it is set outside aes()
ggplot(avocado_region, aes(x = reorder(region, -regional_avg), y = regional_avg)) +
  geom_col(fill = "red") +
  coord_flip() +
  ggtitle("Regional Average of Avocado Prices") +
  ylab("Average Price (USD)") +
  xlab("Region") +
  scale_y_continuous(labels = scales::dollar_format(), limits = c(0, 2))
```
----------------------------------------------------------------------------------------------------