---
title: "Tidyverse CREATE Assignment (25 points)"
author: "Gabriel Campos"
date: "`r format(Sys.Date(), '%B %d %Y')`"
output:
  html_document:
    includes:
      in_header: header.html
    css: ./lab.css
    highlight: pygments
    theme: cerulean
    toc: true
    toc_float: true
  pdf_document: default
editor_options:
  chunk_output_type: console
---
```{r, echo=FALSE,warning=FALSE,message=FALSE}
library(tidyverse)
```
Assignment Requirements
=======================
**Tidyverse CREATE Assignment (25 points)**
+ Clone the provided repository (1 point) 🗸
+ Write a vignette using one TidyVerse package (15 points) 🗸
+ Write a vignette using more than one TidyVerse package (+ 2 points) 🗸
+ Make a pull request on the shared repository (1 point)
+ Update the README.md file with your example (2 points)
+ Submit your GitHub handle name & link to Peergrade (1 point)
+ Grade your 3 peers and provide the feedback in Peergrade (2 points)
+ Submit the best peer link & your link to Blackboard (1 point)
Overview
========
The `tidyverse` package is an open-source collection of packages with useful, widely applicable tools for data science. Like any other package, it can be installed with the `install.packages()` function. This vignette focuses on two of its packages, `reprex` and `ggplot2`. Running the code also requires the `openintro` package.
Load Package
=========
After installation, the library can be loaded using the command below:
```{r}
library(tidyverse)
```
```{r, include=FALSE}
Package_Details<-as.data.frame(tidyverse_packages(include_self = FALSE))
colnames(Package_Details)<-c("Package")
```
```{r, echo=FALSE}
Package_Details$Description <- c(
  "[For summarizing statistical models using tidy tibbles](https://www.tidyverse.org/blog/2020/07/broom-0-7-0/)",
  "[Suite of tools for Command Line Interface](https://cran.r-project.org/web/packages/cli/index.html)",
  "[Colored terminal output](https://rdrr.io/cran/crayon/)",
  "[Database backend for dplyr](https://dbplyr.tidyverse.org/)",
  "[Actions involving Data Manipulation](https://www.rdocumentation.org/packages/dplyr/versions/0.7.8)",
  "[Suite of tools for factors](https://forcats.tidyverse.org/)",
  "[Suite of tools for creating plots](https://ggplot2.tidyverse.org/)",
  "[Enables R to read and write various data formats](https://www.tidyverse.org/blog/2020/06/haven-2-3-0/)",
  "[Used for storing durations or times](https://hms.tidyverse.org/)",
  "[Wrapper for curl package](https://www.tidyverse.org/blog/2018/12/httr-1-4-0/)",
  "[JSON Parser and Generator for R](https://robotwealth.com/how-to-wrangle-json-data-in-r-with-jsonlite-purr-and-dplyr/)",
  "[Intuitive date-time data tools](https://lubridate.tidyverse.org/)",
  "[Operators for code readability](https://magrittr.tidyverse.org/)",
  "[Modeling pipeline functions](https://modelr.tidyverse.org/)",
  "[Column formatting tools](https://www.rdocumentation.org/packages/pillar/versions/1.4.7)",
  "[Allows for mapping functions to data](https://purrr.tidyverse.org/)",
  "[For reading rectangular data](https://www.rdocumentation.org/packages/readr/versions/1.3.1)",
  "[For reading data quickly from Excel files](https://readxl.tidyverse.org/)",
  "[Wrapper for creating snippets to post on websites and messaging apps](https://www.rdocumentation.org/packages/reprex/versions/1.0.0)",
  "[For core language features of tidyverse](https://www.rdocumentation.org/packages/rlang/versions/0.2.2)",
  "[For conditional access to the RStudio API from CRAN](https://rstudio.github.io/rstudioapi/)",
  "[Wrapper for scraping of information off of webpages](https://www.rdocumentation.org/packages/rvest/versions/0.3.6)",
  "[Tools for string cleaning and preparation](https://stringr.tidyverse.org/)",
  "[For dataframe creation](https://www.rdocumentation.org/packages/tibble/versions/3.0.6)",
  "[To 'tidy up' or simplify data](https://tidyr.tidyverse.org/)",
  "[To enhance work with HTML and XML through R](https://xml2.r-lib.org/)"
)
```
```{r,echo=FALSE}
Package_Details %>%
  kableExtra::kbl() %>%
  kableExtra::kable_material_dark() %>%
  kableExtra::footnote(general = "TIDYVERSE PACKAGES", general_title = "A-1")
```
Reprex
------
As explained in `Table A-1`, Reprex is a *wrapper for creating snippets to post on websites and messaging apps*. Its source information and details can be found below.
## Reprex Source Information:
(a) Website for the `Reprex` package: **[reprex.tidyverse.org](https://reprex.tidyverse.org/)**
(b) `Reprex` GitHub: **[github.com/tidyverse/reprex](https://github.com/tidyverse/reprex)**
(c) Good tutorial for `Reprex`: **[How to use reprex](https://reprex.tidyverse.org/articles/articles/learn-reprex.html)**, **[vignettes/articles/learn-reprex.Rmd](https://github.com/tidyverse/reprex/blob/master/vignettes/articles/learn-reprex.Rmd)** \newline
\clearpage
## ggplot2
As explained in `Table A-1`, ggplot2 is a suite of tools for creating plots. The data used to create the ggplot below comes from the `openintro` package. `OpenIntro` package details can be found below.
(a) `ggplot2` website: **[rdocumentation.org/packages/ggplot2/versions/3.3.3](https://www.rdocumentation.org/packages/ggplot2/versions/3.3.3)**
(b) `ggplot2` GitHub: **[github.com/cran/ggplot2](https://github.com/cran/ggplot2)**
### Loading data
The data used to create the plot is the dataset `evals` from the `OpenIntro` package, noted below:\newline
OpenIntro Github: **[github.com/OpenIntroStat/openintro](https://github.com/OpenIntroStat/openintro)**
* To list the data sets available in the currently loaded packages, the command `data()` can be used.
* To verify whether an `OpenIntro` package directory exists on your local machine, use the command `packageDescription("openintro")`.
* If it does not, or the library is not available for some reason, use `install.packages("openintro")` to install `OpenIntro`.
* The command `help(package = "openintro")` can be used to access more documentation regarding `OpenIntro`.
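
The checks listed above can be combined into a short script. This is a minimal sketch rather than part of the original assignment: the helper name `pkg_available()` is hypothetical, while `requireNamespace()` and `data(package = ...)` are standard R functions.

```r
# Hypothetical helper: TRUE if a package is installed and can be loaded.
pkg_available <- function(pkg) {
  requireNamespace(pkg, quietly = TRUE)
}

if (pkg_available("openintro")) {
  # List the names of the data sets shipped with openintro
  # and check that `evals` is among them.
  datasets <- data(package = "openintro")$results[, "Item"]
  print("evals" %in% datasets)
} else {
  message('openintro is not installed; run install.packages("openintro")')
}
```

Using `requireNamespace()` here avoids attaching the package just to test for it, so the check has no side effects on the search path.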
### Step 1: Load Library
```{r, results='hide',message=FALSE}
#Load library
library(openintro)
```
### Step 2: Load Data
```{r, message = FALSE}
## Load Dataset `evals` from `OpenIntro`
data(evals)
head(evals)
```
### Step 3: Prepare Data
The data frame `manipulated_data` is created from the `prof_id` and `score` columns of the `evals` data set. It is then grouped by `Score` with `group_by()` and condensed with `summarise()`, which leaves one row per distinct score and adds a new column, `no_rows`, holding the number of evaluations with that score, as shown below.
```{r, message = FALSE}
# Keep only the professor ID and the score
manipulated_data <- data.frame(Professors_ID = evals$prof_id, Score = evals$score)
head(manipulated_data, 3)
# Count how many evaluations received each score
manipulated_data <- manipulated_data %>%
  group_by(Score) %>%
  summarise(no_rows = length(Score))
```
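
For comparison, the same kind of tally can be written with `n()` or collapsed into a single `count()` call. The sketch below is an illustration only: `toy` is a hypothetical data frame standing in for the `evals` scores, and it assumes `dplyr` is installed.

```r
library(dplyr)

# Hypothetical toy scores standing in for evals$score.
toy <- data.frame(Score = c(4.5, 4.5, 3.9, 4.5, 3.9, 5.0))

# group_by() + summarise(), using n() to count the rows in each group.
by_summarise <- toy %>%
  group_by(Score) %>%
  summarise(no_rows = n())

# count() performs the grouping and the tally in one step.
by_count <- toy %>%
  count(Score, name = "no_rows")
```

Both results have one row per distinct score, sorted by `Score`, so either form could feed the plots that follow.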
### Step 4: Plot
When plotting with ggplot2, the plot type is chosen by adding a geom function such as `geom_line()`, `geom_density()`, `geom_histogram()`, or `geom_point()`. \newline
Multiple geoms can also be layered in one graph, as shown by running:

    ggplot(data = manipulated_data, aes(x = Score, y = no_rows)) +
      geom_histogram(aes(x = no_rows, ..density..)) +
      geom_density(aes(x = no_rows, ..density..), color = "red", size = 3)

--------------------
```{r ggplot_manipulated_data_basic,warning=FALSE,message=FALSE, include=TRUE,fig.show = "hold", out.width="50%", fig.height=4}
ggplot(data = manipulated_data, aes(x = Score, y = no_rows)) + geom_line()
ggplot(data = manipulated_data, aes(x = Score, y = no_rows)) + geom_density(aes(x = no_rows, ..density..))
ggplot(data = manipulated_data, aes(x = Score, y = no_rows)) + geom_histogram(aes(x = no_rows, ..density..))
ggplot(data = manipulated_data, aes(x = Score, y = no_rows)) + geom_point()
ggplot(data = manipulated_data, aes(x = Score, y = no_rows)) +
  geom_histogram(aes(x = no_rows, ..density..)) +
  geom_density(aes(x = no_rows, ..density..), color = "red", size = 3)
```
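
The `..density..` notation and the `size` argument for lines are deprecated in ggplot2 3.4.0 and later, in favor of `after_stat(density)` and `linewidth`. A minimal sketch of the layered plot in the newer style follows, assuming ggplot2 3.4.0 or later is installed; `toy_counts` is a hypothetical stand-in for `manipulated_data`.

```r
library(ggplot2)

# Hypothetical stand-in for manipulated_data.
toy_counts <- data.frame(
  Score   = c(3.5, 4.0, 4.5, 5.0),
  no_rows = c(2, 5, 9, 4)
)

# Histogram and density curve of the counts, layered in one plot.
# after_stat(density) replaces ..density.., and linewidth replaces size.
p <- ggplot(toy_counts, aes(x = no_rows)) +
  geom_histogram(aes(y = after_stat(density)), bins = 5) +
  geom_density(aes(y = after_stat(density)), color = "red", linewidth = 1)
```

Printing `p` draws the plot. On ggplot2 3.3.x the original `..density..` and `size` spellings are the ones to use.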
Below is an example of a more complex variation, which uses `geom_text()`, `labs()`, `theme()`, and `scale_x_continuous()` to customize the plot.
```{r ggplot_manipulated_data_intermediate, echo=TRUE,warning=FALSE, include=TRUE, fig.height=4}
# Use ggplot(), geom_bar(), geom_text(), labs(), scale_x_continuous(), and theme() to edit plot
ggplot(data = manipulated_data, aes(x = Score, y = no_rows, fill = no_rows)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = no_rows), position = position_dodge(width = .1), vjust = -0.25) +
  labs(title = 'Score Distribution', x = 'Score', y = "Count") +
  scale_x_continuous(breaks = unique(manipulated_data$Score)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
```
### Creating Snippet
The first step in using `reprex` is to copy the code you would like to turn into a snippet. Then run `reprex::reprex()`, or simply `reprex()` if the library is already loaded.
The example below shows how to make a snippet out of all the steps taken to build the ggplot in chunk `ggplot_manipulated_data_intermediate`.
```{r, results='hide',fig.show='hide'}
library(tidyverse)
library(openintro)
data(evals)
#head(evals)
manipulated_data <- data.frame(Professors_ID = evals$prof_id, Score = evals$score)
#head(manipulated_data,3)
manipulated_data <- manipulated_data %>%
  group_by(Score) %>%
  summarise(no_rows = length(Score))
ggplot(data = manipulated_data, aes(x = Score, y = no_rows, fill = no_rows)) +
  geom_bar(stat = "identity") +
  geom_text(aes(label = no_rows), position = position_dodge(width = .1), vjust = -0.25) +
  labs(title = 'Score Distribution', x = 'Score', y = "Count") +
  scale_x_continuous(breaks = unique(manipulated_data$Score)) +
  theme(axis.text.x = element_text(angle = 90, hjust = 1))
```
![COPY_CODE](copy_snippet.png)
![CONSOLE_OUTPUT](code_snippet_console.png)
The resulting snippet allows for an easy copy and paste, with the full graphics included.
![Github Load Example](github_load_ex.png)
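
Besides the clipboard workflow shown above, `reprex()` can also be given the code directly through its `input` argument. The sketch below assumes the `reprex` package is installed. `snippet_code` is a hypothetical stand-in for the copied chunk, and the call is guarded with `interactive()` so that the sketch does nothing when the document is knitted or run in batch mode.

```r
# Hypothetical code to turn into a snippet, one line per element.
snippet_code <- c(
  "x <- c(4.5, 4.5, 3.9)",
  "table(x)"
)

# Render the snippet as GitHub-flavored markdown (venue = "gh").
if (requireNamespace("reprex", quietly = TRUE) && interactive()) {
  reprex::reprex(input = snippet_code, venue = "gh")
}
```

Other values of `venue`, such as `"so"` or `"slack"`, format the same snippet for different sites.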
Conclusion
===========
In my opinion, understanding ggplot2 is almost a requirement, as complex plots are best built with this package. Reprex is also invaluable as a way to clearly show snippets of code to others without having to share an entire file. Snippets are most useful when posting on public forums, but they also help when working within a team and needing advice on a specific section.