---
title: Red Wine Exploratory Data Analysis
output: html_document
---
*by Js Lims*
*December 26 2016*
**Contents**
* Introduction
* Summary of Data
* Univariate Section
* Bivariate Section
* Multivariate Section
* Final Plots and Summary
* Reflection
```{r echo=FALSE, message=FALSE, warning=FALSE, Load_packages}
library(dplyr)
library(ggplot2)
library(GGally)
library(gridExtra)
```
# Introduction
The purpose of this project is to use EDA (Exploratory Data Analysis) techniques to examine distributions, outliers, relationships and any other surprising patterns, moving from one variable to multiple variables.
The goal is to find the variables that most influence the quality of red wine.
The analysis is written in R.
### A brief summary of the dataset
```{r echo=FALSE, Load_Data}
wine <- read.csv("/Users/watseob/Desktop/DataScience/Project/P_Wine/wineQualityReds.csv")
str(wine)
dim(wine)
summary(wine)
```
# Univariate Plots Section
### Fixed Acidity
```{r echo=FALSE, Histogram1}
ggplot(wine, aes(x = fixed.acidity)) +
geom_histogram(binwidth = 0.2) +
ggtitle("Fixed Acidity Histogram (binwidth = 0.2)") +
xlab("Fixed Acidity (tartaric acid - g / dm^3) ")
summary(wine$fixed.acidity)
```
As the histogram above shows, fixed acidity is positively skewed: the mean lies between the median and the 3rd quartile.
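The skew reading above can be checked numerically. The sketch below is only an illustration: `skewness()` is a helper I define here (it is not part of the original analysis), and `toy` holds made-up values rather than the wine measurements.

```r
# Moment-based sample skewness: positive values indicate a longer right tail.
# This helper is an assumption for illustration, not a function from a package.
skewness <- function(x) {
  x <- x[!is.na(x)]
  m <- mean(x)
  mean((x - m)^3) / (mean((x - m)^2))^1.5
}

# Toy right-skewed vector (illustrative values, not from the wine data)
toy <- c(1, 2, 2, 3, 3, 3, 4, 10)

skewness(toy)            # positive for a right-skewed sample
mean(toy) > median(toy)  # TRUE: the right tail pulls the mean above the median
```

With the data loaded, the same helper could be applied as `skewness(wine$fixed.acidity)`.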
### Volatile Acidity
Volatile acidity describes the condition of a wine. An appropriate level of volatile acidity is necessary for the wine's scent, but too much can spoil the wine.
```{r echo=FALSE, Histogram2}
ggplot(wine, aes(x = volatile.acidity)) +
geom_histogram(binwidth = 0.05) +
ggtitle("Volatile Acidity Histogram (binwidth = 0.05)") +
xlab("Volatile Acidity (acetic acid - g / dm^3)")
summary(wine$volatile.acidity)
```
The distribution of volatile acidity is close to normal, but there is a small tail on the right side of the plot. I wonder about the quality of the wines whose volatile acidity is above the 3rd quartile.
### Citric Acid
```{r echo=FALSE, Histofram4}
ggplot(wine, aes(x = citric.acid)) +
geom_histogram(binwidth = 0.02) +
ggtitle("Citric Acid Histogram (binwidth = 0.02)") +
xlab("Citric Acid (g / dm^3)")
summary(wine$citric.acid)
```
There are three peaks in this plot.
### Residual sugar
```{r echo=FALSE, Histogram5}
ggplot(wine, aes(x = residual.sugar)) +
geom_histogram(binwidth = 0.2) +
ggtitle("Residual Sugar Histogram (binwidth = 0.2)") +
xlab("Residual Sugar (g / dm^3)")
summary(wine$residual.sugar)
```
The distribution is positively skewed, with a long tail on the right side.
75% of wines have residual sugar below 2.6 g/dm^3.
```{r echo=FALSE, Histogram5_xlim,warning=FALSE}
ggplot(wine, aes(x = residual.sugar)) +
geom_histogram(binwidth = 0.2) +
ggtitle("Residual Sugar Histogram (binwidth = 0.2)") +
xlab("Residual Sugar (g / dm^3)") +
xlim(c(0,4))
```
After restricting the x-axis to 0 to 4 g/dm^3 to exclude outliers, residual sugar looks roughly normally distributed.
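The plot above drops outliers with a fixed axis limit. As an alternative sketch (not what this report does), outliers could be flagged with the common 1.5 * IQR rule. The function name `iqr_outliers()` and the `toy` values are hypothetical.

```r
# Flag values outside [Q1 - k * IQR, Q3 + k * IQR] (Tukey's rule, k = 1.5 by default).
# Hypothetical helper for illustration only.
iqr_outliers <- function(x, k = 1.5) {
  q <- quantile(x, c(0.25, 0.75), na.rm = TRUE, names = FALSE)
  fence <- k * (q[2] - q[1])
  x < q[1] - fence | x > q[2] + fence
}

# Made-up residual sugar values with one extreme observation
toy <- c(1.8, 1.9, 2.0, 2.1, 2.2, 2.3, 2.6, 15.5)

which(iqr_outliers(toy))   # index of the flagged value
toy[!iqr_outliers(toy)]    # values kept after dropping the flagged one
```

A data-driven rule like this avoids picking the cut-off by eye, although a fixed limit such as `xlim(c(0, 4))` is easier to explain in a plot title.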
### Chlorides
```{r echo=FALSE, Histogram6,warning=FALSE}
ggplot(wine, aes(x = chlorides)) +
geom_histogram(binwidth = 0.01) +
ggtitle("Chlorides Histogram (binwidth = 0.01)") +
xlab("Chloride (sodium chloride - g / dm^3)")
```
This plot looks normally distributed, but there is a long tail on the right side.
I will look at the effect of those outliers on wine quality later.
```{r echo=FALSE, Histogram6_removed_outlier ,warning=FALSE}
ggplot(wine, aes(x = chlorides)) +
geom_histogram(binwidth = 0.01) +
ggtitle("Chlorides Histogram (binwidth = 0.01)") +
xlab("Chloride (sodium chloride - g / dm^3)") +
xlim(c(0,0.15))
```
After removing outliers, the distribution looks normal.
### Free sulfur dioxide
```{r echo=FALSE, Histogram7}
ggplot(wine, aes(x = free.sulfur.dioxide)) +
geom_histogram(binwidth = 1) +
ggtitle("Free sulfur dioxide Histogram (binwidth = 1)") +
xlab("Free sulfur dioxide (mg / dm^3)")
summary(wine$free.sulfur.dioxide)
```
This plot is positively skewed. Sulfur dioxide is bad for the human body, so I wonder how it affects the quality of wine.
### Total Sulfur Dioxide
```{r echo=FALSE, Histogram8}
ggplot(wine, aes(x = total.sulfur.dioxide)) +
geom_histogram(binwidth = 5) +
ggtitle("Total sulfur dioxide Histogram ") +
xlab("Total sulfur dioxide (mg / dm^3)")
summary(wine$total.sulfur.dioxide)
```
This plot is also positively skewed. There are outliers near 300.
```{r echo=FALSE, Histogram8_log}
ggplot(subset(wine,wine$total.sulfur.dioxide<200), aes(x = total.sulfur.dioxide)) +
geom_histogram(binwidth = 0.1) +
ggtitle("Total sulfur dioxide Histogram ") +
xlab("Log(Total sulfur dioxide) (mg / dm^3)") +
scale_x_log10()
```
After removing outliers and applying a log scale, the distribution looks normal.
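To illustrate why a log scale helps with a right-skewed variable, here is a small sketch on simulated data. The values come from `rlnorm()` with parameters I picked arbitrarily, so they only stand in for a skewed measurement and are not the wine data.

```r
set.seed(42)  # arbitrary seed so the simulation is reproducible

# Simulated right-skewed values (log-normal); NOT the wine measurements
x <- rlnorm(1000, meanlog = 3.5, sdlog = 0.7)

# On the raw scale the long right tail pulls the mean above the median
mean(x) - median(x)

# On the log10 scale the sample is roughly symmetric, so the gap is near zero
mean(log10(x)) - median(log10(x))
```

Note that `scale_x_log10()` only changes the axis of the plot. To use the transformed values elsewhere, the variable itself would need to be transformed, for example with `log10()`.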
### Density
```{r echo=FALSE, Histogram9}
ggplot(wine, aes(x = density)) +
geom_histogram(binwidth = 0.0003) +
ggtitle("Density Histogram (binwidth = 0.3 * 10 ^-3) ") +
xlab("Density (g / cm^3)")
summary(wine$density)
```
Density is close to normally distributed. The mean and median are fairly close.
### pH
```{r echo=FALSE, Histogram10}
ggplot(wine, aes(x = pH)) +
geom_histogram(binwidth = 0.05) +
ggtitle("pH Histogram (binwidth = 0.05)") +
xlab("pH")
summary(wine$pH)
```
pH is also approximately normally distributed.
### Total Sulphates
```{r echo=FALSE, Histogram11}
ggplot(wine, aes(x = sulphates)) +
geom_histogram(binwidth = 0.05) +
ggtitle("Sulphates Histogram (binwidth = 0.05)") +
xlab("Sulphates (potassium sulphate - g / dm3)")
summary(wine$sulphates)
```
The sulphates variable is right skewed.
```{r echo=FALSE, Histogram11_log}
ggplot(wine, aes(x = sulphates)) +
geom_histogram(binwidth = 0.05) +
ggtitle("Sulphates Histogram (binwidth = 0.05)") +
xlab("Log(Sulphates) (potassium sulphate - g / dm3)") +
scale_x_log10()
```
With a log scale on the x-axis, the distribution looks closer to normal.
### Alcohol
```{r echo=FALSE, Histogram12}
ggplot(wine, aes(x = alcohol)) +
geom_histogram(binwidth = 0.5) +
ggtitle("Alcohol Histogram (binwidth = 0.5)") +
xlab("Alcohol (% by volume)")
summary(wine$alcohol)
```
The plot is right skewed. 75% of wines have an alcohol content below 11.10%.
### Quality
```{r echo=FALSE, Histofram13}
ggplot(wine, aes(x = quality)) +
geom_bar() +
scale_x_continuous(breaks = seq(0,8,1)) +
ggtitle("Quality Barchart") +
xlab("Quality ( 0 ~ 10 )")
summary(wine$quality)
```
I grouped the quality scores into a new `level` variable:
* Quality 3 and 4 -> low
* Quality 5 and 6 -> middle
* Quality 7 and 8 -> high
```{r echo=FALSE,warning=FALSE,message=FALSE, Histogram14}
wine$level <- cut(wine$quality, c(0,4,6,10), labels = c("low","middle","high"), include.lowest = T)
ggplot(wine, aes(x = level)) +
geom_bar() +
ggtitle("Quality Level Barchart") +
xlab("Quality (low, middle, high)")
```
Most wines fall in the middle quality level.
The mean quality score is 5.636.
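The grouping step uses `cut()` with breaks at 0, 4, 6 and 10. The sketch below applies the same call to a toy vector of scores (not the wine data) to show which score lands in which level. The names `scores` and `level` are only for illustration.

```r
# One example score for each quality value that appears in the grouping above
scores <- c(3, 4, 5, 6, 7, 8)

# Intervals are [0, 4], (4, 6] and (6, 10]: the right end of each bin is included
level <- cut(scores,
             breaks = c(0, 4, 6, 10),
             labels = c("low", "middle", "high"),
             include.lowest = TRUE)

data.frame(scores, level)  # scores 3-4 are low, 5-6 middle, 7-8 high
table(level)               # two scores per level in this toy vector
```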
## Univariate Analysis
##### What is the structure of your dataset?
There are 1599 observations and 13 attributes in this data set.
All variables are numeric except quality, which I treat as categorical.
##### What is/are the main feature(s) of interest in your dataset?
Quality is the main variable of interest. We need to figure out how the other variables affect it.
##### What other features in the dataset do you think will help support your investigation into your feature(s) of interest?
According to some articles I have read about wine, flavor and scent are important to wine quality.
Many factors likely affect them, and the balance among these factors would also be important.
I think the following variables will support my investigation:
total acidity, fixed acidity, citric acid and alcohol.
##### Did you create any new variables from existing variables in the dataset?
Yes. I created a `level` variable that groups quality into low, middle and high.
##### Of the features you investigated, were there any unusual distributions? Did you perform any operations on the data to tidy, adjust, or change the form of the data? If so, why did you do this?
Several variables have positively skewed distributions:
* Free sulfur dioxide plot
* Total sulfur dioxide plot
* Alcohol plot
* Citric acid plot
Since the data are already tidy, I did not need to reshape or clean them.
# Bivariate Plots Section
I'm going to check the relationships between features.
First, let's look at them with a pair plot.
The plot is drawn from a random sample of 500 observations taken from the whole dataset.
## Pair plot
```{r echo=FALSE,message=FALSE, pairplot,warning=FALSE, fig.height=10, fig.width=10}
set.seed(1234)
sub_wine <- wine[,c("fixed.acidity","volatile.acidity","citric.acid","residual.sugar","chlorides", "free.sulfur.dioxide","total.sulfur.dioxide","density","pH","sulphates","alcohol","quality")]
sub_wine$quality <- as.factor(sub_wine$quality)
names(sub_wine)
ggpairs(sub_wine[sample.int(nrow(sub_wine), 500),])
```
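The sampling step in the chunk above relies on `set.seed()` to make the random subset reproducible. A minimal sketch of that idea, using a small made-up data frame (`toy` is hypothetical, not the wine data):

```r
# Hypothetical stand-in for the wine data frame
toy <- data.frame(id = 1:20, value = (1:20)^2)

set.seed(1234)                      # same seed value as the chunk above
rows_a <- sample.int(nrow(toy), 5)  # 5 row indices, drawn without replacement

set.seed(1234)                      # resetting the seed reproduces the draw
rows_b <- sample.int(nrow(toy), 5)

identical(rows_a, rows_b)           # TRUE
toy[rows_a, ]                       # the sampled rows
```

Without the seed, each knit of the report would draw a different 500 rows, so the pair plot and its correlations would change slightly from run to run.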
From the pair plot we can say:
- The quality of wine appears to be related to volatile acidity, citric acid, sulphates, alcohol, free sulfur dioxide and total sulfur dioxide.
- There are negative and positive correlations between some variables.
Let's check them out.
## Scatter plot
### fixed acidity vs density, citric acid, pH
```{r echo=FALSE, scatterplot1}
p1 <- ggplot(wine, aes(x = fixed.acidity, y = density)) +
geom_point(alpha = 0.4) +
geom_smooth(method = "lm") +
xlab("Fixed Acidity (tartaric acid - g / dm^3)") +
ylab("Density (g / cm^3)")
p2 <- ggplot(wine, aes(x = fixed.acidity, y = citric.acid)) +
geom_point() +
geom_smooth(method = "lm") +
xlab("Fixed Acidity (tartaric acid - g / dm^3)") +
ylab("Citric Acid (g / dm^3)")
p3 <- ggplot(wine, aes(x = fixed.acidity, y = pH)) +
geom_point() +
geom_smooth(method = "lm") +
xlab("Fixed Acidity (tartaric acid - g / dm^3)")
grid.arrange(p1,p2,p3)
cor.test(wine$fixed.acidity,wine$density,method="pearson")
cor.test(wine$fixed.acidity,wine$citric.acid,method="pearson")
cor.test(wine$fixed.acidity,wine$pH,method="pearson")
```
Fixed acidity is positively correlated with density and citric acid, while negatively correlated with pH.
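For reference, the Pearson coefficient reported by `cor.test()` can be computed by hand. The sketch below uses made-up acidity and pH values and a helper name (`pearson_r`) that I chose for illustration.

```r
# Pearson correlation: covariance of x and y divided by the product of their spreads
pearson_r <- function(x, y) {
  dx <- x - mean(x)
  dy <- y - mean(y)
  sum(dx * dy) / sqrt(sum(dx^2) * sum(dy^2))
}

# Toy values (not from the wine data): pH falls as acidity rises
acid <- c(6.0, 7.2, 7.8, 8.5, 9.9, 11.2)
ph   <- c(3.60, 3.45, 3.38, 3.30, 3.21, 3.05)

pearson_r(acid, ph)                            # negative
all.equal(pearson_r(acid, ph), cor(acid, ph))  # TRUE: matches base R's cor()
```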
### volatile acidity vs citric acid
```{r echo=FALSE,message=FALSE,warning=FALSE, scatterplot2}
ggplot(wine, aes(x = volatile.acidity, y = citric.acid)) +
geom_point() +
geom_smooth(method = "lm") +
ylim(c(0,1)) +
xlab("Volatile Acidity (acetic acid - g / dm^3)") +
ylab("Citric Acid (g / dm^3)")
cor.test(wine$volatile.acidity,wine$citric.acid,method="pearson")
```
Volatile acidity is negatively correlated with citric acid.
### alcohol vs density
```{r echo=FALSE, scatterplot3}
ggplot(wine, aes(x = density, y = alcohol)) +
geom_point() +
geom_smooth(method = "lm")+
ylab("Alcohol (% by volume)") +
xlab("Density (g / cm^3)")
cor.test(wine$density,wine$alcohol,method="pearson")
```
Density is negatively correlated with alcohol.
This is expected, because alcohol lowers the density of wine.
### free sulfur dioxide vs total sulfur dioxide
```{r echo=FALSE, scatterplot4}
ggplot(wine, aes(x = total.sulfur.dioxide, y = free.sulfur.dioxide)) +
geom_point() +
geom_smooth(method = "lm") +
xlab("Total sulfur dioxide (mg / dm^3)") +
ylab("Free sulfur dioxide (mg / dm^3)")
```
There are 2 outliers on the right side, with no data points around them. So, before fitting a linear regression model, let's remove them.
```{r echo=FALSE, scatterplot5}
idx <- wine$total.sulfur.dioxide < 200
ggplot(wine[idx,], aes(x = total.sulfur.dioxide, y = free.sulfur.dioxide)) +
geom_point() +
geom_smooth(method = "lm") +
xlab("Total sulfur dioxide (mg / dm^3)") +
ylab("Free sulfur dioxide (mg / dm^3)")
summary(lm(wine[idx,], formula = free.sulfur.dioxide ~ total.sulfur.dioxide))
cor.test(wine[idx,]$total.sulfur.dioxide,wine[idx,]$free.sulfur.dioxide,method="pearson")
```
Total sulfur dioxide and free sulfur dioxide are positively correlated.
## Box plot
```{r echo=FALSE, boxplot}
p1 <- ggplot(sub_wine, aes(x = quality, y = volatile.acidity)) +
geom_boxplot() +
ylab("Volatile Acidity (acetic acid - g / dm^3)")
p2 <- ggplot(sub_wine, aes(x = quality, y = citric.acid)) +
geom_boxplot() +
ylab("Citric Acid (g / dm^3)")
p3 <- ggplot(sub_wine, aes(x = quality, y = sulphates)) +
geom_boxplot() +
ylab("Sulphates (potassium sulphate - g / dm3)")
p4 <- ggplot(sub_wine, aes(x = quality, y = alcohol)) +
geom_boxplot() +
ylab("Alcohol (% by volume)")
p5 <- ggplot(sub_wine, aes(x = quality, y = density)) +
geom_boxplot() +
ylab("Density (g / cm^3)")
p6 <- ggplot(sub_wine, aes(x = quality, y = pH)) +
geom_boxplot()
grid.arrange(p1,p2,p3,p4,p5,p6, ncol = 3)
```
Wine quality is positively associated with alcohol, citric acid and sulphates, and negatively associated with volatile acidity, pH and density.
### alcohol density plot
```{r echo=FALSE, Multivariate_Plots4}
wine$level <- cut(wine$quality, c(0,4,6,10), labels = c("low","middle","high"), include.lowest = T)
ggplot(wine,aes(alcohol, col = level, fill = level)) +
geom_density(alpha= 0.1) +
xlab("Alcohol (% by volume)")
```
This chart shows how strongly alcohol content is related to the quality level.
Wines with higher alcohol content are more likely to be high-quality wines.
## Linear model
```{r echo=FALSE, linear_model}
summary(lm(wine,formula = quality ~ alcohol + volatile.acidity + sulphates + citric.acid + density + pH))
```
The linear model with six predictors explains 34.21% of the variability in quality. Density and citric.acid are not statistically significant, so there is likely no linear relationship between these two variables and quality once the other predictors are in the model.
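The "percent of variability explained" figure is the R-squared of the fit. The sketch below shows how it is computed, using simulated data in which one predictor has no true effect. The variable names (`x1`, `x2`, `y`) and coefficients are my own assumptions and have nothing to do with the wine data.

```r
set.seed(1)  # arbitrary seed for reproducibility
n  <- 200
x1 <- rnorm(n)
x2 <- rnorm(n)                 # x2 has no true effect on y
y  <- 1 + 0.8 * x1 + rnorm(n)

fit <- lm(y ~ x1 + x2)

# R-squared = 1 - (residual sum of squares / total sum of squares)
ss_res <- sum(residuals(fit)^2)
ss_tot <- sum((y - mean(y))^2)
r2 <- 1 - ss_res / ss_tot

all.equal(r2, summary(fit)$r.squared)  # TRUE: same value summary() reports
coef(summary(fit))[, "Pr(>|t|)"]       # p-values used to judge significance
```

A large p-value for a predictor means the data do not show evidence of a linear effect after accounting for the other predictors. It does not prove that the predictor is unrelated to the response.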
## Bivariate Analysis
##### Talk about some of the relationships you observed in this part of the investigation. How did the feature(s) of interest vary with other features in the dataset?
I found relationships between some variables.
* negative correlation
    - fixed acidity vs pH
    - volatile acidity vs citric acid
    - alcohol vs density
* positive correlation
    - fixed acidity vs density
    - fixed acidity vs citric acid
    - free sulfur dioxide vs total sulfur dioxide
##### Did you observe any interesting relationships between the other features (not the main feature(s) of interest)?
The negative correlation between volatile acidity and citric acid is interesting.
It is not what I expected.
##### What was the strongest relationship you found?
The relationship between fixed acidity and pH is the strongest.
# Multivariate Plots Section
I grouped the quality scores into a `level` variable:
* Quality 3 and 4 -> low
* Quality 5 and 6 -> middle
* Quality 7 and 8 -> high
The polygons are drawn at a confidence level of 0.95.
### volatile acidity, citric acid, quality
```{r echo=FALSE, Multivariate_Plots1}
wine$fquality <- as.factor(wine$quality)
p1 <- ggplot(wine,(aes(x=volatile.acidity, y = citric.acid, col = fquality))) +
geom_jitter(alpha = 0.8) +
scale_color_brewer() +
xlab("Volatile Acidity (acetic acid - g / dm^3)") +
ylab("Citric Acid (g / dm^3)") + theme_dark()
p2 <- ggplot(wine,(aes(x=volatile.acidity, y = citric.acid, col = level))) +
geom_jitter(alpha = 0.3) +
stat_ellipse(geom = "polygon", alpha = 0.1, aes(fill = level)) +
xlab("Volatile Acidity (acetic acid - g / dm^3)") +
ylab("Citric Acid (g / dm^3)") + theme_dark()
grid.arrange(p1,p2,ncol= 2)
```
High-quality wines have higher citric acid and lower volatile acidity, while low-quality wines have lower citric acid and higher volatile acidity.
### alcohol, citric acid, quality
```{r echo=FALSE, Multivariate_Plots2}
p1 <- ggplot(wine,(aes(y=alcohol, x = citric.acid, col = fquality))) +
geom_jitter(alpha = 0.8) +
scale_color_brewer() +
xlab("Citric Acid (g / dm^3)")+
ylab("Alcohol (% by volume)")
p2 <- ggplot(wine,(aes(y=alcohol, x = citric.acid, col = level))) +
geom_jitter(alpha = 0.3) +
stat_ellipse(geom = "polygon", alpha = 0.1, aes(fill = level)) +
xlab("Citric Acid (g / dm^3)") +
ylab("Alcohol (% by volume)")
grid.arrange(p1,p2,ncol= 2)
```
High-quality wines have higher alcohol and citric acid. Middle- and low-quality wines have similar alcohol levels, but middle-quality wines have more citric acid.
There is no clear relationship between alcohol and citric acid.
### volatile acidity, level of quality, alcohol
```{r echo=FALSE,warning=FALSE,message=FALSE, Multivariate_Plots3}
wine$fquality <- as.factor(wine$quality)
ggplot(wine,(aes(x=alcohol, y = volatile.acidity, col = fquality ))) +
geom_jitter(alpha = 0.6) +
geom_smooth(method = lm,se = FALSE) +
scale_color_brewer() +
ylim(c(0,1)) +
xlab("Alcohol (% by volume)") +
ylab("Volatile Acidity (acetic acid - g / dm^3)")
```
Within each quality score except the lowest, the fitted relationship between volatile acidity and alcohol is positive. Also, the higher the volatile acidity, the worse the quality of the wine.
## Multivariate Analysis
##### Talk about some of the relationships you observed in this part of the investigation. Were there features that strengthened each other in terms of looking at your feature(s) of interest?
Coloring the scatter plot of citric acid and volatile acidity by quality shows clearly that higher citric acid and lower volatile acidity go together with better wine quality.
The plain scatter plot shows no relationship between alcohol and citric acid. However, plotting it with the quality level shows that alcohol is a really important variable for identifying high-quality wines, and that citric acid is also a fairly important variable for wine quality.
##### Were there any interesting or surprising interactions between features?
Among high-quality wines, most of those with low alcohol have high citric acid and low volatile acidity. When high-quality wines have low citric acid and high volatile acidity, they have a high level of alcohol.
##### OPTIONAL: Did you create any models with your dataset? Discuss the strengths and limitations of your model.
I created a linear model in the Bivariate Plots Section to predict wine quality from alcohol, volatile acidity, sulphates, citric acid, density and pH. However, it explains only 34.21% of the variability in quality, which means it is not accurate.
# Final Plots and Summary
### Plot One
```{r echo=FALSE, Plot_One}
p1 <- ggplot(sub_wine, aes(x = quality, y = volatile.acidity, fill = quality))+
geom_violin(alpha =0.5) +
geom_boxplot(width=0.1,alpha = 0.5) +
stat_summary(fun.y = median, aes(group = 1), geom = "line") +
ylab("Volatile Acidity (acetic acid - g / dm^3)") +
xlab("Quality") +
guides(fill=FALSE)
p2 <- ggplot(sub_wine, aes(x = quality, y = citric.acid, fill = quality)) +
geom_violin(alpha =0.5) +
geom_boxplot(width=0.1,alpha = 0.5) +
stat_summary(fun.y = median, aes(group = 1), geom = "line") +
ylab("Citric Acid (g / dm^3)") +
xlab("Quality") +
guides(fill=FALSE)
p3 <- ggplot(sub_wine, aes(x = quality, y = sulphates, fill = quality)) +
geom_violin(alpha =0.5) +
geom_boxplot(width=0.1,alpha = 0.5) +
stat_summary(fun.y = median, aes(group = 1), geom = "line") +
guides(fill=FALSE)
p4 <- ggplot(sub_wine, aes(x = quality, y = alcohol, fill = quality)) +
geom_violin(alpha =0.5) +
geom_boxplot(width=0.1,alpha = 0.5) +
stat_summary(fun.y = median, aes(group = 1), geom = "line") +
guides(fill=FALSE)
grid.arrange(p1,p2, ncol = 2,top='violin plot')
```
### Description One
By overlaying box plots on violin plots, we can see the distribution of volatile acidity and citric acid for each quality score. As quality improves, volatile acidity is distributed at lower levels and citric acid at higher levels.
The black lines connecting the medians of each quality score support a negative relationship between volatile acidity and quality, and a positive relationship between citric acid and quality.
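The median line in each panel simply connects the per-quality medians. A small sketch of that calculation with base R, using a made-up data frame (`toy` and its values are illustrative, not the wine data):

```r
# Toy data: three wines for each of three quality scores
toy <- data.frame(
  quality = factor(c(3, 3, 3, 5, 5, 5, 7, 7, 7)),
  volatile.acidity = c(0.9, 1.0, 0.8, 0.6, 0.5, 0.7, 0.4, 0.3, 0.35)
)

# Median of volatile acidity within each quality score
med <- tapply(toy$volatile.acidity, toy$quality, median)
med

# A falling sequence of medians corresponds to a downward-sloping line in the plot
all(diff(as.vector(med)) < 0)  # TRUE for this toy data
```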
### Plot Two
```{r echo=FALSE, Plot_Two}
p1 <- ggplot(wine,(aes(x=volatile.acidity, y = citric.acid, col = level))) +
geom_jitter(alpha = 0.3) +
guides(fill=FALSE) +
stat_ellipse(geom = "polygon", alpha = 0.1, aes(fill = level)) +
xlab("Volatile Acidity (acetic acid - g / dm^3)") +
ylab("Citric Acid (g / dm^3)") +
labs(colour = "Quality Level")
p2 <- ggplot(wine,(aes(x=volatile.acidity, y = citric.acid, col = level))) + guides(fill=FALSE) +
stat_ellipse(geom = "polygon", alpha = 0.2, aes(fill = level),level = 0.10) +
stat_ellipse(geom = "polygon", alpha = 0.2, aes(fill = level),level = 0.20) +
stat_ellipse(geom = "polygon", alpha = 0.2, aes(fill = level),level = 0.30) +
stat_ellipse(geom = "polygon", alpha = 0.2, aes(fill = level),level = 0.40) +
stat_ellipse(geom = "polygon", alpha = 0.2, aes(fill = level),level = 0.50) +
stat_ellipse(geom = "polygon", alpha = 0.2, aes(fill = level),level = 0.05) +
stat_ellipse(geom = "polygon", alpha = 0.2, aes(fill = level),level = 0.01) +
xlab("Volatile Acidity (acetic acid - g / dm^3)") +
ylab("Citric Acid (g / dm^3)") +
labs(colour = "Quality Level")
grid.arrange(p1,p2,ncol= 2, top='Citric acid vs Volatile acidity with Quality')
```
### Description Two
By drawing nested ellipses in the right-hand panel, we can see how the quality levels separate. Lower volatile acidity together with higher citric acid is associated with better wine quality.
(Ellipse confidence levels: 0.5, 0.4, 0.3, 0.2, 0.1, 0.05, 0.01)
### Plot Three
```{r echo=FALSE, Plot_Three}
p1 <- ggplot(wine,aes(alcohol, col = level, fill = level)) +
geom_density(alpha= 0.1) +
ylab("Density") +
xlab("Alcohol (% by volume)")
p2 <- ggplot(wine, aes(alcohol, col = level)) + stat_ecdf(geom = "line") +
ylab("Rate") +
xlab("Alcohol (% by volume)")
grid.arrange(p1,p2, ncol = 2, top='Density of Alcohol')
```
### Description Three
I added an ECDF plot on the right side. The cumulative rate for high-quality wines begins to rise at a higher alcohol content than for the other levels. Looking at both plots, there is no big difference between low- and middle-quality wines. However, high-quality wines are clearly different from both middle- and low-quality wines.
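For readers unfamiliar with the ECDF, it gives the share of observations at or below a value. A minimal sketch with base R's `ecdf()` on made-up alcohol values (`alcohol_toy` is hypothetical, not the wine data):

```r
# Eight toy alcohol percentages
alcohol_toy <- c(9.0, 9.5, 10.0, 10.5, 11.0, 12.0, 12.5, 13.0)

# ecdf() returns a function: F_hat(v) is the proportion of values <= v
F_hat <- ecdf(alcohol_toy)

F_hat(10.5)  # 0.5: four of the eight values are at or below 10.5
F_hat(8)     # 0: no values this low
F_hat(13)    # 1: every value is at or below 13
```

A curve that stays low for longer before rising, as for the high-quality wines in the plot, means that group has few observations at low alcohol levels.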
------
# Reflection
This data set contains a lot of surprising information on red wines and their chemical properties. Step by step, I explored one variable, then two variables, then multiple variables, and found which features are related to wine quality.
I wish the data set included other variables, such as the price of the wine or the place where it was made. Such a data set would raise more interesting questions.
I was able to create a linear model to predict quality for new data, but the model was not accurate. If quality were recorded as a continuous variable, this analysis could be more accurate. With a continuous target variable, we could scale the quality variable to get better visualizations. That would make the results clearer and help improve the linear model. There may still be good ways to predict wine quality with another kind of model.
While exploring this dataset, I tried to make scatter plots. However, since the dataset is large, the data points overlap, which makes the plots hard to read. Even adjusting color and opacity didn't work well. This also made it difficult to build a bubble chart in the multivariate analysis. For this reason, I used the `stat_ellipse` and `stat_smooth` functions, which really helped me get better plots.