GitHub - anukaal/Wine-analysis: Predicting the quality of Wine

Project Description

Project Vision Statement:

This project explores a dataset I found on Kaggle, using Python and standard data science skills to clean, analyze, visualize, and interpret the data and to draw meaning from the patterns in it.

Since I have spent the majority of my career in the FoodService and FoodService Equipment Industry, I wanted this project to be related to a culinary dataset of some kind. The Wine Review dataset caught my attention because I have always been suspicious of the idea that quality and price are directly linked in the consumer's mind. Usually "you get what you pay for" is a safe bet, but sometimes a more expensive commodity is not the "best" or preferred choice for consumers. This is especially evident in the Food Industry, where epicurean prices are tied to a subjective palate. I want to analyze the Wine Reviews and see if the more expensive wines are always the highest rated ones by consumers.

I think that this dataset offers some great opportunities for text-related predictive models. My overall goal in the future is to use another version of this dataset (with three additional columns) to do some predictive analysis on the text-based Description column. Ultimately I am interested in building a bot that could produce a convincing Wine Review. If anyone has any ideas, breakthroughs, or other interesting insights or models, please post them! Feel free to fork as well. I am open to constructive feedback and tips.

Project goals

Import and inspect the dataset using pandas.
Analyze the dataset using pandas and numpy.
Create visualizations using matplotlib and seaborn.
Interpret meanings from the data using the Scientific Method ("Data Science!").
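
As a rough illustration of the first goal, the sketch below loads a tiny hand-written sample shaped like winemag-data_first150k.csv and runs the usual first-look checks. The sample rows are made up, and the lowercase column names and the file path in the comment are assumptions, not taken from the notebook.

```python
import io

import pandas as pd

# Hypothetical three-row sample in the shape of winemag-data_first150k.csv.
# The values are invented for illustration; they are not rows from the dataset.
SAMPLE_CSV = """,country,description,designation,points,price,province,region_1,region_2,variety,winery
0,US,Ripe fruit and oak.,Reserve,90,25.0,California,Napa Valley,Napa,Cabernet Sauvignon,Example Winery A
1,Spain,Dry and mineral.,,88,,Northern Spain,Toro,,Tinta de Toro,Example Winery B
2,US,Crisp apple notes.,,87,15.0,California,Sonoma Coast,Sonoma,Chardonnay,Example Winery C
"""

# With the real file this would be something like
# pd.read_csv("winemag-data_first150k.csv", index_col=0); the path is an assumption.
df = pd.read_csv(io.StringIO(SAMPLE_CSV), index_col=0)

print(df.shape)         # (rows, columns)
print(df.dtypes)        # type of each column
print(df.isna().sum())  # missing values per column
print(df.describe())    # summary statistics for the numeric columns
```

The missing-value count is worth checking early, since a blank price or region would otherwise surface later as gaps in the plots.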

Project Stack

Python has rich data science functionality that has been motivated by teams of scientists and engineers trying to solve scientific and engineering problems. Python's object-oriented design, ease of syntax, and available libraries make it the industry standard for data analysis. A 2016 study done by O'Reilly shows that Python is now dominant over R throughout the data science community, favoring Python 3.6 over the soon-to-be-extinct Python 2.7.

Python has become the fastest growing programming language of 2019, and continues to remain the industry standard for modeling and analysis in the scientific and engineering industries. The Scientific Python Stack is an array of technologies that make Python so powerful for Data analysis and statistical prediction.

To get everything running in this project, use pip install -r requirements.txt

Let's take a quick tour of the Scientific Python (SciPy) stack I used for the Wine Review Analysis:

Language

Python 3.6 (replacing legacy Python 2.7 in 2020)
Cython (a compiler that turns Python-like code into fast C extensions, used in parts of NumPy)

Scientific & Numeric Power

SciPy
NumPy
scikit-learn

Interactive Environment

Anaconda (Python distribution)
IPython Notebooks
GitHub (version control)
RMOTR Notebooks

Data Science Libraries

Analysis tools
- NumPy
- Pandas
- Cython
Visualization tools
- Matplotlib
- Seaborn
- Bokeh

Dataset Overview

I searched food-related datasets on Kaggle, using its filtering and search, and found the Wine Review dataset. I was looking for a medium-sized CSV file between 50 MB and 1 GB that would take some processing but wasn't a wrangling project. I also wanted to make some visualizations using the seaborn library.

The data was scraped from Wine Enthusiast Magazine during the week of June 15th, 2017.

The dataset contains 150,930 Wine Reviews in one CSV file of about 50 MB:

winemag-data_first150k.csv contains 10 columns and 150k+ rows of Wine Reviews scraped from Wine Enthusiast during June of 2017.

Each record in the dataset represents a single wine review from an online user of Wine Enthusiast Magazine.

The following is a brief summary of the 10 different columns of data included in winemag-data_first150k.csv:

Data Columns

Country - The country of origin of the wine.
Description - The description of the wine's flavor profile.
Designation - The vineyard where the wine's grapes are sourced.
Points - The number of points Wine Enthusiast Magazine rated the wine on a scale of 1-100.
Price - The cost for a single bottle of the wine.
Province - The province or state that the wine is from.
Region 1 - The wine growing area in a province or state (for example, Napa Valley in California).
Region 2 - (Optional) A more specific region in a wine growing area (for example, Rutherford inside Napa Valley).
Variety - The type of grapes used to make the wine (for example, Pinot Noir).
Winery - The winery that made the wine.
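
The column list above can be turned into a simple schema check and cleaning step. This is a minimal sketch, not the notebook's actual cleaning code: the lowercase column names, the helper name clean_reviews, and the choice to drop duplicate rows and rows without a price are all assumptions, and the rows are made up.

```python
import pandas as pd

# Assumed column names, following the ten columns described above.
EXPECTED_COLUMNS = [
    "country", "description", "designation", "points", "price",
    "province", "region_1", "region_2", "variety", "winery",
]


def clean_reviews(df: pd.DataFrame) -> pd.DataFrame:
    """Return a copy with the expected columns, no duplicate rows, and known price and points."""
    missing = [c for c in EXPECTED_COLUMNS if c not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    cleaned = df[EXPECTED_COLUMNS].drop_duplicates()
    cleaned = cleaned.dropna(subset=["price", "points"])
    return cleaned.reset_index(drop=True)


# Invented rows for illustration: one exact duplicate and one review without a price.
rows = [
    ["US", "Ripe fruit.", "Reserve", 90, 25.0, "California", "Napa Valley", "Napa", "Cabernet Sauvignon", "Winery A"],
    ["US", "Ripe fruit.", "Reserve", 90, 25.0, "California", "Napa Valley", "Napa", "Cabernet Sauvignon", "Winery A"],
    ["Spain", "Dry.", None, 88, None, "Northern Spain", "Toro", None, "Tinta de Toro", "Winery B"],
    ["US", "Crisp apple.", None, 87, 15.0, "California", "Sonoma Coast", "Sonoma", "Chardonnay", "Winery C"],
]
raw = pd.DataFrame(rows, columns=EXPECTED_COLUMNS)
cleaned = clean_reviews(raw)
print(len(raw), "->", len(cleaned))
```

Optional columns such as designation and region_2 are left as missing rather than dropped, since they are not needed for a price-versus-points comparison.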

Analysis

Check out the WineReviewAnalysis notebook (WineReviewAnalysis.ipynb) for the analysis.

Dependencies

To get everything running, use pip install -r requirements.txt

Results

After cleaning and inspecting the Wine Reviews dataset, we used numerical and statistical analysis to create visualizations from the dataset. Based on focused plots of point distributions, jointplots, and heatmaps, the best-value wine among the 150,930 reviews has the following profile:

Made in California
A Chardonnay, Pinot Grigio, or Cabernet Sauvignon
12.00 - 18.00 USD per bottle
A rating of 87.5 points or higher is highly likely
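
A profile like the one above could be checked with a filter and group-by. The sketch below is one possible way to do it, not the notebook's method: the helper value_summary, the column names, and the sample rows are assumptions made for illustration.

```python
import pandas as pd

# Invented reviews; the real analysis would use the cleaned dataset instead.
reviews = pd.DataFrame({
    "province": ["California", "California", "California", "Oregon", "California"],
    "variety": ["Chardonnay", "Chardonnay", "Cabernet Sauvignon", "Pinot Noir", "Pinot Grigio"],
    "price": [14.0, 16.0, 45.0, 17.0, 12.0],
    "points": [88, 89, 93, 90, 87],
})


def value_summary(df: pd.DataFrame, province: str, low: float, high: float) -> pd.DataFrame:
    """Mean points and review count per variety for one province and price band (inclusive)."""
    in_band = df[(df["province"] == province) & df["price"].between(low, high)]
    return (
        in_band.groupby("variety")["points"]
        .agg(mean_points="mean", reviews="count")
        .sort_values("mean_points", ascending=False)
    )


summary = value_summary(reviews, "California", 12.0, 18.0)
print(summary)
```

Reporting the review count next to the mean helps distinguish a variety that is consistently well rated from one with only a handful of reviews in the price band.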

Conclusions

California is well known for its wine industry and agricultural capabilities.
Overall, wines in the 10.00 - 20.00 USD range frequently have better ratings than more expensive wines.
This could be due to the price point of these wines, or the fact that most consumers drink expensive wines less frequently or only for special occasions.
It makes sense for the wine producers to focus on the market demand for their products and target their resources towards the taste of the public.
There is no correlation between price and quality when comparing the majority of commercial wines.
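
A claim about price and quality can be checked numerically as well as visually. The following is a hedged sketch under assumed column names (price, points) with made-up numbers; the helper names are hypothetical, and the plotting function is defined but not called, so seaborn is only needed if you use it.

```python
import pandas as pd


def price_points_correlation(df: pd.DataFrame) -> float:
    """Pearson correlation between price and points, ignoring rows missing either value."""
    priced = df.dropna(subset=["price", "points"])
    return float(priced["price"].corr(priced["points"]))


def plot_price_vs_points(df: pd.DataFrame):
    """Draw a seaborn jointplot of price against points (seaborn is imported only when called)."""
    import seaborn as sns

    return sns.jointplot(data=df, x="price", y="points")


# Invented values for illustration; the last row has no price and is skipped.
sample = pd.DataFrame({
    "price": [10.0, 20.0, 30.0, 40.0, None],
    "points": [88, 86, 90, 87, 91],
})
r = price_points_correlation(sample)
print(f"Pearson r = {r:.3f}")
```

A value of r near 0 would support the conclusion above, while a value well away from 0 would suggest that price and rating do move together in the data being tested.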
