This project explores a dataset I found on Kaggle with Python, using standard Data Science skills to clean, analyze, visualize, and interpret the data, and to draw scientific meaning from the patterns in it.
Since I have spent the majority of my career in the FoodService and FoodService Equipment Industry, I wanted this project to be related to a culinary dataset of some kind. The Wine Review dataset caught my attention because I have always been suspicious of the idea that quality and price are directly linked in the consumer's mind. Usually "you get what you pay for" is a safe bet, but sometimes a more expensive commodity is not the "best" or preferred choice for consumers. This is especially evident in the Food Industry, where epicurean prices are tied to a subjective palate. I want to analyze the Wine Reviews and see if the more expensive wines are always the highest rated ones by consumers.
I think that this dataset offers some great opportunities for text related predictive models. My overall goal in the future is to use another version of this dataset (with three additional columns) to do some predictive analysis on the text based Description column. Ultimately I am interested in building a bot that could produce a convincing Wine Review. If anyone has any ideas, breakthroughs, or other interesting insights/models, please post them! Feel free to fork as well. I am open to constructive feedback and tips.
- Import and inspect the dataset using `pandas`.
- Analyze the dataset using `pandas` and `numpy`.
- Create visualizations using `matplotlib` and `seaborn`.
- Interpret meanings from the data using the Scientific Method ("Data Science!").
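As a rough illustration of the inspect, clean, and analyze steps listed above, here is a minimal sketch on a tiny made-up table. The rows, the lowercase column names, and the cleaning choices (dropping exact duplicates and rows with no price) are assumptions for illustration and are not necessarily the steps used in the notebook.

```python
import numpy as np
import pandas as pd

# Hypothetical sample rows standing in for the real CSV.
sample = pd.DataFrame(
    {
        "country": ["US", "US", "France", "Italy", "US"],
        "points": [88, 88, 92, 85, 90],
        "price": [15.0, 15.0, 60.0, np.nan, 18.0],
        "variety": [
            "Chardonnay",
            "Chardonnay",
            "Pinot Noir",
            "Pinot Grigio",
            "Cabernet Sauvignon",
        ],
    }
)

# Inspect: table size and number of missing values per column.
print(sample.shape)
print(sample.isna().sum())

# Clean (assumed steps): drop exact duplicate rows and rows with no price.
cleaned = sample.drop_duplicates().dropna(subset=["price"])

# Analyze: simple summary statistics with numpy.
mean_points = np.mean(cleaned["points"])
median_price = np.median(cleaned["price"])
print(f"{len(cleaned)} rows, mean points {mean_points:.1f}, median price {median_price:.2f}")
```

Visualizing the cleaned table with `matplotlib` or `seaborn` would follow from here.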
Python has rich Data Science functionality, driven by teams of scientists and engineers trying to solve scientific and engineering problems. Python's Object Oriented Design, ease of syntax, and available libraries make it the industry standard for Data Analysis. A 2016 study by O'Reilly shows that Python is now dominant over R throughout the Data Science community, favoring Python 3.6 over the soon to be extinct Python 2.7.
Python has become the fastest growing programming language of 2019, and remains the industry standard for modeling and analysis in the scientific and engineering industries. The Scientific Python Stack is the set of technologies that makes Python so powerful for data analysis and statistical prediction.
To get everything running in this project, use `pip install -r requirements.txt`
- Python 3.6 (replacing legacy Python 2.7 in 2020)
- Cython (compiles Python-like code to C for speed)
- SciPy
- NumPy
- scikit-learn
- Anaconda (Python distribution)
- IPython Notebooks
- GitHub (version control)
- RMOTR Notebooks
- Analysis tools
  - NumPy
  - Pandas
  - Cython
- Visualization tools
  - Matplotlib
  - Seaborn
  - Bokeh
I searched for food related datasets using Kaggle's filtering and search and found the Wine Review dataset. I was looking for a medium sized CSV file between 50 MB and 1 GB that would take some processing without turning into a wrangling project. I also wanted to make some visualizations using the `seaborn` library.
The data was scraped from Wine Enthusiast Magazine during the week of June 15th, 2017.
The dataset is a single CSV file of about 50 MB, `winemag-data_first150k.csv`, containing 10 columns and 150,930 rows of wine reviews.
Each record in the dataset represents a single wine review from an online user of Wine Enthusiast Magazine.
The following is a brief summary of the 10 columns of data included in `winemag-data_first150k.csv`:
- Country - The country of origin of the wine.
- Description - The description of the wine's flavor profile.
- Designation - The vineyard where the wine's grapes are sourced.
- Points - The number of points Wine Enthusiast Magazine rated the wine on a scale of 1-100.
- Price - The cost for a single bottle of the wine.
- Province - The province or state that the wine is from.
- Region 1 - The wine growing area in a province or state (for example, Napa Valley in California).
- Region 2 - (Optional) A more specific region in a wine growing area (for example, Rutherford inside Napa Valley).
- Variety - The type of grapes used to make the wine (for example, Pinot Noir).
- Winery - The winery that made the wine.
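To make the column summary concrete, the sketch below loads a CSV shaped like this one and checks that the ten columns are present. The lowercase header names (`region_1`, `region_2`, and so on), the unnamed leading index column, and the helper `load_reviews` are assumptions for illustration; check them against the actual file header before relying on them. The two sample rows are invented so the example runs without the real file.

```python
import io

import pandas as pd

# Assumed header names for the ten columns described above.
EXPECTED_COLUMNS = [
    "country", "description", "designation", "points", "price",
    "province", "region_1", "region_2", "variety", "winery",
]


def load_reviews(source):
    """Read the reviews CSV and check that the ten expected columns are present."""
    # index_col=0 assumes the first CSV column is an unnamed row index.
    df = pd.read_csv(source, index_col=0)
    missing = [col for col in EXPECTED_COLUMNS if col not in df.columns]
    if missing:
        raise ValueError(f"missing columns: {missing}")
    return df[EXPECTED_COLUMNS]


# Made-up two-row sample; pass the real file path to load_reviews instead.
sample_csv = io.StringIO(
    ",country,description,designation,points,price,province,region_1,region_2,variety,winery\n"
    "0,US,Ripe and oaky,Reserve,90,18.0,California,Napa Valley,Napa,Chardonnay,Example Winery\n"
    "1,France,Bright and crisp,,88,,Burgundy,Chablis,,Chardonnay,Domaine Exemple\n"
)
reviews = load_reviews(sample_csv)
print(reviews.dtypes)        # column types inferred by pandas
print(reviews.isna().sum())  # missing values per column
```

With the real data, `load_reviews("winemag-data_first150k.csv")` would be the equivalent call, assuming the file sits in the working directory.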
Check out the WineReviewAnalysis Notebook for the analysis.
After cleaning and inspecting the Wine Reviews dataset, I used numerical and statistical analysis to create visualizations from it. Based on plots of point distributions, jointplots, and heatmaps, the best-value wines among the 150,930 reviews share the following traits:
- Made in California
- A Chardonnay, Pinot Grigio, or Cabernet Sauvignon
- 12.00 - 18.00 USD per bottle
- Highly likely to score 87.5 points or more
- California is well known for its wine producing industry and agricultural capabilities.
- Overall, wines in the 10.00 - 20.00 USD range frequently have better ratings than more expensive wines.
- This could be due to the price point of these wines, or the fact that most consumers drink expensive wines less frequently or only for special occasions.
- It makes sense for wine producers to focus on the market demand for their products and target their resources towards the taste of the public.
- There is no correlation between price and quality when comparing the majority of commercial wines.
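One way to check claims like these against the data is to compare ratings across price bands and compute the price/points correlation directly. The sketch below is an assumption about how that could be done, not the notebook's actual method: the band edges, the helper names `summarize_by_price_band` and `price_points_correlation`, and the lowercase `price` and `points` column names are all hypothetical, and the sample rows are invented, so the printed numbers say nothing about the real dataset.

```python
import numpy as np
import pandas as pd


def summarize_by_price_band(df, edges=(0, 10, 20, 50, np.inf)):
    """Count reviews and average points within each price band."""
    priced = df.dropna(subset=["price", "points"])
    # right=False makes each band include its lower edge, e.g. [10, 20).
    bands = pd.cut(priced["price"], bins=list(edges), right=False)
    return priced.groupby(bands, observed=True)["points"].agg(["count", "mean"])


def price_points_correlation(df):
    """Pearson correlation between price and points, ignoring missing values."""
    priced = df.dropna(subset=["price", "points"])
    return priced["price"].corr(priced["points"])


# Made-up rows for illustration only; these are not values from the dataset.
sample = pd.DataFrame(
    {
        "price": [8.0, 12.0, 15.0, 18.0, 25.0, 60.0, np.nan],
        "points": [84, 88, 89, 87, 88, 90, 86],
    }
)
summary = summarize_by_price_band(sample)
correlation = price_points_correlation(sample)
print(summary)
print(f"price/points correlation: {correlation:.2f}")
```

Running the same two functions on the full review table would show how often each price band appears and whether its average rating differs, which is the comparison the conclusions above rest on.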