

Merge pull request #18 from LaunchCodeEducation/cleaning-pandas
Cleaning Data with Pandas
gildedgardenia authored Mar 12, 2024
2 parents 305009b + 9bb022c commit 5e2bfba
Showing 10 changed files with 441 additions and 0 deletions.
26 changes: 26 additions & 0 deletions content/cleaning-pandas/_index.md
@@ -0,0 +1,26 @@
+++
pre = "<b>15. </b>"
chapter = true
title = "Cleaning Data with Pandas"
date = 2024-02-27T13:59:58-06:00
draft = false
weight = 15
+++

## Learning Objectives

Upon completing all the content in this chapter, you should be able to do the following:

1. Use Pandas to locate and resolve issues related to all four types of dirty data: missing data, irregular data, unnecessary data, and inconsistent data.

## Key Terminology

These are the key terms for this chapter, broken down by the page on which each term first appears. Make note of these terms and their definitions as you read.

### Handling Missing Data

1. interpolation

## Content Links

{{% children %}}
22 changes: 22 additions & 0 deletions content/cleaning-pandas/exercises/_index.md
@@ -0,0 +1,22 @@
+++
title = "Exercises: Cleaning Data with Pandas"
date = 2021-10-01T09:28:27-05:00
draft = false
weight = 2
+++

## Getting Started

Open up `data-analysis-projects/cleaning-data-with-pandas/exercises/PandasCleaningTechniques.ipynb`.

## Code Along

1. Download [Women's E-commerce Clothing Reviews Dataset](https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-reviews).
1. Add it to your Jupyter Notebook.
1. Work through the exercise notebook using the dataset.

## Submitting Your Work

When finished, make sure to push your changes up to GitHub.

Copy the link to your GitHub repository and paste it into the submission box in Canvas for **Exercises: Cleaning Data** and click *Submit*.
13 changes: 13 additions & 0 deletions content/cleaning-pandas/next-steps.md
@@ -0,0 +1,13 @@
+++
title = "Next Steps"
date = 2021-10-01T09:28:27-05:00
draft = false
weight = 4
+++

Now that you have cleaned your data, you are ready to dive into data manipulation with pandas. If you want to review cleaning data with pandas before continuing onward, here are some of our favorite resources:

1. [Working with Missing Data](https://pandas.pydata.org/docs/user_guide/missing_data.html)
1. [Detect and Remove the Outliers using Python](https://www.geeksforgeeks.org/detect-and-remove-the-outliers-using-python/)
1. [Pandas - Fixing Wrong Data](https://www.w3schools.com/python/pandas/pandas_cleaning_wrong_data.asp)
1. [Pandas - Cleaning Data with Wrong Format](https://www.w3schools.com/python/pandas/pandas_cleaning_wrong_format.asp)
10 changes: 10 additions & 0 deletions content/cleaning-pandas/reading/_index.md
@@ -0,0 +1,10 @@
+++
title = "Reading"
date = 2024-02-27T13:59:58-06:00
draft = false
weight = 1
+++

## Reading Content

{{% children %}}
73 changes: 73 additions & 0 deletions content/cleaning-pandas/reading/inconsistent-data/_index.md
@@ -0,0 +1,73 @@
+++
title = "Handling Inconsistent Data"
draft = false
weight = 5
+++

The final type of dirty data we want to clean is inconsistent data. Inconsistent data is data that is not properly formatted for the analysis, such as values stored as strings that should be numbers (`'0'` instead of `0` or `'one hundred'` instead of `100`). Let's study `etsy_sellers` one final time to see how we can detect and handle inconsistent data.

```console
   Seller_Id               Seller    Sales  Total_Rating  Current_Items Star_Seller
0       8967        Orchid Jewels   17,896           4.5             22           0
1     908764          Ducky Ducks    5,478           3.8             10        True
2    7463529          Candy Yarns   89,974           4.8             18        True
3     161729           Parks Pins    6,897           4.9             87        True
4       4217  Sierra's Stationary  112,988           4.3            347           0
5      21378       Star Stitchery   53,483           4.2             52           0
```

Star sellers meet Etsy's highest standard of customer service, so it makes sense that our dataset would have a column designating whether or not someone is a star seller. If we take a look at the new `Star_Seller` column, we can see that all the star sellers are labeled with `'True'` while those who aren't star sellers have `'0'`. We need to resolve this inconsistency before we can do an effective analysis with the `Star_Seller` column. We can convert everything either to booleans or to numbers.

## The Numbers Era of Star Sellers

We are going to start by converting everything to numbers. Once everything in the column is converted to numbers, it will be easier for us to convert the column to booleans. The sellers that are not star sellers are designated with a `'0'` so converting that string to a number is going to be a little more straightforward than converting the string `'True'` to `1`.

1. First, let's focus on turning `'True'` into `'1'`.

```python
etsy_sellers.loc[etsy_sellers['Star_Seller'] == 'True', 'Star_Seller'] = '1'
```

This code replaces a value in the `Star_Seller` column with `'1'` only if that value is currently equal to `'True'`.

1. We can now convert the whole `Star_Seller` column to integers.

```python
etsy_sellers.Star_Seller = etsy_sellers.Star_Seller.astype('int64')
```

`astype()` allows us to convert a dataframe or column of a dataframe to a specific type, in this case, `int64`.

After all of this, our dataframe will look a lot more like this:

```console
   Seller_Id               Seller    Sales  Total_Rating  Current_Items  Star_Seller
0       8967        Orchid Jewels   17,896           4.5             22            0
1     908764          Ducky Ducks    5,478           3.8             10            1
2    7463529          Candy Yarns   89,974           4.8             18            1
3     161729           Parks Pins    6,897           4.9             87            1
4       4217  Sierra's Stationary  112,988           4.3            347            0
5      21378       Star Stitchery   53,483           4.2             52            0
```
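
If you want to confirm the conversion worked, one quick sketch is to check the column's data type:

```python
# After astype(), the Star_Seller column should report int64
print(etsy_sellers['Star_Seller'].dtype)
```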

## The Booleans Era of Star Sellers

With the whole `Star_Seller` column converted to integers, we just have to do one more step to convert the whole column to booleans.

```python
etsy_sellers.Star_Seller = etsy_sellers.Star_Seller.astype('bool')
```

With this step, `etsy_sellers` is going to become:

```console
   Seller_Id               Seller    Sales  Total_Rating  Current_Items  Star_Seller
0       8967        Orchid Jewels   17,896           4.5             22        False
1     908764          Ducky Ducks    5,478           3.8             10         True
2    7463529          Candy Yarns   89,974           4.8             18         True
3     161729           Parks Pins    6,897           4.9             87         True
4       4217  Sierra's Stationary  112,988           4.3            347        False
5      21378       Star Stitchery   53,483           4.2             52        False
```

Whether you convert the column to booleans or stay with integers depends entirely on what you need from your analysis and what you find easier to work with later on. This is the case with a lot of cleaning data. The approach you take to cleaning data is heavily dependent on you and what you are hoping to achieve with your analysis. The key for now is to practice and not be afraid to try something new.
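
As a final aside, the two conversion steps above can be collapsed into one. The sketch below is one possible shortcut, assuming the column contains only the strings `'True'` and `'0'`:

```python
# Map both string labels straight to booleans in a single pass;
# any value outside the mapping would become NaN
etsy_sellers['Star_Seller'] = etsy_sellers['Star_Seller'].map(
    {'True': True, '0': False}
)
```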
16 changes: 16 additions & 0 deletions content/cleaning-pandas/reading/introduction/_index.md
@@ -0,0 +1,16 @@
+++
title = "Revisiting Cleaning Data"
draft = false
weight = 1
+++

As we discussed in the [previous chapter]({{% relref "../../../cleaning-spreadsheets" %}}) on cleaning data, we need to clean our data to ensure that our analysis is accurate. For example, if we want to project the price of a stock several months from now, we would need to use as much data as possible for our analysis. If the data is not clean, our analysis could be thrown off, and depending on how unclean the data is, the predicted price could end up off by hundreds of dollars. This is why we clean our data before diving into further analysis. By cleaning our data first, we can ensure that the data points used in the analysis are what we need.

As we previously covered, there are four types of dirty data:

1. missing data
1. irregular data
1. unnecessary data
1. inconsistent data

We learned lots of different ways to clean data with spreadsheets; now let's see how we can clean data with pandas. Because pandas is built for data analysis, the library comes with ways of handling all four dirty data types. Let's examine each dirty data type and how we can clean it in pandas.
53 changes: 53 additions & 0 deletions content/cleaning-pandas/reading/irregular-data/_index.md
@@ -0,0 +1,53 @@
+++
title = "Handling Irregular Data"
draft = false
weight = 3
+++

Irregular data refers to outliers. Outliers are data points that are abnormal, such as a stock price dropping by over 10x in a day or a heart rate tripling from the resting rate while out on a run. As you approach different outliers, remember that abnormalities do happen in real life, so even when something seems out of the realm of possibility, we should carefully consider what happened before dismissing it and removing it from the dataset.

Take the heart rate example: a patient with a resting heart rate of 100 beats per minute whose rate rose to 300 beats per minute, even after exercise, could suffer disastrous health effects. Is the dataset about people suffering from tachycardia (an increased heart rate) or other cardiovascular health conditions? Or is it about healthy adults and the effects of running on their wellbeing? If it is about tachycardia, then while 300 beats per minute seems like an outlier, we might want to keep it in. If the dataset is about healthy adults engaging in running, then 300 beats per minute might mean that the heart rate was not collected properly or a number was mistyped, and we might want to remove this outlier.

Let's revisit `etsy_sellers` and see if we have any irregular data we should clean.

```console
                Seller    Sales  Total_Rating  Current_Items
0        Orchid Jewels   17,896           4.5             22
1          Ducky Ducks    5,478           3.8             10
2          Candy Yarns   89,974           4.8             18
3           Parks Pins    6,897           4.9             87
4  Sierra's Stationary  112,988           6.7            347
5       Star Stitchery   53,483           4.2             52
```

Because this dataframe is so small, you might be able to spot some data points that look like outliers. Let's dive in and check out how we can investigate outliers and handle their presence in our dataset.

## Descriptive Statistics

We have used descriptive statistics a lot so far, but they really are a data analyst's bread and butter! We might notice that the max and min of `Sales` are pretty far apart, but since Etsy hosts all sorts of sellers, from well-established shops to new businesses, it isn't out of the realm of possibility that all those numbers are actually appropriate.
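
Here is a minimal sketch of pulling those summary statistics; by default, `describe()` summarizes only the numeric columns:

```python
# count, mean, std, min, quartiles, and max for each numeric column
print(etsy_sellers.describe())
```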

However, when we use the `describe()` function and look more closely at `Total_Rating`, we might notice that the max is 6.7, which is an outlier. The highest rating a shop can have on Etsy is 5 stars, so something is up here and we need to investigate. We can then drop the row for Sierra's Stationary by using the `drop()` function.

```python
import numpy as np

# Flag every row whose rating falls outside the valid 0-5 range
outlier = np.where((etsy_sellers['Total_Rating'] < 0.0) | (etsy_sellers['Total_Rating'] > 5.0))[0]
etsy_sellers = etsy_sellers.drop(etsy_sellers.index[outlier])
```

Even though we can visually see where Sierra's Stationary is in the dataframe, if we have one row that is off, we might have others. `np.where()` returns the indices of all rows where the condition is met. In this case, the condition flags any rating below 0 or above 5, that is, any rating outside the valid range.
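
The same filter can also be written with plain boolean masking, sketched here as one possible alternative:

```python
# Keep only rows whose rating lies inside the valid range;
# between() includes both endpoints by default
etsy_sellers = etsy_sellers[etsy_sellers['Total_Rating'].between(0.0, 5.0)]
```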

We can also use visualizations to detect outliers.

## Visualizing Outliers

The two most common visualization types for locating outliers are histograms and scatterplots. Which one you choose depends on what portion of your dataset you want to visualize. In the case of visualizing `Total_Rating`, a histogram might be the better option.

```python
# Plot the distribution of ratings; outliers show up as isolated bars
etsy_sellers["Total_Rating"].plot.hist()
```

We could also use a scatterplot if we wanted to try it out.

```python
etsy_sellers.plot.scatter(x="Seller", y="Total_Rating")
```

pandas comes with a number of different visualizations, so feel free to explore the different styles when on a mission to detect outliers.
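
Box plots are one more style worth trying; they draw values far outside the interquartile range as individual points, which makes outliers easy to spot. A minimal sketch:

```python
# Outliers appear as lone points beyond the plot's whiskers
etsy_sellers.boxplot(column="Total_Rating")
```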
146 changes: 146 additions & 0 deletions content/cleaning-pandas/reading/missing-data/_index.md
@@ -0,0 +1,146 @@
+++
title = "Handling Missing Data"
draft = false
weight = 2
+++

Missing data occurs when the value for a row or column is not actually there. pandas has special data types for missing data, so when you print out a row of a dataframe where data is missing, you will see one of these data types in its place. pandas also has a number of built-in methods that can handle missing data. `None` and `NaN` both represent missing values; however, the two are not actually equivalent. The boolean expression `None == np.nan` evaluates to `False` because `None` is a Python object while `NaN` is a floating point value. If you find yourself needing to code a custom solution to an issue related to missing data, keep this in mind!
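
Here is a quick sketch you can run to see this behavior for yourself (assuming `numpy` and `pandas` are imported as usual):

```python
import numpy as np
import pandas as pd

print(None == np.nan)    # False: a Python object compared to a float
print(np.nan == np.nan)  # False: NaN is not even equal to itself
print(pd.isna(None), pd.isna(np.nan))  # True True: isna() treats both as missing
```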

{{% notice blue Note %}}

pandas has even more types to represent a missing value, such as a data type to represent a missing datetime value. For now, we will focus on `None` and `NaN`.

{{% /notice %}}

pandas can account for missing values when doing summary statistics, so we cannot count on summary statistics to detect our missing values. We need to use built-in functionality to locate these values and handle them. pandas comes with a built-in function called `isna()` to help us here.

{{% notice blue Note %}}

pandas also has a function called `isnull()` which is an alias for `isna()`. You may see this one used frequently online so keep an eye out!

{{% /notice %}}

`isna()` can be run on either a series or a dataframe. Let's first take a look at how this could be used for a series.

```python {linenos=table}
my_series = pd.Series([1,2,np.nan,4,np.nan])
my_series.isna()
```

**Console Output**

```console
0    False
1    False
2     True
3    False
4     True
dtype: bool
```

When you use `isna()` on a series, you get a series in return. Each value in the returned series is either `True` or `False` depending on whether the value in the series was missing or not.

You will get a similar outcome with a dataframe when locating missing values. `isna()` returns a dataframe filled with `True` or `False` depending on whether a value was missing. Now that we have located the missing data, we need to handle it. Depending on what data is missing and why, you can either replace it, remove rows or columns, or further uncover the potential impact of the missing data through interpolation.
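
A dataframe full of `True` and `False` values can be a lot to scan by eye, so a common follow-up, sketched below for our `etsy_sellers` dataframe, is to chain `sum()` onto `isna()` to count the missing values in each column:

```python
# True counts as 1 and False as 0, so summing the boolean
# dataframe gives the number of missing values per column
print(etsy_sellers.isna().sum())
```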

## Removing Rows or Columns with Missing Data

This is possibly the simplest option to start with. To remove a column or row that contains missing data, pandas comes with the `dropna()` function.

Throughout this chapter, we will use the variations on the following dataframe, called `etsy_sellers`, to examine how we can use pandas to clean data.

```console
                Seller    Sales  Total_Rating  Current_Items
0        Orchid Jewels   17,896           4.5             22
1          Ducky Ducks    5,478           NaN             10
2          Candy Yarns   89,974           4.8             18
3           Parks Pins      NaN           4.9            NaN
4  Sierra's Stationary  112,988           4.3            347
5       Star Stitchery   53,483           4.2             52
6                  NaN      NaN           NaN            NaN
```

This dataframe has several missing data points. Let's first examine row 6, which is entirely blank. Assuming this dataset came directly from Etsy, that may indicate a shop in their records that no longer exists. If we are studying currently active Etsy sellers for our analysis, then we don't need this data and can drop the whole row. However, `dropna()` removes all rows that have a missing value, so just running `dropna()` would remove rows 1 and 3 in addition to row 6. pandas functions come with many different options, and we encourage you to always double-check the documentation to see their full scope. The [documentation](https://pandas.pydata.org/docs/dev/reference/api/pandas.DataFrame.dropna.html) specifies how we can drop a row only when all of its data is missing.

```python
etsy_sellers.dropna(how="all")
```

The above code would drop just row 6 because it is the only row with all null values. `dropna()` defaults to dropping rows, but by changing one parameter we could specify that it should drop any column that contains all missing values.

```python
etsy_sellers.dropna(axis="columns", how="all")
```
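
One more `dropna()` option worth knowing about, sketched here as an aside: the `thresh` parameter keeps only the rows that have at least a given number of non-missing values.

```python
# Keep rows with at least 2 non-missing values; a fully
# empty row like row 6 would still be dropped
etsy_sellers.dropna(thresh=2)
```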

## Replacing Missing Values

```console
                Seller    Sales  Total_Rating  Current_Items
0        Orchid Jewels   17,896           4.5             22
1          Ducky Ducks    5,478           NaN             10
2          Candy Yarns   89,974           4.8             18
3           Parks Pins      NaN           4.9            NaN
4  Sierra's Stationary  112,988           4.3            347
5       Star Stitchery   53,483           4.2             52
```

Now that we have removed row 6, we might not want to drop any more columns or rows. We can instead look at replacing the missing values. Whether or not this is a wise decision depends entirely on the situation at hand. Values can be missing for any number of reasons, so before replacing a missing value, you should look into why it is missing. In the case of `etsy_sellers`, we dove in and discovered that if a shop is currently on a break, the system returns `NaN` for the number of current items. Parks Pins is currently on a break, so none of the items in their shop are actually available for sale. In our hypothetical situation, when a shop sells out of its items, the shop is put on a break until new items are added, so 0 makes logical sense as a replacement for those missing values, and that is what we decide to use.

pandas comes with a function called `fillna()` that will help us do this. If we run the following code, we would have a problem though.

```python
etsy_sellers.fillna(0)
```

This code would actually replace every single missing value in the dataframe with 0. We decided to be a little more intentional, though, and want to replace only the missing values in the `Current_Items` column.

```python
cols = {"Current_Items": 0}
etsy_sellers.fillna(value=cols)
```

Using a dictionary, we can specify which column we want to fill and the value to fill it with. This gives us much more flexibility!
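
The dictionary can also hold a different fill value for each column. Here is one possible sketch, imagining for illustration that we wanted to patch `Total_Rating` with the column's mean:

```python
# Fill each column with its own replacement value
cols = {
    "Current_Items": 0,
    "Total_Rating": etsy_sellers["Total_Rating"].mean(),
}
etsy_sellers.fillna(value=cols)
```

We won't actually do this here; the next section handles `Total_Rating` differently, with interpolation.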

## Interpolating Missing Values

Because pandas can account for missing values, we can also interpolate what the missing values might be. **Interpolation** means inserting values into a dataset based on existing trends in the data. The `interpolate()` function includes a `method` parameter that specifies how you want pandas to interpolate the data. It defaults to linear interpolation, meaning pandas fills in the missing values under the assumption that the values are evenly spaced, like points along a line.
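
To see the default in action, here is a small sketch reusing the series from earlier on this page:

```python
# Index 2 is filled with 3.0, halfway between its neighbors 2 and 4;
# the trailing NaN is filled forward with the last value, 4.0
my_series = pd.Series([1, 2, np.nan, 4, np.nan])
print(my_series.interpolate())
```

Back in `etsy_sellers`, after the `fillna()` step from the previous section, the dataframe looks like this: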

```console
                Seller    Sales  Total_Rating  Current_Items
0        Orchid Jewels   17,896           4.5             22
1          Ducky Ducks    5,478           NaN             10
2          Candy Yarns   89,974           4.8             18
3           Parks Pins      NaN           4.9              0
4  Sierra's Stationary  112,988           4.3            347
5       Star Stitchery   53,483           4.2             52
```

The last remaining missing values are in the `Total_Rating` and `Sales` columns. Linear interpolation doesn't make sense in either case. We might want to fill in the missing rating for Ducky Ducks based on the other values around it in the column, so in that case we can use the pad method, which copies the previous valid value forward.

```python
etsy_sellers.interpolate(method="pad")
```

Interpolation can be a bit of a gamble if you don't understand the underlying trends of the dataset, so you may not see it very often.

## Check Your Understanding

{{% notice green Question %}}

True or False: pandas can account for missing values when performing certain calculations such as summary statistics

{{% /notice %}}

<!-- True -->

{{% notice green Question %}}

Which pandas function detects missing values? Select all that apply.

1. `dropna()`
1. `isna()`
1. `interpolate()`
1. `fillna()`
1. `isnull()`

{{% /notice %}}

<!-- 2 and 5 -->