

Merge pull request #18 from LaunchCodeEducation/cleaning-pandas
Cleaning Data with Pandas
gildedgardenia authored Mar 12, 2024
2 parents 305009b + 9bb022c commit 5e2bfba
Showing 10 changed files with 441 additions and 0 deletions.
26 changes: 26 additions & 0 deletions content/cleaning-pandas/_index.md
@@ -0,0 +1,26 @@
+++
pre = "<b>15. </b>"
chapter = true
title = "Cleaning Data with Pandas"
date = 2024-02-27T13:59:58-06:00
draft = false
weight = 15
+++

## Learning Objectives

Upon completing all the content in this chapter, you should be able to do the following:

1. Use Pandas to locate and resolve issues related to all four types of dirty data: missing data, irregular data, unnecessary data, and inconsistent data.

## Key Terminology

These are the key terms for this chapter, broken down by the page on which each term first appears. Make note of these terms and their definitions as you read.

### Handling Missing Data

1. interpolation

## Content Links

{{% children %}}
22 changes: 22 additions & 0 deletions content/cleaning-pandas/exercises/_index.md
@@ -0,0 +1,22 @@
+++
title = "Exercises: Cleaning Data with Pandas"
date = 2021-10-01T09:28:27-05:00
draft = false
weight = 2
+++

## Getting Started

Open up `data-analysis-projects/cleaning-data-with-pandas/exercises/PandasCleaningTechniques.ipynb`.

## Code Along

1. Download [Women's E-commerce Clothing Reviews Dataset](https://www.kaggle.com/nicapotato/womens-ecommerce-clothing-reviews).
1. Add it to your Jupyter Notebook.
1. Work through the exercise notebook using the dataset.

## Submitting Your Work

When finished, make sure to push your changes up to GitHub.

Copy the link to your GitHub repository and paste it into the submission box in Canvas for **Exercises: Cleaning Data** and click *Submit*.
13 changes: 13 additions & 0 deletions content/cleaning-pandas/next-steps.md
@@ -0,0 +1,13 @@
+++
title = "Next Steps"
date = 2021-10-01T09:28:27-05:00
draft = false
weight = 4
+++

Now that you have cleaned your data, you are ready to dive into data manipulation with pandas. If you want to review cleaning data with pandas before continuing onward, here are some of our favorite resources:

1. [Working with Missing Data](https://pandas.pydata.org/docs/user_guide/missing_data.html)
1. [Detect and Remove the Outliers using Python](https://www.geeksforgeeks.org/detect-and-remove-the-outliers-using-python/)
1. [Pandas - Fixing Wrong Data](https://www.w3schools.com/python/pandas/pandas_cleaning_wrong_data.asp)
1. [Pandas - Cleaning Data with Wrong Format](https://www.w3schools.com/python/pandas/pandas_cleaning_wrong_format.asp)
10 changes: 10 additions & 0 deletions content/cleaning-pandas/reading/_index.md
@@ -0,0 +1,10 @@
+++
title = "Reading"
date = 2024-02-27T13:59:58-06:00
draft = false
weight = 1
+++

## Reading Content

{{% children %}}
73 changes: 73 additions & 0 deletions content/cleaning-pandas/reading/inconsistent-data/_index.md
@@ -0,0 +1,73 @@
+++
title = "Handling Inconsistent Data"
draft = false
weight = 5
+++

The final type of dirty data we want to clean is inconsistent data. Inconsistent data is data that is not properly formatted for the analysis, such as values stored as strings that should be numbers (`'0'` instead of `0` or `'one hundred'` instead of `100`). Let's study `etsy_sellers` one final time to see how we can detect and handle inconsistent data.

```console
   Seller_Id               Seller    Sales  Total_Rating  Current_Items Star_Seller
0       8967        Orchid Jewels   17,896           4.5             22           0
1     908764          Ducky Ducks    5,478           3.8             10        True
2    7463529          Candy Yarns   89,974           4.8             18        True
3     161729           Parks Pins    6,897           4.9             87        True
4       4217  Sierra's Stationary  112,988           4.3            347           0
5      21378       Star Stitchery   53,483           4.2             52           0
```

Star sellers meet Etsy's highest standard of customer service, so it makes sense that our dataset would have a column designating whether or not someone is a star seller. If we take a look at the new `Star_Seller` column, we can see that all the star sellers are labeled with `'True'` while those who aren't star sellers have `'0'`. We need to resolve this inconsistency before we can do an effective analysis with the `Star_Seller` column. We can convert everything either to booleans or to numbers.

## The Numbers Era of Star Sellers

We are going to start by converting everything to numbers. Once everything in the column is converted to numbers, it will be easier for us to convert the column to booleans. The sellers that are not star sellers are designated with a `'0'` so converting that string to a number is going to be a little more straightforward than converting the string `'True'` to `1`.

1. First, let's focus on turning `'True'` into `'1'`.

```python
etsy_sellers.loc[etsy_sellers['Star_Seller'] == 'True', 'Star_Seller'] = '1'
```

This code replaces a value in the `Star_Seller` column with `'1'` only if that value is currently equal to `'True'`.

1. We can now convert the whole `Star_Seller` column to integers.

```python
etsy_sellers.Star_Seller = etsy_sellers.Star_Seller.astype('int64')
```

`astype()` allows us to convert a dataframe or column of a dataframe to a specific type, in this case, `int64`.

After all of this, our dataframe will look a lot more like this:

```console
   Seller_Id               Seller    Sales  Total_Rating  Current_Items  Star_Seller
0       8967        Orchid Jewels   17,896           4.5             22            0
1     908764          Ducky Ducks    5,478           3.8             10            1
2    7463529          Candy Yarns   89,974           4.8             18            1
3     161729           Parks Pins    6,897           4.9             87            1
4       4217  Sierra's Stationary  112,988           4.3            347            0
5      21378       Star Stitchery   53,483           4.2             52            0
```
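
If you want to confirm the conversion worked, one quick sketch is to check the column's data type:

```python
# After astype(), the Star_Seller column should report int64
print(etsy_sellers['Star_Seller'].dtype)
```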

## The Booleans Era of Star Sellers

With the whole `Star_Seller` column converted to integers, we just have to do one more step to convert the whole column to booleans.

```python
etsy_sellers.Star_Seller = etsy_sellers.Star_Seller.astype('bool')
```

With this step, `etsy_sellers` is going to become:

```console
   Seller_Id               Seller    Sales  Total_Rating  Current_Items  Star_Seller
0       8967        Orchid Jewels   17,896           4.5             22        False
1     908764          Ducky Ducks    5,478           3.8             10         True
2    7463529          Candy Yarns   89,974           4.8             18         True
3     161729           Parks Pins    6,897           4.9             87         True
4       4217  Sierra's Stationary  112,988           4.3            347        False
5      21378       Star Stitchery   53,483           4.2             52        False
```

Whether you convert the column to booleans or stay with integers depends entirely on what you need from your analysis and what you find easier to work with later on. This is the case with a lot of cleaning data. The approach you take to cleaning data is heavily dependent on you and what you are hoping to achieve with your analysis. The key for now is to practice and not be afraid to try something new.
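
As a final aside, the two conversion steps above can be collapsed into one. The sketch below is one possible shortcut, assuming the column contains only the strings `'True'` and `'0'`:

```python
# Map both string labels straight to booleans in a single pass;
# any value outside the mapping would become NaN
etsy_sellers['Star_Seller'] = etsy_sellers['Star_Seller'].map(
    {'True': True, '0': False}
)
```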
16 changes: 16 additions & 0 deletions content/cleaning-pandas/reading/introduction/_index.md
@@ -0,0 +1,16 @@
+++
title = "Revisiting Cleaning Data"
draft = false
weight = 1
+++

As we discussed in the [previous chapter]({{% relref "../../../cleaning-spreadsheets" %}}) on cleaning data, we need to clean our data to ensure that our analysis is accurate. For example, if we want to project the price of a stock several months from now, we would need to use as much data as possible for our analysis. If the data is not clean, our analysis could be thrown off, and depending on how unclean the data is, the predicted price could end up off by hundreds of dollars. This is why we clean our data before diving into further analysis. By cleaning our data first, we can ensure that the data points used in the analysis are what we need.

As we previously covered, there are four types of dirty data:

1. missing data
1. irregular data
1. unnecessary data
1. inconsistent data

We learned lots of different ways to clean data with spreadsheets; now let's see how we can clean data with pandas. Because pandas is built for data analysis, the library comes with ways of handling all four dirty data types. Let's examine each dirty data type and how we can clean it in pandas.
53 changes: 53 additions & 0 deletions content/cleaning-pandas/reading/irregular-data/_index.md
@@ -0,0 +1,53 @@
+++
title = "Handling Irregular Data"
draft = false
weight = 3
+++

Irregular data refers to outliers. Outliers are data points that are abnormal, such as a stock price dropping by over 10x in a day or a heart rate tripling from the resting rate while out on a run. As you approach different outliers, remember that abnormalities do happen in real life, so even when something seems out of the realm of possibility, we should carefully consider what happened before dismissing it and removing it from the dataset.

Take the heart rate example: a patient with a resting heart rate of 100 beats per minute whose rate rose to 300 beats per minute, even after exercise, could suffer disastrous health effects. Is the dataset about people suffering from tachycardia (an increased heart rate) or other cardiovascular health conditions? Or is it about healthy adults and the effects of running on their wellbeing? If it is about tachycardia, then while 300 beats per minute seems like an outlier, we might want to keep it in. If the dataset is about healthy adults engaging in running, then 300 beats per minute might mean that the heart rate was not collected properly or a number was mistyped, and we might want to remove this outlier.

Let's revisit `etsy_sellers` and see if we have any irregular data we should clean.

```console
                Seller    Sales  Total_Rating  Current_Items
0        Orchid Jewels   17,896           4.5             22
1          Ducky Ducks    5,478           3.8             10
2          Candy Yarns   89,974           4.8             18
3           Parks Pins    6,897           4.9             87
4  Sierra's Stationary  112,988           6.7            347
5       Star Stitchery   53,483           4.2             52
```

Because this dataframe is so small, you might be able to spot some data points that look like outliers. Let's dive in and check out how we can investigate outliers and handle their presence in our dataset.

## Descriptive Statistics

We have used descriptive statistics a lot so far, but they really are a data analyst's bread and butter! We might notice that the max and min of `Sales` are pretty far apart, but since Etsy hosts all sorts of sellers, from well-established shops to new businesses, it isn't out of the realm of possibility that all those numbers are actually appropriate.
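
Here is a minimal sketch of pulling those summary statistics; by default, `describe()` summarizes only the numeric columns:

```python
# count, mean, std, min, quartiles, and max for each numeric column
print(etsy_sellers.describe())
```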

However, when we use the `describe()` function and look more closely at `Total_Rating`, we might notice that the max is 6.7, which is an outlier. The highest rating a shop can have on Etsy is 5 stars, so something is up here and we need to investigate. We can then drop the row for Sierra's Stationary by using the `drop()` function.

```python
import numpy as np

# Flag every row whose rating falls outside the valid 0-5 range
outlier = np.where((etsy_sellers['Total_Rating'] < 0.0) | (etsy_sellers['Total_Rating'] > 5.0))[0]
etsy_sellers = etsy_sellers.drop(etsy_sellers.index[outlier])
```

Even though we can visually see where Sierra's Stationary is in the dataframe, if we have one row that is off, we might have others. `np.where()` returns the indices of all rows where the condition is met. In this case, the condition flags any rating below 0 or above 5, that is, any rating outside the valid range.
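
The same filter can also be written with plain boolean masking, sketched here as one possible alternative:

```python
# Keep only rows whose rating lies inside the valid range;
# between() includes both endpoints by default
etsy_sellers = etsy_sellers[etsy_sellers['Total_Rating'].between(0.0, 5.0)]
```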

We can also use visualizations to detect outliers.

## Visualizing Outliers

The two most common visualization types for locating outliers are histograms and scatterplots. Which one you choose depends on what portion of your dataset you want to visualize. In the case of visualizing `Total_Rating`, a histogram might be the better option.

```python
# Plot the distribution of ratings; outliers show up as isolated bars
etsy_sellers["Total_Rating"].plot.hist()
```

We could also use a scatterplot if we wanted to try it out.

```python
etsy_sellers.plot.scatter(x="Seller", y="Total_Rating")
```

pandas comes with a number of different visualizations, so feel free to explore the different styles when on a mission to detect outliers.
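
Box plots are one more style worth trying; they draw values far outside the interquartile range as individual points, which makes outliers easy to spot. A minimal sketch:

```python
# Outliers appear as lone points beyond the plot's whiskers
etsy_sellers.boxplot(column="Total_Rating")
```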
146 changes: 146 additions & 0 deletions content/cleaning-pandas/reading/missing-data/_index.md
@@ -0,0 +1,146 @@
+++
title = "Handling Missing Data"
draft = false
weight = 2
+++

Missing data occurs when the value for a row or column is not actually there. pandas has special data types for missing data, so when you print out a row of a dataframe where data is missing, you will see one of these data types in its place. pandas also has a number of built-in methods that can handle missing data. `None` and `NaN` both represent missing values; however, the two are not actually equivalent. The boolean expression `None == np.nan` evaluates to `False` because `None` is a Python object while `NaN` is a floating point value. If you find yourself needing to code a custom solution to an issue related to missing data, keep this in mind!
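
Here is a quick sketch you can run to see this behavior for yourself (assuming `numpy` and `pandas` are imported as usual):

```python
import numpy as np
import pandas as pd

print(None == np.nan)    # False: a Python object compared to a float
print(np.nan == np.nan)  # False: NaN is not even equal to itself
print(pd.isna(None), pd.isna(np.nan))  # True True: isna() treats both as missing
```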

{{% notice blue Note %}}

pandas has even more types to represent a missing value, such as a data type to represent a missing datetime value. For now, we will focus on `None` and `NaN`.

{{% /notice %}}

pandas can account for missing values when doing summary statistics, so we cannot count on summary statistics to detect our missing values. We need to use built-in functionality to locate these values and handle them. pandas comes with a built-in function called `isna()` to help us here.

{{% notice blue Note %}}

pandas also has a function called `isnull()` which is an alias for `isna()`. You may see this one used frequently online so keep an eye out!

{{% /notice %}}

`isna()` can be run on either a series or a dataframe. Let's first take a look at how this could be used for a series.

```python {linenos=table}
my_series = pd.Series([1,2,np.nan,4,np.nan])
my_series.isna()
```

**Console Output**

```console
0    False
1    False
2     True
3    False
4     True
dtype: bool
```

When you use `isna()` on a series, you get a series in return. Each value in the returned series is either `True` or `False` depending on whether the value in the series was missing or not.

You will get a similar outcome with a dataframe when locating missing values. `isna()` returns a dataframe filled with `True` or `False` depending on whether a value was missing. Now that we have located the missing data, we need to handle it. Depending on what data is missing and why, you can either replace it, remove rows or columns, or further uncover the potential impact of the missing data through interpolation.
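
A dataframe full of `True` and `False` values can be a lot to scan by eye, so a common follow-up, sketched below for our `etsy_sellers` dataframe, is to chain `sum()` onto `isna()` to count the missing values in each column:

```python
# True counts as 1 and False as 0, so summing the boolean
# dataframe gives the number of missing values per column
print(etsy_sellers.isna().sum())
```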

## Removing Rows or Columns with Missing Data

This is possibly the simplest option to start with. To remove a column or row that contains missing data, pandas comes with the `dropna()` function.

Throughout this chapter, we will use the variations on the following dataframe, called `etsy_sellers`, to examine how we can use pandas to clean data.

```console
                Seller    Sales  Total_Rating  Current_Items
0        Orchid Jewels   17,896           4.5             22
1          Ducky Ducks    5,478           NaN             10
2          Candy Yarns   89,974           4.8             18
3           Parks Pins      NaN           4.9            NaN
4  Sierra's Stationary  112,988           4.3            347
5       Star Stitchery   53,483           4.2             52
6                  NaN      NaN           NaN            NaN
```

This dataframe has several missing data points. Let's first examine row 6, which is entirely blank. Assuming this dataset came directly from Etsy, that may indicate a shop in their records that no longer exists. If we are studying currently active Etsy sellers for our analysis, then we don't need this data and can drop the whole row. However, `dropna()` removes all rows that have a missing value, so just running `dropna()` would remove rows 1 and 3 in addition to row 6. pandas functions come with many different options, and we encourage you to always double-check the documentation to see their full scope. The [documentation](https://pandas.pydata.org/docs/dev/reference/api/pandas.DataFrame.dropna.html) specifies how we can drop a row only when all of its data is missing.

```python
etsy_sellers.dropna(how="all")
```

The above code would drop just row 6 because it is the only row with all null values. `dropna()` defaults to dropping rows, but by changing one parameter we could specify that it should drop any column that contains all missing values.

```python
etsy_sellers.dropna(axis="columns", how="all")
```
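
One more `dropna()` option worth knowing about, sketched here as an aside: the `thresh` parameter keeps only the rows that have at least a given number of non-missing values.

```python
# Keep rows with at least 2 non-missing values; a fully
# empty row like row 6 would still be dropped
etsy_sellers.dropna(thresh=2)
```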

## Replacing Missing Values

```console
                Seller    Sales  Total_Rating  Current_Items
0        Orchid Jewels   17,896           4.5             22
1          Ducky Ducks    5,478           NaN             10
2          Candy Yarns   89,974           4.8             18
3           Parks Pins      NaN           4.9            NaN
4  Sierra's Stationary  112,988           4.3            347
5       Star Stitchery   53,483           4.2             52
```

Now that we have removed row 6, we might not want to drop any more columns or rows. We can instead look at replacing the missing values. Whether or not this is a wise decision depends entirely on the situation at hand. Values can be missing for any number of reasons, so before replacing a missing value, you should look into why it is missing. In the case of `etsy_sellers`, we dove in and discovered that if a shop is currently on a break, the system returns `NaN` for the number of current items. Parks Pins is currently on a break, so none of the items in their shop are actually available for sale. In our hypothetical situation, when a shop sells out of its items, the shop is put on a break until new items are added, so 0 makes logical sense as a replacement for those missing values, and that is what we decide to use.

pandas comes with a function called `fillna()` that will help us do this. If we run the following code, we would have a problem though.

```python
etsy_sellers.fillna(0)
```

This code would actually replace every single missing value in the dataframe with 0. We decided to be a little more intentional, though, and want to replace only the missing values in the `Current_Items` column.

```python
cols = {"Current_Items": 0}
etsy_sellers.fillna(value=cols)
```

Using a dictionary, we can specify which column we want to fill and the value to fill it with. This gives us much more flexibility!
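
The dictionary can also hold a different fill value for each column. Here is one possible sketch, imagining for illustration that we wanted to patch `Total_Rating` with the column's mean:

```python
# Fill each column with its own replacement value
cols = {
    "Current_Items": 0,
    "Total_Rating": etsy_sellers["Total_Rating"].mean(),
}
etsy_sellers.fillna(value=cols)
```

We won't actually do this here; the next section handles `Total_Rating` differently, with interpolation.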

## Interpolating Missing Values

Because pandas can account for missing values, we can also interpolate what the missing values might be. **Interpolation** means inserting values into a dataset based on existing trends in the data. The `interpolate()` function includes a `method` parameter that specifies how you want pandas to interpolate the data. It defaults to linear interpolation, meaning pandas fills in the missing values under the assumption that the values are evenly spaced, like points along a line.
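
To see the default in action, here is a small sketch reusing the series from earlier on this page:

```python
# Index 2 is filled with 3.0, halfway between its neighbors 2 and 4;
# the trailing NaN is filled forward with the last value, 4.0
my_series = pd.Series([1, 2, np.nan, 4, np.nan])
print(my_series.interpolate())
```

Back in `etsy_sellers`, after the `fillna()` step from the previous section, the dataframe looks like this: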

```console
                Seller    Sales  Total_Rating  Current_Items
0        Orchid Jewels   17,896           4.5             22
1          Ducky Ducks    5,478           NaN             10
2          Candy Yarns   89,974           4.8             18
3           Parks Pins      NaN           4.9              0
4  Sierra's Stationary  112,988           4.3            347
5       Star Stitchery   53,483           4.2             52
```

The last remaining missing values are in the `Total_Rating` and `Sales` columns. Linear interpolation doesn't make sense in either case. We might want to fill in the missing rating for Ducky Ducks based on the other values around it in the column, so in that case we can use the pad method, which copies the previous valid value forward.

```python
etsy_sellers.interpolate(method="pad")
```

Interpolation can be a bit of a gamble if you don't understand the underlying trends of the dataset, so you may not see it very often.

## Check Your Understanding

{{% notice green Question %}}

True or False: pandas can account for missing values when performing certain calculations such as summary statistics

{{% /notice %}}

<!-- True -->

{{% notice green Question %}}

Which pandas function detects missing values? Select all that apply.

1. `dropna()`
1. `isna()`
1. `interpolate()`
1. `fillna()`
1. `isnull()`

{{% /notice %}}

<!-- 2 and 5 -->