Merge pull request #22 from LaunchCodeEducation/data-manipulation
Data manipulation
Showing 19 changed files with 497 additions and 4 deletions.
+++
pre = "<b>16. </b>"
chapter = true
title = "Data Manipulation"
date = 2024-03-12T15:04:03-05:00
draft = false
weight = 16
+++

## Learning Objectives
After completing all of the content in this chapter, you should be able to do the following:
1. Aggregate data across multiple columns (mean, median, mode)
1. Append data: stack or concatenate multiple datasets with the `.concat` function
1. Recode and map values within a column to new values by providing conditional formatting
1. Group data together with the `.groupby` function
1. Merge datasets together based on the provided columns or indices
1. Restructure data between long and wide formats with pandas melt and pivot functionality

## Key Terminology

### Aggregation
1. GroupBy

### Recoding Data
1. `.replace()`
1. `.apply()`

### Reshaping Tables
1. `.melt()`
1. `.concat()`
1. `.sort_values()`
1. wide format
1. long format

## Content Links

{{% children %}}
+++
title = "Exercises"
date = 2021-10-01T09:28:27-05:00
draft = false
weight = 2
+++

## Getting Started

Open `data-analysis-projects/data-manipulation/exercises` in Jupyter Notebook.

## Code Along

Complete the notebooks in the GitHub repository you cloned, in the following order:

1. `DataManipulationWorkbook`
1. `MergingTables`

## Submitting Your Work

When finished, make sure to push your changes up to GitHub.

Copy the link to your GitHub repository, paste it into the submission box in Canvas for **Exercises: Data Manipulation**, and click *Submit*.
+++
title = "Next Steps"
date = 2021-10-01T09:28:27-05:00
draft = false
weight = 4
+++

Now that you have learned about manipulating data, you are ready to dive into data visualization. If you want to review data manipulation with pandas before continuing onward, here are some of our favorite resources:

1. [Python | Pandas dataframe.aggregate()](https://www.geeksforgeeks.org/python-pandas-dataframe-aggregate/)
1. [Python | Pandas dataframe.groupby()](https://www.geeksforgeeks.org/python-pandas-dataframe-groupby/)
1. [How to create new columns derived from existing columns?](https://pandas.pydata.org/docs/getting_started/intro_tutorials/05_add_columns.html)
1. [Recode Data](https://pythonfordatascienceorg.wordpress.com/recode-data/)
1. [How to manipulate textual data?](https://pandas.pydata.org/docs/getting_started/intro_tutorials/10_text_data.html)
1. [How to reshape the layout of tables?](https://pandas.pydata.org/docs/getting_started/intro_tutorials/07_reshape_table_layout.html)
1. [pandas.pivot_table](https://pandas.pydata.org/docs/reference/api/pandas.pivot_table.html)
1. [pandas.melt](https://pandas.pydata.org/docs/reference/api/pandas.melt.html)
1. [pandas.pivot](https://pandas.pydata.org/docs/reference/api/pandas.pivot.html)
+++
title = "Reading"
date = 2024-03-12T15:04:03-05:00
draft = false
weight = 1
+++

## Reading Content

{{% children %}}
+++
title = "Aggregation"
date = 2024-03-12T15:04:03-05:00
draft = false
weight = 1
+++

{{% notice blue Note "rocket" %}}
This reading, and the readings that follow, provide examples from the `titanic.csv` dataset, which is also used in the exercise portion of this chapter.
{{% /notice %}}

## Groupby

The `.groupby()` function groups data together from one or more columns. As we group the data together, it forms a new **GroupBy** object. The official [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html) states that a "group by" accomplishes the following (a small sketch follows the list):
1. Splitting: Split the data based on the criteria provided.
1. Applying: Apply a function to each of the groups that were split.
1. Combining: Combine the results from the function into a new data structure.
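
A minimal sketch of these three steps on a small, made-up DataFrame (the column names and values are illustrative only, not from the Titanic data):

```python
import pandas as pd

# a small, made-up DataFrame used only for illustration
df = pd.DataFrame({
    "town": ["S", "C", "S", "Q"],
    "fare": [7.0, 71.0, 8.0, 8.5],
})

groups = df.groupby("town")        # splitting: one group per unique town
totals = groups["fare"].sum()      # applying + combining: sum each group's fares
print(totals)                      # a new Series indexed by town
```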

### Syntax

Syntax for the `.groupby()` method when providing a single column as a parameter is as follows:

```python
grouping_variable = your_data.groupby("column_name")
```

{{% notice blue Example "rocket" %}}
Let's take things a step further and aggregate the grouped data using the `.sum()` function through method chaining:

```python
grouping_variable = your_data.groupby(["column_name"]).sum()
```

The above code returns the sum of the values in every other column for each unique value in the grouped column, one row per group.
{{% /notice %}}

The `.groupby()` method can take multiple columns as a parameter, but it is best practice to provide only as many columns as your analysis needs. As you increase the number of grouped columns, you also increase the compute power and memory required, which can lead to performance issues.

In order to group by multiple columns, pass a list of column names to the `.groupby()` method.

```python
grouping_variable = your_data.groupby(["column_one", "column_two", "etc.."])
```

{{% notice blue Example "rocket" %}}
Applying an aggregate function to multiple grouped columns can also be accomplished with method chaining. The following image uses columns from the Titanic dataset as an example.

![Creating a new groupby object from the columns "embark_town" and "alone" and applying the sum aggregate function](pictures/grouped-titanic.png?classes=border)

The image below displays the output when applying the `groupby()` method to only the `embark_town` column.

![Applying a groupby method to only the "embark_town" column within the titanic.csv dataset to view the output](pictures/groupby-embark-town.png?classes=border)

The key thing to note here is that grouping columns together produces a dataset specific to that grouping. When the `embark_town` column was grouped with the `alone` column, the result was an aggregate of the entire dataset in relation to those two columns. When `embark_town` was grouped alone, the result was an aggregate of the entire dataset only as it relates to the `embark_town` column.
{{% /notice %}}
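
If the screenshots are hard to read, the two calls pictured above look roughly like this, assuming the dataset has already been loaded into a DataFrame named `data`:

```python
# one aggregate row per (embark_town, alone) combination
data.groupby(["embark_town", "alone"]).sum()

# one aggregate row per embark_town value
data.groupby("embark_town").sum()
```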

## Aggregate Methods

pandas provides a built-in aggregate method: `DataFrame.aggregate()`, or `DataFrame.agg()` for short (both accomplish the same thing; `agg()` is an alias for `aggregate()`). The benefit of using `.aggregate()` is that it allows you to pass aggregate functions as a list.

{{% notice blue Example "rocket" %}}
```python
data.agg(['mean', 'median', 'mode'])
```
{{% /notice %}}
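
As a rough, self-contained sketch of the same idea (the DataFrame below is made up for illustration and is not the Titanic data), passing a list to `.agg()` returns one row per function:

```python
import pandas as pd

# made-up numeric data, used only to show the shape of the result
df = pd.DataFrame({"a": [1, 2, 3, 4], "b": [10.0, 20.0, 20.0, 30.0]})

# one row per aggregate function, one column per original column
print(df.agg(["mean", "median"]))
```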

### Aggregation Using a Dictionary

pandas also allows you to provide a dictionary with column names as keys and lists of aggregate functions as values.

{{% notice blue Example "rocket" %}}
```python {linenos=table}
aggregate_dictionary_example = {
    "embark_town": ["count"],
    "age": ["count", "median"]
}

dictionary_aggregate = data.agg(aggregate_dictionary_example)
```

This dictionary is a template for the aggregations we want to perform. On its own, it does nothing; once passed to the `.agg()` method, it picks out the specific columns we want to examine and applies the listed functions to each, producing a subset table.
{{% /notice %}}
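
A rough, self-contained sketch of the same pattern on made-up data; a function listed for one column but not another shows up as `NaN` in the result:

```python
import pandas as pd

# made-up data standing in for the Titanic columns used above
df = pd.DataFrame({"embark_town": ["S", "C", "S"], "age": [22.0, 38.0, 26.0]})

aggregate_dictionary_example = {
    "embark_town": ["count"],
    "age": ["count", "median"],
}

# returns a DataFrame indexed by function name; embark_town/median is NaN
# because "median" was only requested for the age column
print(df.agg(aggregate_dictionary_example))
```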

## Groupby and Multiple Aggregations

A common strategy when applying multiple aggregations to a group or dataset is to hold the list of functions in a variable. The advantage is that you do not have to retype the list of functions every time you need it.

{{% notice blue Example "rocket" %}}
```python {linenos=table}
import pandas as pd

# "mode" is not one of the aggregation names pandas recognizes for a groupby,
# so the pd.Series.mode function is passed instead
aggregate_functions = ["mean", "median", pd.Series.mode]

grouping_variable = your_data.groupby(["column_one", "column_two", "etc.."])

grouping_variable.agg(aggregate_functions)
```
{{% /notice %}}
Binary file added (+132 KB): content/data-manipulation/reading/aggregation/pictures/groupby-embark-town.png

Binary file added (+150 KB): content/data-manipulation/reading/aggregation/pictures/grouped-titanic.png
+++
title = "Recoding Data"
date = 2024-03-12T15:04:03-05:00
draft = false
weight = 2
+++

## Creating New Columns

As questions arise during your data exploration and cleaning, you might want to investigate them without altering the original values. One way to keep the values you plan to manipulate untouched is to add a new column that will contain the manipulated data.

{{% notice blue Example "rocket" %}}
```python
import pandas as pd

data = pd.read_csv("titanic.csv")

data["survived_reformatted"] = data["survived"].replace({0: False, 1: True})
```

The above code accomplishes the following:
1. Imports pandas and loads the Titanic dataset
1. Creates a new column called `survived_reformatted` from the `survived` column after replacing all `0` and `1` integers with `False` and `True`, respectively.

Viewing the output of the dataframe, you are able to see that a new column called `survived_reformatted` was created:

![Displaying the output of our dataframe using the data.head() function](pictures/survived-reformatted.png?classes=border)
{{% /notice %}}

## Replacing Values

Replacing values within a column to make them more data friendly is a common practice. In particular, strings are often replaced with booleans, so that a "yes" or "no" becomes `True` or `False`. We can accomplish this by using the `.replace()` function.

The following example replaces the data that exists within the column directly, without creating a new column from the manipulation.

{{% notice blue Example "rocket" %}}
Replace the `0` and `1` integer values within the `survived` column of the Titanic dataset with `False` or `True` by passing a dictionary as an argument to the `to_replace` parameter.

```python
import pandas as pd

data = pd.read_csv("titanic.csv")

data["survived"] = data["survived"].replace(to_replace={0: False, 1: True})
```
{{% /notice %}}
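
The string-to-boolean case described above can be sketched the same way; the Titanic `alive` column, which holds "yes"/"no" strings, is a reasonable candidate:

```python
import pandas as pd

data = pd.read_csv("titanic.csv")

# "yes"/"no" strings become True/False booleans
data["alive"] = data["alive"].replace(to_replace={"yes": True, "no": False})
```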

## Using Functions to Manipulate Data

Creating a function to aggregate data or create new columns is another common practice when analyzing data. pandas provides the `.apply()` method to execute a function on a pandas Series or DataFrame.

{{% notice blue Example "rocket" %}}
Suppose you wanted to know how many passengers in the Titanic dataset aged 20 or younger survived:

```python
import pandas as pd

data = pd.read_csv("titanic.csv")

# With axis=1, .apply() passes each row to the function as a Series
def under_age_21_survivors(row):
    age = row['age']
    alive = row['alive']

    if age <= 20 and alive == "yes":
        return True
    else:
        return False

data["under_21_survivors"] = data.apply(under_age_21_survivors, axis=1)
print(data["under_21_survivors"].value_counts())
```

**Output**

![pandas function that applies conditional formatting to a dataframe checking if survivors under the age of 21 are still alive](pictures/under-age-21-survivors.png?classes=border)
{{% /notice %}}
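
As a side note, the same column can usually be built without a helper function by combining vectorized comparisons; this sketch assumes the `age` and `alive` columns shown above and should produce the same True/False counts:

```python
import pandas as pd

data = pd.read_csv("titanic.csv")

# vectorized equivalent of the row-wise function above
data["under_21_survivors"] = (data["age"] <= 20) & (data["alive"] == "yes")
print(data["under_21_survivors"].value_counts())
```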

## Summary

When recoding your data, there are some things you should think about:
1. Does the original data need to remain intact?
1. What data types should be replaced with new values, and what type of data should the new value be?
1. Would a function be useful for repetitive tasks and manipulations?
Binary file added (+104 KB): content/data-manipulation/reading/recoding-data/pictures/survived-reformatted.png

Binary file added (+5.13 KB): ...ent/data-manipulation/reading/recoding-data/pictures/under-age-21-survivors.png