Merge pull request #22 from LaunchCodeEducation/data-manipulation
Data manipulation
jwoolbright23 authored Mar 21, 2024
2 parents e8f7bce + 37e71ce commit 2b931ba
Showing 19 changed files with 497 additions and 4 deletions.
4 changes: 2 additions & 2 deletions content/assignments/assignment4/_index.md
Original file line number Diff line number Diff line change
Expand Up @@ -12,9 +12,9 @@ is broken into five checkpoints and a final presentation.

<!-- TODO: Update these links as you work -->

1. [Selecting your business issue and dataset]({{< relref "./checkpoint-1" >}})
1. [Selecting your business issue and dataset]({{% relref "./checkpoint-1" %}})
1. [EDA]()
1. [Cleaning data]({{< relref "./checkpoint-3" >}})
1. [Cleaning data]({{% relref "./checkpoint-3" %}})
1. [Manipulate, interpret, and visualize data]()
1. [Modelling data]()
1. [Final Project Fair]()
Expand Down
2 changes: 1 addition & 1 deletion content/assignments/assignment4/checkpoint-1/_index.md
Original file line number Diff line number Diff line change
Expand Up @@ -99,4 +99,4 @@ word-processing program. Put your name in the right-hand corner and type up your
issue and provide the link to your chosen dataset. Submit your document on the Canvas
submission page for Graded Assignment #4: Checkpoint 1.

[Back to Final Project Overview]({{< relref "./../" >}})
[Back to Final Project Overview]({{% relref "./../" %}})
2 changes: 1 addition & 1 deletion content/assignments/assignment4/checkpoint-3/_index.md
Original file line number Diff line number Diff line change
Expand Up @@ -23,4 +23,4 @@ Checkpoint 3 examples can be found [here](https://github.com/LaunchCodeEducation

When finished cleaning your data, make sure to push your changes up to GitHub including your new cleaned dataset. Copy the link to your GitHub repository and paste it into the submission box in Canvas for Graded Assignment #4: Checkpoint 3 and click Submit.

[Back to Final Project Overview]({{< relref "./../" >}})
[Back to Final Project Overview]({{% relref "./../" %}})
37 changes: 37 additions & 0 deletions content/data-manipulation/_index.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,37 @@
+++
pre = "<b>16. </b>"
chapter = true
title = "Data Manipulation"
date = 2024-03-12T15:04:03-05:00
draft = false
weight = 16
+++

## Learning Objectives
After completing all of the content in this chapter, you should be able to do the following:
1. Aggregate data across multiple columns (mean, median, mode)
1. Append data: stack or concatenate multiple datasets with the `.concat` function
1. Recode and map values within a column to new values, including conditionally
1. Group data together with the `.groupby` function
1. Merge datasets together based on the provided columns or indices
1. Restructure data between long and wide formats with pandas melt and pivot functionality
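The last objective, reshaping between wide and long formats, can be previewed with a minimal sketch; the table and column names below are invented purely for illustration:

```python
import pandas as pd

# a small, invented wide-format table: one row per student,
# one column per subject
wide = pd.DataFrame({
    "student": ["Ann", "Ben"],
    "math": [90, 80],
    "science": [85, 95],
})

# wide -> long: melt() stacks the subject columns into rows
long_format = wide.melt(id_vars="student", var_name="subject", value_name="score")

# long -> wide: pivot() restores the original layout
restored = long_format.pivot(index="student", columns="subject", values="score")
```

Here `melt()` turns each subject column into its own row of `(student, subject, score)`, and `pivot()` reverses the reshaping.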

## Key Terminology

### Aggregation
1. GroupBy

### Recoding Data
1. `.replace()`
1. `.apply()`

### Reshaping Tables
1. `.melt()`
1. `.concat()`
1. `.sort_values()`
1. wide format
1. long format

## Content Links

{{% children %}}
23 changes: 23 additions & 0 deletions content/data-manipulation/exercises/_index.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,23 @@
+++
title = "Exercises"
date = 2021-10-01T09:28:27-05:00
draft = false
weight = 2
+++

## Getting Started

Open the notebooks in `data-analysis-projects/data-manipulation/exercises` in Jupyter Notebook.

## Code Along

Complete the notebooks in the GitHub repository you cloned in the following order:

1. `DataManipulationWorkbook`
1. `MergingTables`

## Submitting Your Work

When finished, make sure to push your changes up to GitHub.

Copy the link to your GitHub repository and paste it into the submission box in Canvas for **Exercises: Data Manipulation** and click *Submit*.
18 changes: 18 additions & 0 deletions content/data-manipulation/next-steps.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
+++
title = "Next Steps"
date = 2021-10-01T09:28:27-05:00
draft = false
weight = 4
+++

Now that you have learned how to manipulate data, you are ready to dive into data visualization. If you want to review data manipulation with pandas before continuing onward, here are some of our favorite resources:

1. [Python | Pandas dataframe.aggregate()](https://www.geeksforgeeks.org/python-pandas-dataframe-aggregate/)
1. [Python | Pandas dataframe.groupby()](https://www.geeksforgeeks.org/python-pandas-dataframe-groupby/)
1. [How to create new columns derived from existing columns?](https://pandas.pydata.org/docs/getting_started/intro_tutorials/05_add_columns.html)
1. [Recode Data](https://pythonfordatascienceorg.wordpress.com/recode-data/)
1. [How to manipulate textual data?](https://pandas.pydata.org/docs/getting_started/intro_tutorials/10_text_data.html)
1. [How to reshape the layout of tables?](https://pandas.pydata.org/docs/getting_started/intro_tutorials/07_reshape_table_layout.html)
1. [pandas.pivot_table](https://pandas.pydata.org/docs/reference/api/pandas.pivot_table.html)
1. [pandas.melt](https://pandas.pydata.org/docs/reference/api/pandas.melt.html)
1. [pandas.pivot](https://pandas.pydata.org/docs/reference/api/pandas.pivot.html)
10 changes: 10 additions & 0 deletions content/data-manipulation/reading/_index.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
+++
title = "Reading"
date = 2024-03-12T15:04:03-05:00
draft = false
weight = 1
+++

## Reading Content

{{% children %}}
97 changes: 97 additions & 0 deletions content/data-manipulation/reading/aggregation/_index.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,97 @@
+++
title = "Aggregation"
date = 2024-03-12T15:04:03-05:00
draft = false
weight = 1
+++

{{% notice blue Note "rocket" %}}
This reading, and the readings that follow, provide examples from the `titanic.csv` dataset, which will also be used in the exercise portion of this chapter.
{{% /notice %}}

## Groupby

The `.groupby()` function groups data together from one or more columns. As the data is grouped, it forms a new **GroupBy** object. The official [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html) states that a "group by" accomplishes the following:
1. Splitting: Split the data based on the criteria provided.
1. Applying: Apply a function to each of the resulting groups.
1. Combining: Combine the results of the function into a new data structure.

### Syntax

Syntax for the `.groupby()` method when providing a single column as a parameter is as follows:

```python
grouping_variable = your_data.groupby("column-name")
```

{{% notice blue Example "rocket" %}}
Let's take things a step further and aggregate the grouped data using the `.sum()` function through method chaining:

```python
grouping_variable = your_data.groupby(["column_name"]).sum()
```

The above code groups the rows by each unique value in the provided column and returns the sum of the remaining numeric columns for each group.
{{% /notice %}}

The `.groupby()` method can take multiple columns as a parameter, but it is best practice to provide only as many columns as your analysis needs. As you increase the number of grouped columns, you also increase the compute power and memory required, which can lead to performance issues.

To group by multiple columns, pass a list of column names to the `.groupby()` method.

```python
grouping_variable = your_data.groupby(["column_one", "column_two", "etc.."])
```

{{% notice blue Example "rocket" %}}
Applying an aggregate function to multiple grouped columns can also be accomplished with method chaining. The following image uses columns from the titanic dataset as an example.

![Creating a new groupby object from the columns "embark_town" and "alone" and applying the sum aggregate function](pictures/grouped-titanic.png?classes=border)

The image below displays the output when applying the `.groupby()` method to only the `embark_town` column.

![Applying a groupby method to only the "embark_town" column within the titanic.csv dataset to view the output](pictures/groupby-embark-town.png?classes=border)

The key thing to note here is that grouping multiple columns together produces a dataset specific to that combination of columns. When the `embark_town` column was grouped with the `alone` column, the result was an aggregate of the entire dataset in relation to those two columns. When `embark_town` was grouped alone, the result was an aggregate of the entire dataset only as it relates to the `embark_town` column.
{{% /notice %}}
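As a minimal, self-contained sketch of the same comparison (the rows below are invented, but the column names mirror the titanic dataset):

```python
import pandas as pd

# invented rows whose column names mirror the titanic dataset
data = pd.DataFrame({
    "embark_town": ["Southampton", "Southampton", "Cherbourg", "Cherbourg"],
    "alone": [True, False, True, True],
    "fare": [7.25, 71.28, 8.05, 30.00],
})

# grouping by two columns aggregates per (embark_town, alone) pair
by_town_alone = data.groupby(["embark_town", "alone"]).sum()

# grouping by one column aggregates per embark_town only
by_town = data.groupby("embark_town").sum()
```

The two-column grouping keeps a separate fare total for passengers traveling alone versus accompanied, while the single-column grouping collapses each town into one row.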

## Aggregate Methods

pandas provides a built-in aggregate method, `.aggregate()`, along with its alias `.agg()` (both accomplish the same thing; `agg()` is simply shorter). The benefit of using `.aggregate()` is that it allows you to pass aggregate functions as a list.


{{% notice blue Example "rocket" %}}
```python
# select numeric columns; "mode" can return multiple values per column,
# so it is often computed separately with .mode()
data[["age", "fare"]].agg(["mean", "median"])
```
{{% /notice %}}

### Aggregation Using a Dictionary

pandas also allows you to provide a dictionary with columns as keys and aggregate functions as the associated values.

{{% notice blue Example "rocket" %}}
```python {linenos=table}
aggregate_dictionary_example = {
"embark_town": ["count"],
"age": ["count", "median"]
}

dictionary_aggregate = data.agg(aggregate_dictionary_example)
```

This dictionary acts as a template for the aggregations we want to perform. On its own, it does nothing; once passed to the `.agg()` method, it selects the specific columns and aggregations we want to examine, producing a subset table.
{{% /notice %}}
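A runnable sketch of the dictionary approach, using a few invented rows in place of the full titanic dataset:

```python
import pandas as pd

# invented rows standing in for the full titanic dataset
data = pd.DataFrame({
    "embark_town": ["Southampton", "Cherbourg", "Cherbourg"],
    "age": [22.0, 38.0, None],
})

# columns as keys, lists of aggregate functions as values
aggregate_dictionary_example = {
    "embark_town": ["count"],
    "age": ["count", "median"],
}

dictionary_aggregate = data.agg(aggregate_dictionary_example)
```

The result is a small table whose rows are the function names and whose columns are the requested columns; cells with no matching aggregation (such as a median of `embark_town`) are left as `NaN`. Note that `count` ignores missing values, so the `age` count is smaller than the row count here.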

## Groupby and Multiple Aggregations

A common strategy when applying multiple aggregations to a group or dataset is to hold the aggregation functions in a variable. The advantage is that you do not have to provide the list of functions as a parameter every time.

{{% notice blue Example "rocket" %}}
```python {linenos=table}
# GroupBy has no built-in "mode" aggregation, so a lambda stands in for it
aggregate_functions = ["mean", "median", lambda s: s.mode().iloc[0]]

grouping_variable = your_data.groupby(["column_one", "column_two", "etc.."])

grouping_variable.agg(aggregate_functions)
```
{{% /notice %}}
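A runnable sketch of this pattern on invented data (only `mean` and `median` are used here, since GroupBy has no built-in `mode` aggregation):

```python
import pandas as pd

# invented data: two towns, two fares each
data = pd.DataFrame({
    "town": ["S", "S", "C", "C"],
    "fare": [10.0, 20.0, 30.0, 50.0],
})

# hold the aggregations in a variable so the list is written only once
aggregate_functions = ["mean", "median"]

grouping_variable = data.groupby("town")
result = grouping_variable.agg(aggregate_functions)
```

The result has one row per town and a column MultiIndex pairing each original column with each aggregation, e.g. `("fare", "mean")`.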
79 changes: 79 additions & 0 deletions content/data-manipulation/reading/recoding-data/_index.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,79 @@
+++
title = "Recoding Data"
date = 2024-03-12T15:04:03-05:00
draft = false
weight = 2
+++

## Creating New Columns

As questions arise during data exploration and cleaning, you may want to investigate them without altering the original values. One way to keep the original data untouched is to add a new column that contains our manipulations.

{{% notice blue Example "rocket" %}}
```python
import pandas as pd

data["survived_reformatted"] = data["survived"].replace({0 : False, 1: True})
```

The above code accomplishes the following:
1. Imports pandas
1. Creates a new column called `survived_reformatted` from the `survived` column after replacing all `0` and `1` integers with `False` or `True`.

Viewing the output of the dataframe, you can see that a new column called `survived_reformatted` was created:

![Displaying the output of our dataframe using the data.head() function](pictures/survived-reformatted.png?classes=border)
{{% /notice %}}

## Replacing Values

Replacing values within a column to make them easier to work with is a common practice. In particular, strings are often converted to booleans, so that a "yes" or "no" becomes `True` or `False`. We can accomplish this with the `.replace()` function.

The following example replaces the data in place, modifying the existing column rather than creating a new one.

{{% notice blue Example "rocket" %}}
Replace the `0` and `1` integer values within the `survived` column of the titanic dataset with `True` or `False` by passing a dictionary as an argument to the `to_replace` parameter.

```python
import pandas as pd

data["survived"] = data["survived"].replace(to_replace={0: False, 1: True})
```
{{% /notice %}}
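A self-contained sketch of the same replacement on invented data:

```python
import pandas as pd

# invented stand-in for the titanic "survived" column
data = pd.DataFrame({"survived": [0, 1, 1, 0]})

# replace the integer codes with booleans, directly in the column
data["survived"] = data["survived"].replace(to_replace={0: False, 1: True})
```

After the replacement, the column holds boolean values rather than the original integer codes.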

## Using Functions to Manipulate Data

Creating a function to aggregate data or create new columns is another common practice in data analysis. pandas uses the `.apply()` method to execute a function on a pandas Series or DataFrame.

{{% notice blue Example "rocket" %}}
Suppose you wanted to know how many passengers under the age of 21 in the titanic dataset survived:

```python
import pandas as pd

data = pd.read_csv("titanic.csv")

def under_age_21_survivors(data):
    age = data['age']
    alive = data['alive']

    if age <= 20 and alive == "yes":
        return True
    else:
        return False

data["under_21_survivors"] = data.apply(under_age_21_survivors, axis=1)
print(data["under_21_survivors"].value_counts())
```

**Output**

![pandas function that applies conditional formatting to a dataframe checking if survivors under the age of 21 are still alive](pictures/under-age-21-survivors.png?classes=border)
{{% /notice %}}
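In case the screenshot above does not render, the same pattern can be reproduced with a handful of invented rows:

```python
import pandas as pd

# invented rows with the two columns the function inspects
data = pd.DataFrame({
    "age": [4.0, 35.0, 19.0, 54.0],
    "alive": ["yes", "no", "yes", "yes"],
})

def under_age_21_survivors(row):
    # True only when the passenger was 20 or younger and survived
    return row["age"] <= 20 and row["alive"] == "yes"

# axis=1 applies the function to each row rather than each column
data["under_21_survivors"] = data.apply(under_age_21_survivors, axis=1)
counts = data["under_21_survivors"].value_counts()
```

`value_counts()` then tallies how many rows met the condition and how many did not.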

## Summary

When recoding your data there are some things you should think about:
1. Does the original data need to remain intact?
1. Which data types should be replaced with new values, and what type should the new value be?
1. Would a function be useful for repetitive tasks and manipulations?