Merge pull request #22 from LaunchCodeEducation/data-manipulation
Data manipulation
jwoolbright23 authored Mar 21, 2024
2 parents e8f7bce + 37e71ce commit 2b931ba
Showing 19 changed files with 497 additions and 4 deletions.
4 changes: 2 additions & 2 deletions content/assignments/assignment4/_index.md
Original file line number Diff line number Diff line change
Expand Up @@ -12,9 +12,9 @@ is broken into five checkpoints and a final presentation.

<!-- TODO: Update these links as you work -->

1. [Selecting your business issue and dataset]({{< relref "./checkpoint-1" >}})
1. [Selecting your business issue and dataset]({{% relref "./checkpoint-1" %}})
1. [EDA]()
1. [Cleaning data]({{< relref "./checkpoint-3" >}})
1. [Cleaning data]({{% relref "./checkpoint-3" %}})
1. [Manipulate, interpret, and visualize data]()
1. [Modelling data]()
1. [Final Project Fair]()
Expand Down
2 changes: 1 addition & 1 deletion content/assignments/assignment4/checkpoint-1/_index.md
Original file line number Diff line number Diff line change
Expand Up @@ -99,4 +99,4 @@ word-processing program. Put your name in the right-hand corner and type up your
issue and provide the link to your chosen dataset. Submit your document on the Canvas
submission page for Graded Assignment #4: Checkpoint 1.

[Back to Final Project Overview]({{< relref "./../" >}})
[Back to Final Project Overview]({{% relref "./../" %}})
2 changes: 1 addition & 1 deletion content/assignments/assignment4/checkpoint-3/_index.md
Original file line number Diff line number Diff line change
Expand Up @@ -23,4 +23,4 @@ Checkpoint 3 examples can be found [here](https://github.com/LaunchCodeEducation

When finished cleaning your data, make sure to push your changes up to GitHub including your new cleaned dataset. Copy the link to your GitHub repository and paste it into the submission box in Canvas for Graded Assignment #4: Checkpoint 3 and click Submit.

[Back to Final Project Overview]({{< relref "./../" >}})
[Back to Final Project Overview]({{% relref "./../" %}})
37 changes: 37 additions & 0 deletions content/data-manipulation/_index.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,37 @@
+++
pre = "<b>16. </b>"
chapter = true
title = "Data Manipulation"
date = 2024-03-12T15:04:03-05:00
draft = false
weight = 16
+++

## Learning Objectives
After completing all of the content in this chapter, you should be able to do the following:
1. Aggregate data across multiple columns (mean, median, mode)
1. Append data: stack or concatenate multiple datasets with the `.concat` function
1. Recode and map values within a column to new values, including conditionally
1. Group data together with the `.groupby` function
1. Merge datasets together based on the provided columns or indices
1. Restructure data between long and wide formats with pandas melt and pivot functionality
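The last objective, reshaping between wide and long formats, can be previewed with a minimal sketch; the table and column names below are invented purely for illustration:

```python
import pandas as pd

# a small, invented wide-format table: one row per student,
# one column per subject
wide = pd.DataFrame({
    "student": ["Ann", "Ben"],
    "math": [90, 80],
    "science": [85, 95],
})

# wide -> long: melt() stacks the subject columns into rows
long_format = wide.melt(id_vars="student", var_name="subject", value_name="score")

# long -> wide: pivot() restores the original layout
restored = long_format.pivot(index="student", columns="subject", values="score")
```

Here `melt()` turns each subject column into its own row of `(student, subject, score)`, and `pivot()` reverses the reshaping.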

## Key Terminology

### Aggregation
1. GroupBy

### Recoding Data
1. `.replace()`
1. `.apply()`

### Reshaping Tables
1. `.melt()`
1. `.concat()`
1. `.sort_values()`
1. wide format
1. long format

## Content Links

{{% children %}}
23 changes: 23 additions & 0 deletions content/data-manipulation/exercises/_index.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,23 @@
+++
title = "Exercises"
date = 2021-10-01T09:28:27-05:00
draft = false
weight = 2
+++

## Getting Started

Open the notebooks in `data-analysis-projects/data-manipulation/exercises` in Jupyter Notebook.

## Code Along

Complete the notebooks in the GitHub repository you cloned in the following order:

1. `DataManipulationWorkbook`
1. `MergingTables`

## Submitting Your Work

When finished, make sure to push your changes up to GitHub.

Copy the link to your GitHub repository and paste it into the submission box in Canvas for **Exercises: Data Manipulation** and click *Submit*.
18 changes: 18 additions & 0 deletions content/data-manipulation/next-steps.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,18 @@
+++
title = "Next Steps"
date = 2021-10-01T09:28:27-05:00
draft = false
weight = 4
+++

Now that you have learned how to manipulate data, you are ready to dive into data visualization. If you want to review data manipulation with pandas before continuing onward, here are some of our favorite resources:

1. [Python | Pandas dataframe.aggregate()](https://www.geeksforgeeks.org/python-pandas-dataframe-aggregate/)
1. [Python | Pandas dataframe.groupby()](https://www.geeksforgeeks.org/python-pandas-dataframe-groupby/)
1. [How to create new columns derived from existing columns?](https://pandas.pydata.org/docs/getting_started/intro_tutorials/05_add_columns.html)
1. [Recode Data](https://pythonfordatascienceorg.wordpress.com/recode-data/)
1. [How to manipulate textual data?](https://pandas.pydata.org/docs/getting_started/intro_tutorials/10_text_data.html)
1. [How to reshape the layout of tables?](https://pandas.pydata.org/docs/getting_started/intro_tutorials/07_reshape_table_layout.html)
1. [pandas.pivot_table](https://pandas.pydata.org/docs/reference/api/pandas.pivot_table.html)
1. [pandas.melt](https://pandas.pydata.org/docs/reference/api/pandas.melt.html)
1. [pandas.pivot](https://pandas.pydata.org/docs/reference/api/pandas.pivot.html)
10 changes: 10 additions & 0 deletions content/data-manipulation/reading/_index.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,10 @@
+++
title = "Reading"
date = 2024-03-12T15:04:03-05:00
draft = false
weight = 1
+++

## Reading Content

{{% children %}}
97 changes: 97 additions & 0 deletions content/data-manipulation/reading/aggregation/_index.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,97 @@
+++
title = "Aggregation"
date = 2024-03-12T15:04:03-05:00
draft = false
weight = 1
+++

{{% notice blue Note "rocket" %}}
This reading, and the readings that follow, provide examples from the `titanic.csv` dataset, which will also be used in the exercise portion of this chapter.
{{% /notice %}}

## Groupby

The `.groupby()` function groups data together from one or more columns. As the data is grouped, it forms a new **GroupBy** object. The official [pandas documentation](https://pandas.pydata.org/pandas-docs/stable/user_guide/groupby.html) states that a "group by" accomplishes the following:
1. Splitting: Split the data based on the criteria provided.
1. Applying: Apply a function to each of the resulting groups.
1. Combining: Combine the results of the function into a new data structure.

### Syntax

Syntax for the `.groupby()` method when providing a single column as a parameter is as follows:

```python
grouping_variable = your_data.groupby("column-name")
```

{{% notice blue Example "rocket" %}}
Let's take things a step further and aggregate the grouped data using the `.sum()` function through method chaining:

```python
grouping_variable = your_data.groupby(["column_name"]).sum()
```

The above code groups the rows by each unique value in the provided column and returns the sum of the remaining numeric columns for each group.
{{% /notice %}}

The `.groupby()` method can take multiple columns as a parameter, but it is best practice to provide only as many columns as your analysis needs. As you increase the number of grouped columns, you also increase the compute power and memory required, which can lead to performance issues.

To group by multiple columns, pass a list of column names to the `.groupby()` method.

```python
grouping_variable = your_data.groupby(["column_one", "column_two", "etc.."])
```

{{% notice blue Example "rocket" %}}
Applying an aggregate function to multiple grouped columns can also be accomplished with method chaining. The following image uses columns from the titanic dataset as an example.

![Creating a new groupby object from the columns "embark_town" and "alone" and applying the sum aggregate function](pictures/grouped-titanic.png?classes=border)

The image below displays the output when applying the `.groupby()` method to only the `embark_town` column.

![Applying a groupby method to only the "embark_town" column within the titanic.csv dataset to view the output](pictures/groupby-embark-town.png?classes=border)

The key thing to note here is that grouping multiple columns together produces a dataset specific to that combination of columns. When the `embark_town` column was grouped with the `alone` column, the result was an aggregate of the entire dataset in relation to those two columns. When `embark_town` was grouped alone, the result was an aggregate of the entire dataset only as it relates to the `embark_town` column.
{{% /notice %}}
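As a minimal, self-contained sketch of the same comparison (the rows below are invented, but the column names mirror the titanic dataset):

```python
import pandas as pd

# invented rows whose column names mirror the titanic dataset
data = pd.DataFrame({
    "embark_town": ["Southampton", "Southampton", "Cherbourg", "Cherbourg"],
    "alone": [True, False, True, True],
    "fare": [7.25, 71.28, 8.05, 30.00],
})

# grouping by two columns aggregates per (embark_town, alone) pair
by_town_alone = data.groupby(["embark_town", "alone"]).sum()

# grouping by one column aggregates per embark_town only
by_town = data.groupby("embark_town").sum()
```

The two-column grouping keeps a separate fare total for passengers traveling alone versus accompanied, while the single-column grouping collapses each town into one row.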

## Aggregate Methods

pandas provides a built-in aggregate method, `.aggregate()`, along with its alias `.agg()` (both accomplish the same thing; `agg()` is simply shorter). The benefit of using `.aggregate()` is that it allows you to pass aggregate functions as a list.


{{% notice blue Example "rocket" %}}
```python
# select numeric columns; "mode" can return multiple values per column,
# so it is often computed separately with .mode()
data[["age", "fare"]].agg(["mean", "median"])
```
{{% /notice %}}

### Aggregation Using a Dictionary

pandas also allows you to provide a dictionary with columns as keys and aggregate functions as the associated values.

{{% notice blue Example "rocket" %}}
```python {linenos=table}
aggregate_dictionary_example = {
"embark_town": ["count"],
"age": ["count", "median"]
}

dictionary_aggregate = data.agg(aggregate_dictionary_example)
```

This dictionary acts as a template for the aggregations we want to perform. On its own, it does nothing; once passed to the `.agg()` method, it selects the specific columns and aggregations we want to examine, producing a subset table.
{{% /notice %}}
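A runnable sketch of the dictionary approach, using a few invented rows in place of the full titanic dataset:

```python
import pandas as pd

# invented rows standing in for the full titanic dataset
data = pd.DataFrame({
    "embark_town": ["Southampton", "Cherbourg", "Cherbourg"],
    "age": [22.0, 38.0, None],
})

# columns as keys, lists of aggregate functions as values
aggregate_dictionary_example = {
    "embark_town": ["count"],
    "age": ["count", "median"],
}

dictionary_aggregate = data.agg(aggregate_dictionary_example)
```

The result is a small table whose rows are the function names and whose columns are the requested columns; cells with no matching aggregation (such as a median of `embark_town`) are left as `NaN`. Note that `count` ignores missing values, so the `age` count is smaller than the row count here.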

## Groupby and Multiple Aggregations

A common strategy when applying multiple aggregations to a group or dataset is to hold the aggregation functions in a variable. The advantage is that you do not have to provide the list of functions as a parameter every time.

{{% notice blue Example "rocket" %}}
```python {linenos=table}
# GroupBy has no built-in "mode" aggregation, so a lambda stands in for it
aggregate_functions = ["mean", "median", lambda s: s.mode().iloc[0]]

grouping_variable = your_data.groupby(["column_one", "column_two", "etc.."])

grouping_variable.agg(aggregate_functions)
```
{{% /notice %}}
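A runnable sketch of this pattern on invented data (only `mean` and `median` are used here, since GroupBy has no built-in `mode` aggregation):

```python
import pandas as pd

# invented data: two towns, two fares each
data = pd.DataFrame({
    "town": ["S", "S", "C", "C"],
    "fare": [10.0, 20.0, 30.0, 50.0],
})

# hold the aggregations in a variable so the list is written only once
aggregate_functions = ["mean", "median"]

grouping_variable = data.groupby("town")
result = grouping_variable.agg(aggregate_functions)
```

The result has one row per town and a column MultiIndex pairing each original column with each aggregation, e.g. `("fare", "mean")`.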
79 changes: 79 additions & 0 deletions content/data-manipulation/reading/recoding-data/_index.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,79 @@
+++
title = "Recoding Data"
date = 2024-03-12T15:04:03-05:00
draft = false
weight = 2
+++

## Creating New Columns

As questions arise during data exploration and cleaning, you may want to investigate them without altering the original values. One way to keep the original data untouched is to add a new column that contains our manipulations.

{{% notice blue Example "rocket" %}}
```python
import pandas as pd

data["survived_reformatted"] = data["survived"].replace({0 : False, 1: True})
```

The above code accomplishes the following:
1. Imports pandas
1. Creates a new column called `survived_reformatted` from the `survived` column after replacing all `0` and `1` integers with `False` or `True`.

Viewing the output of the dataframe, you can see that a new column called `survived_reformatted` was created:

![Displaying the output of our dataframe using the data.head() function](pictures/survived-reformatted.png?classes=border)
{{% /notice %}}

## Replacing Values

Replacing values within a column to make them easier to work with is a common practice. In particular, strings are often converted to booleans, so that a "yes" or "no" becomes `True` or `False`. We can accomplish this with the `.replace()` function.

The following example replaces the data in place, modifying the existing column rather than creating a new one.

{{% notice blue Example "rocket" %}}
Replace the `0` and `1` integer values within the `survived` column of the titanic dataset with `True` or `False` by passing a dictionary as an argument to the `to_replace` parameter.

```python
import pandas as pd

data["survived"] = data["survived"].replace(to_replace={0: False, 1: True})
```
{{% /notice %}}
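A self-contained sketch of the same replacement on invented data:

```python
import pandas as pd

# invented stand-in for the titanic "survived" column
data = pd.DataFrame({"survived": [0, 1, 1, 0]})

# replace the integer codes with booleans, directly in the column
data["survived"] = data["survived"].replace(to_replace={0: False, 1: True})
```

After the replacement, the column holds boolean values rather than the original integer codes.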

## Using Functions to Manipulate Data

Creating a function to aggregate data or create new columns is another common practice in data analysis. pandas uses the `.apply()` method to execute a function on a pandas Series or DataFrame.

{{% notice blue Example "rocket" %}}
Suppose you wanted to know how many passengers under the age of 21 in the titanic dataset survived:

```python
import pandas as pd

data = pd.read_csv("titanic.csv")

def under_age_21_survivors(data):
    age = data['age']
    alive = data['alive']

    if age <= 20 and alive == "yes":
        return True
    else:
        return False

data["under_21_survivors"] = data.apply(under_age_21_survivors, axis=1)
print(data["under_21_survivors"].value_counts())
```

**Output**

![pandas function that applies conditional formatting to a dataframe checking if survivors under the age of 21 are still alive](pictures/under-age-21-survivors.png?classes=border)
{{% /notice %}}
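In case the screenshot above does not render, the same pattern can be reproduced with a handful of invented rows:

```python
import pandas as pd

# invented rows with the two columns the function inspects
data = pd.DataFrame({
    "age": [4.0, 35.0, 19.0, 54.0],
    "alive": ["yes", "no", "yes", "yes"],
})

def under_age_21_survivors(row):
    # True only when the passenger was 20 or younger and survived
    return row["age"] <= 20 and row["alive"] == "yes"

# axis=1 applies the function to each row rather than each column
data["under_21_survivors"] = data.apply(under_age_21_survivors, axis=1)
counts = data["under_21_survivors"].value_counts()
```

`value_counts()` then tallies how many rows met the condition and how many did not.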

## Summary

When recoding your data there are some things you should think about:
1. Does the original data need to remain intact?
1. Which data types should be replaced with new values, and what type should the new value be?
1. Would a function be useful for repetitive tasks and manipulations?