Welcome! This notebook is an unofficial resource created by the Data Science division.
It serves as an introduction to Python's widely used Pandas library for students who have taken Data 8. The functions introduced are analogous to those in Berkeley's datascience
module, with examples provided for each.
We will cover the following topics in this notebook:

2. [Dataframes: Working with Tabular Data](#dataframes)
   - [Creating a Dataframe](#creating)
   - [Accessing Values in Dataframe](#accessing)
   - [Manipulating Data](#manipulating)
3. [Visualizing Data](#visualizing)
   - [Histograms](#histograms)
   - [Line Plots](#line)
   - [Scatter Plots](#scatter)
   - [Bar Plots](#bar)
This notebook assumes familiarity with Python concepts, syntax, and data structures at the level of Data 8. For a brief refresher on some Python concepts, refer to the Python Basics Guide on GitHub.
Python has a great ecosystem of data-centric packages, which makes it excellent for data analysis. Pandas is one of those packages, and it makes importing and analyzing data much easier. Pandas builds on packages like NumPy and matplotlib to give us a single, convenient place to do most of our data analysis and visualization work.
It is useful to import certain packages into our workspace for analysis and data visualization. But first, we may need to install these packages if they are not already present. We do this via the command line as follows:

# run this cell to install packages
!pip install datascience
!pip install pandas
!pip install numpy
Once we have installed the required packages, we do not need to reinstall them when we start or reopen a Jupyter notebook. We can simply import them using the import keyword. Since we import Pandas as pd, we need to prefix all Pandas functions with pd, similar to how we prefix all NumPy functions with np (such as np.append()).
# run this cell to import the following packages
from datascience import * # import the datascience package
import pandas as pd # import the pandas library. pd is a common shorthand for pandas
import numpy as np # import numpy for working with numbers
In Python's datascience module, we used Table to build our dataframes and used commands such as select(), where(), group(), column(), etc. In this section, we will go over some basic commands for working with tabular data in Pandas.
Pandas introduces a data structure (the dataframe) that represents data as a table with columns and rows.
In Python's datascience module that is used in Data 8, this is how we created tables from scratch by extending an empty table:
t = Table().with_columns([
    'letter', ['a', 'b', 'c', 'z'],
    'count',  [ 9, 3, 3, 1],
    'points', [ 1, 2, 2, 10],
])
t
-
-
letter | count | points | -
---|---|---|
a | 9 | 1 | -
b | 3 | 2 | -
c | 3 | 2 | -
z | 1 | 10 | -
letter | count | points | +
a | 9 | 1 | +
b | 3 | 2 | +
c | 3 | 2 | +
z | 1 | 10 | +
In Pandas, we can use the function pd.DataFrame to initialize a dataframe from a dictionary or a list-like object. Refer to the documentation for more information.
# example: creating a dataframe from a dictionary
df_from_dict = pd.DataFrame({ 'letter' : ['a', 'b', 'c', 'z'],
                              'count'  : [ 9, 3, 3, 1],
                              'points' : [ 1, 2, 2, 10]
})
df_from_dict
 | letter | count | points
---|---|---|---
0 | a | 9 | 1
1 | b | 3 | 2
2 | c | 3 | 2
3 | z | 1 | 10
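pd.DataFrame can also be built from a list of rows rather than a dictionary of columns. A small sketch (the variable name df_from_rows is just illustrative):

```python
import pandas as pd

# each inner list is one row; columns= supplies the header
df_from_rows = pd.DataFrame(
    [['a', 9, 1],
     ['b', 3, 2],
     ['c', 3, 2],
     ['z', 1, 10]],
    columns=['letter', 'count', 'points'])
df_from_rows
```

This produces the same dataframe as the dictionary version above; which form is more convenient depends on whether your data arrives row-by-row or column-by-column.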
More often, we will need to create a dataframe by importing data from a .csv file. In datascience, this is how we read data from a csv:

datascience_baby = Table.read_table('data/baby.csv')
datascience_baby
Unnamed: 0 | Birth.Weight | Gestational.Days | Maternal.Age | Maternal.Height | Maternal.Pregnancy.Weight | Maternal.Smoker
---|---|---|---|---|---|---
1 | 120 | 284 | 27 | 62 | 100 | False
2 | 113 | 282 | 33 | 64 | 135 | False
3 | 128 | 279 | 28 | 64 | 115 | True
4 | 108 | 282 | 23 | 67 | 125 | True
5 | 136 | 286 | 25 | 62 | 93 | False
6 | 138 | 244 | 33 | 62 | 178 | False
7 | 132 | 245 | 23 | 65 | 140 | False
8 | 120 | 289 | 25 | 62 | 125 | False
9 | 143 | 299 | 30 | 66 | 136 | True
10 | 140 | 351 | 27 | 68 | 120 | False
... (1164 rows omitted)
In Pandas, we use pd.read_csv() to read data from a csv file. Sometimes, depending on the data file, we may need to specify the parameters sep, header, or encoding as well. For a full list of parameters, refer to this guide.
# reading baby.csv (located in current working directory)
baby = pd.read_csv('data/baby.csv')
baby.head() # display first few rows of dataframe
 | Unnamed: 0 | Birth.Weight | Gestational.Days | Maternal.Age | Maternal.Height | Maternal.Pregnancy.Weight | Maternal.Smoker
---|---|---|---|---|---|---|---
0 | 1 | 120 | 284 | 27 | 62 | 100 | False
1 | 2 | 113 | 282 | 33 | 64 | 135 | False
2 | 3 | 128 | 279 | 28 | 64 | 115 | True
3 | 4 | 108 | 282 | 23 | 67 | 125 | True
4 | 5 | 136 | 286 | 25 | 62 | 93 | False
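To show the sep and header parameters mentioned above in action without a real file, here is a sketch that reads a semicolon-separated string via io.StringIO (the data is made up). Passing index_col=0 to pd.read_csv is also a common way to avoid an "Unnamed: 0" column like the one seen above:

```python
import io
import pandas as pd

# a semicolon-separated "file" with no header row (made-up data)
raw = "a;9;1\nb;3;2\nc;3;2\n"
df = pd.read_csv(io.StringIO(raw), sep=';', header=None,
                 names=['letter', 'count', 'points'])
df
```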
# view summary of data
baby.describe()
 | Unnamed: 0 | Birth.Weight | Gestational.Days | Maternal.Age | Maternal.Height | Maternal.Pregnancy.Weight
---|---|---|---|---|---|---
count | 1174.000000 | 1174.000000 | 1174.000000 | 1174.000000 | 1174.000000 | 1174.000000
mean | 587.500000 | 119.462521 | 279.101363 | 27.228279 | 64.049404 | 128.478705
std | 339.048915 | 18.328671 | 16.010305 | 5.817839 | 2.526102 | 20.734282
min | 1.000000 | 55.000000 | 148.000000 | 15.000000 | 53.000000 | 87.000000
25% | 294.250000 | 108.000000 | 272.000000 | 23.000000 | 62.000000 | 114.250000
50% | 587.500000 | 120.000000 | 280.000000 | 26.000000 | 64.000000 | 125.000000
75% | 880.750000 | 131.000000 | 288.000000 | 31.000000 | 66.000000 | 139.000000
max | 1174.000000 | 176.000000 | 353.000000 | 45.000000 | 72.000000 | 250.000000
# example: loading csv from URL
sat = pd.read_csv('https://raw.githubusercontent.com/data-8/materials-sp18/master/lec/sat2014.csv')
sat.head()
 | State | Participation Rate | Critical Reading | Math | Writing | Combined
---|---|---|---|---|---|---
0 | North Dakota | 2.3 | 612 | 620 | 584 | 1816
1 | Illinois | 4.6 | 599 | 616 | 587 | 1802
2 | Iowa | 3.1 | 605 | 611 | 578 | 1794
3 | South Dakota | 2.9 | 604 | 609 | 579 | 1792
4 | Minnesota | 5.9 | 598 | 610 | 578 | 1786
# view information about dataframe
print(sat.shape) # view dimensions (rows, cols)
print(sat.columns.values) # view column names

(51, 6)
['State' 'Participation Rate' 'Critical Reading' 'Math' 'Writing' 'Combined']
In datascience, we can use column() to access values in a particular column as follows:
# access column 'letter'. returns array
t.column('letter')

array(['a', 'b', 'c', 'z'], dtype='<U1')
In Pandas, columns are also known as Series. We can access a Pandas series by using the square bracket notation.
# returns Series object
sat['State']

0             North Dakota
1                 Illinois
2                     Iowa
...
50    District of Columbia
Name: State, dtype: object
If we want a NumPy array of column values, we can use the values attribute of a Series object:
sat['State'].values

array(['North Dakota', 'Illinois', 'Iowa', 'South Dakota', 'Minnesota',
       'Michigan', 'Wisconsin', 'Missouri', 'Wyoming', 'Kansas',
       'Kentucky', 'Nebraska', 'Colorado', 'Mississippi', 'Tennessee',
       'Arkansas', 'Oklahoma', 'Utah', 'Louisiana', 'Ohio', 'Montana',
       'Alabama', 'New Mexico', 'New Hampshire', 'Massachusetts',
       'Vermont', 'Arizona', 'Oregon', 'Virginia', 'New Jersey',
       'Connecticut', 'West Virginia', 'Washington', 'California',
       'Alaska', 'North Carolina', 'Pennsylvania', 'Rhode Island',
       'Indiana', 'Maryland', 'New York', 'Hawaii', 'Nevada', 'Florida',
       'Georgia', 'South Carolina', 'Texas', 'Maine', 'Idaho', 'Delaware',
       'District of Columbia'], dtype=object)
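In newer pandas versions, Series.to_numpy() is the documented way to obtain the same NumPy array (.values still works). A minimal sketch with a small stand-in Series:

```python
import pandas as pd

s = pd.Series(['a', 'b', 'c', 'z'])  # stand-in for a column like sat['State']
arr = s.to_numpy()                   # same result as s.values
arr
```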
In datascience, we used take() to access a row in the Table:
# selecting first two rows using Python's slicing notation
t.take[0:2]
letter | count | points
---|---|---
a | 9 | 1
b | 3 | 2
In Pandas, we can access rows and columns by their position using iloc indexing. We specify the rows and columns we want with the following syntax: df.iloc[<rows>, <columns>]. For more information on indexing, refer to this guide.
# selecting first two rows using iloc
baby.iloc[0:2, :]
 | Unnamed: 0 | Birth.Weight | Gestational.Days | Maternal.Age | Maternal.Height | Maternal.Pregnancy.Weight | Maternal.Smoker
---|---|---|---|---|---|---|---
0 | 1 | 120 | 284 | 27 | 62 | 100 | False
1 | 2 | 113 | 282 | 33 | 64 | 135 | False
# specifying row indices
baby.iloc[[1, 4, 6], :]
 | Unnamed: 0 | Birth.Weight | Gestational.Days | Maternal.Age | Maternal.Height | Maternal.Pregnancy.Weight | Maternal.Smoker
---|---|---|---|---|---|---|---
1 | 2 | 113 | 282 | 33 | 64 | 135 | False
4 | 5 | 136 | 286 | 25 | 62 | 93 | False
6 | 7 | 132 | 245 | 23 | 65 | 140 | False
We can also access a specific value in the dataframe by passing in the row and column indices:

# get value in second row, third column
baby.iloc[1, 2]

282
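Alongside position-based iloc, pandas also offers label-based loc, which selects by index and column labels rather than integer positions. A sketch on a small made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'count': [9, 3], 'points': [1, 2]},
                  index=['a', 'b'])
# loc uses labels, not integer positions
df.loc['b', 'points']
```

This returns 2, the 'points' value in the row labeled 'b'.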
Adding Columns
Adding a new column in datascience is done with the with_column() function as follows:
t.with_column('vowel', ['yes', 'no', 'no', 'no']) # returns a new table; t itself is unchanged
t
letter | count | points
---|---|---
a | 9 | 1
b | 3 | 2
c | 3 | 2
z | 1 | 10
In Pandas, we can use the bracket notation and assign a list to add a new column to the dataframe as follows:
# adding a new column
df_from_dict['newcol'] = [5, 6, 7, 8]
df_from_dict
 | letter | count | points | newcol
---|---|---|---|---
0 | a | 9 | 1 | 5
1 | b | 3 | 2 | 6
2 | c | 3 | 2 | 7
3 | z | 1 | 10 | 8
We can also derive a new column from an existing one by performing an operation on it:
# adding count * 2 to the dataframe
df_from_dict['doublecount'] = df_from_dict['count'] * 2
df_from_dict
 | letter | count | points | newcol | doublecount
---|---|---|---|---|---
0 | a | 9 | 1 | 5 | 18
1 | b | 3 | 2 | 6 | 6
2 | c | 3 | 2 | 7 | 6
3 | z | 1 | 10 | 8 | 2
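A non-mutating alternative to bracket assignment is DataFrame.assign, which returns a new dataframe and leaves the original untouched — closer in spirit to with_column() in datascience. A sketch:

```python
import pandas as pd

df = pd.DataFrame({'count': [9, 3, 3, 1]})
df2 = df.assign(doublecount=df['count'] * 2)  # df itself is unchanged
df2
```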
Selecting Columns
In datascience, we used select() to subset the table by selecting columns:
t.select(['letter', 'points'])
letter | points
---|---
a | 1
b | 2
c | 2
z | 10
In Pandas, we use double bracket notation to select columns. This returns a dataframe, unlike the Series object returned by single bracket notation.
# double bracket notation for new dataframe
df_from_dict[['count', 'doublecount']]
 | count | doublecount
---|---|---
0 | 9 | 18
1 | 3 | 6
2 | 3 | 6
3 | 1 | 2
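The same column subset can also be taken with loc, using : to mean "all rows". A sketch on a small made-up frame:

```python
import pandas as pd

df = pd.DataFrame({'letter': ['a', 'b'], 'count': [9, 3], 'points': [1, 2]})
subset = df.loc[:, ['count', 'points']]  # all rows, the two named columns
subset
```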
Filtering Rows Conditionally
In datascience, we used where() to select rows according to a given condition:
t.where('points', 2) # rows where points == 2
letter | count | points
---|---|---
b | 3 | 2
c | 3 | 2
t.where(t['count'] < 8) # rows where count < 8
letter | count | points
---|---|---
b | 3 | 2
c | 3 | 2
z | 1 | 10
In Pandas, we can use the bracket notation to subset the dataframe based on a condition. We first specify a condition and then subset using the bracket notation:
# array of booleans
baby['Maternal.Smoker'] == True

0       False
1       False
2        True
...
Name: Maternal.Smoker, Length: 1174, dtype: bool
# filter rows by condition Maternal.Smoker == True
baby[baby['Maternal.Smoker'] == True]
 | Unnamed: 0 | Birth.Weight | Gestational.Days | Maternal.Age | Maternal.Height | Maternal.Pregnancy.Weight | Maternal.Smoker
---|---|---|---|---|---|---|---
2 | 3 | 128 | 279 | 28 | 64 | 115 | True
3 | 4 | 108 | 282 | 23 | 67 | 125 | True
8 | 9 | 143 | 299 | 30 | 66 | 136 | True
10 | 11 | 144 | 282 | 32 | 64 | 124 | True
11 | 12 | 141 | 279 | 23 | 63 | 128 | True
... | ... | ... | ... | ... | ... | ... | ...
1162 | 1163 | 143 | 281 | 28 | 65 | 135 | True
1163 | 1164 | 113 | 287 | 29 | 70 | 145 | True
1164 | 1165 | 109 | 244 | 21 | 63 | 102 | True
1165 | 1166 | 103 | 278 | 30 | 60 | 87 | True
1171 | 1172 | 130 | 291 | 30 | 65 | 150 | True

459 rows × 7 columns
# filtering with multiple conditions
df_from_dict[(df_from_dict['count'] < 8) & (df_from_dict['points'] > 5)]

 | letter | count | points | newcol | doublecount |
---|---|---|---|---|---|
3 | z | 1 | 10 | 8 | 2 |
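As an aside, the same filter can also be written with DataFrame.query(), which takes the condition as a string expression. A sketch on a hypothetical dataframe with the letter, count and points columns used above:

```python
import pandas as pd

# hypothetical dataframe mirroring some columns of df_from_dict
df = pd.DataFrame({'letter': ['a', 'b', 'c', 'z'],
                   'count': [9, 3, 3, 1],
                   'points': [1, 2, 2, 10]})

# boolean-mask version, as in the cell above
mask_result = df[(df['count'] < 8) & (df['points'] > 5)]

# query() version: the condition is a string, using 'and' instead of '&'
query_result = df.query('count < 8 and points > 5')

print(query_result)  # only the 'z' row satisfies both conditions
```

query() avoids repeating the dataframe name in every clause, which can make long conditions easier to read.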
Renaming Columns
In datascience, we used relabeled() to rename columns:
# rename 'points' to 'other name'
t.relabeled('points', 'other name')

letter | count | other name |
---|---|---|
a | 9 | 1 |
b | 3 | 2 |
c | 3 | 2 |
z | 1 | 10 |
Pandas uses rename(), whose columns parameter takes a dictionary mapping each name to be replaced to its replacement (the index parameter relabels rows in the same way):
# rename 'points' to 'other name'
df_from_dict.rename(columns = {"points" : "other name"})

 | letter | count | other name | newcol | doublecount |
---|---|---|---|---|---|
0 | a | 9 | 1 | 5 | 18 |
1 | b | 3 | 2 | 6 | 6 |
2 | c | 3 | 2 | 7 | 6 |
3 | z | 1 | 10 | 8 | 2 |
Sorting Dataframe by Column
In datascience, we used sort() to sort a Table according to the values in a column:
# sort by count
t.sort('count')

letter | count | points |
---|---|---|
z | 1 | 10 |
b | 3 | 2 |
c | 3 | 2 |
a | 9 | 1 |
In Pandas, we use sort_values() to sort by column. The by parameter specifies the column to sort by, and the optional parameter ascending = False sorts in descending order:
# sort by count, descending
df_from_dict.sort_values(by = ['count'], ascending = False)

 | letter | count | points | newcol | doublecount |
---|---|---|---|---|---|
0 | a | 9 | 1 | 5 | 18 |
1 | b | 3 | 2 | 6 | 6 |
2 | c | 3 | 2 | 7 | 6 |
3 | z | 1 | 10 | 8 | 2 |
Grouping and Aggregating
In datascience, we used group() and the collect argument to group a Table by a column and aggregate values in another column:
# group by count and aggregate by sum
t.select(['count', 'points']).group('count', collect=sum)

count | points sum |
---|---|
1 | 10 |
3 | 4 |
9 | 1 |
In Pandas, we use groupby() to group the dataframe. This function returns a GroupBy object, on which we can then call an aggregation function to return a dataframe with aggregated values for the other columns. For more information, refer to the documentation.
# selecting two columns for brevity
df_subset = df_from_dict[['count', 'points']]
df_subset

 | count | points |
---|---|---|
0 | 9 | 1 |
1 | 3 | 2 |
2 | 3 | 2 |
3 | 1 | 10 |
count_sums_df = df_subset.groupby(['count']).sum()
count_sums_df

 | points |
---|---|
count | |
1 | 10 |
3 | 4 |
9 | 1 |
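sum() is only one of the aggregations a GroupBy object supports; mean(), count(), min() and others work the same way, and agg() applies several at once. A brief sketch on a dataframe with the same count and points columns as df_subset:

```python
import pandas as pd

# hypothetical dataframe mirroring df_subset above
df = pd.DataFrame({'count': [9, 3, 3, 1], 'points': [1, 2, 2, 10]})

# one aggregation per call, as in the cell above...
sums = df.groupby('count').sum()

# ...or several at once with agg(); the result gets one column per
# (original column, aggregation) pair
stats = df.groupby('count').agg(['sum', 'mean'])
print(stats)
```

The agg() result uses a MultiIndex on the columns, so a value is looked up as, e.g., stats.loc[3, ('points', 'mean')].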
Pivot Tables
In datascience, we used the pivot() function to build contingency tables:
# creating new Table
cones_tbl = Table().with_columns(
    'Flavor', make_array('strawberry', 'chocolate', 'chocolate', 'strawberry', 'chocolate', 'bubblegum'),
    'Color', make_array('pink', 'light brown', 'dark brown', 'pink', 'dark brown', 'pink'),
    'Price', make_array(3.55, 4.75, 5.25, 5.25, 5.25, 4.75)
)
cones_tbl

Flavor | Color | Price |
---|---|---|
strawberry | pink | 3.55 |
chocolate | light brown | 4.75 |
chocolate | dark brown | 5.25 |
strawberry | pink | 5.25 |
chocolate | dark brown | 5.25 |
bubblegum | pink | 4.75 |
# pivoting on color and flavor
cones_tbl.pivot("Flavor", "Color")

Color | bubblegum | chocolate | strawberry |
---|---|---|---|
dark brown | 0 | 2 | 0 |
light brown | 0 | 1 | 0 |
pink | 1 | 0 | 2 |
We can also pass in the parameter values to specify the values in the table and collect to specify the aggregation function.
# setting parameters values and collect
cones_tbl.pivot("Flavor", "Color", values = "Price", collect = np.sum)

Color | bubblegum | chocolate | strawberry |
---|---|---|---|
dark brown | 0 | 10.5 | 0 |
light brown | 0 | 4.75 | 0 |
pink | 4.75 | 0 | 8.8 |
In Pandas, we use pd.pivot_table() to create a contingency table. The columns argument is similar to the first argument of datascience's pivot function and sets the column names of the pivot table. The index argument is similar to the second argument of datascience's pivot function and sets the first column of the pivot table, that is, the keys to group on. For more information, refer to the documentation.
# creating new dataframe
cones_df = pd.DataFrame({"Flavor" : ['strawberry', 'chocolate', 'chocolate', 'strawberry', 'chocolate', 'bubblegum'],
                         "Color" : ['pink', 'light brown', 'dark brown', 'pink', 'dark brown', 'pink'],
                         "Price" : [3.55, 4.75, 5.25, 5.25, 5.25, 4.75]})
cones_df

 | Flavor | Color | Price |
---|---|---|---|
0 | strawberry | pink | 3.55 |
1 | chocolate | light brown | 4.75 |
2 | chocolate | dark brown | 5.25 |
3 | strawberry | pink | 5.25 |
4 | chocolate | dark brown | 5.25 |
5 | bubblegum | pink | 4.75 |
# creating the pivot table
pd.pivot_table(cones_df, columns = ["Flavor"], index = ["Color"])

 | Price | | |
---|---|---|---|
Flavor | bubblegum | chocolate | strawberry |
Color | | | |
dark brown | NaN | 5.25 | NaN |
light brown | NaN | 4.75 | NaN |
pink | 4.75 | NaN | 4.4 |
If there is no data in the groups, then Pandas will output NaN values. We can also specify parameters like values (equivalent to values in datascience's pivot) and aggfunc (equivalent to collect in datascience's pivot).
# additional arguments
pd.pivot_table(cones_df, columns = ["Flavor"], index = ["Color"], values = "Price", aggfunc=np.sum)

Flavor | bubblegum | chocolate | strawberry |
---|---|---|---|
Color | | | |
dark brown | NaN | 10.50 | NaN |
light brown | NaN | 4.75 | NaN |
pink | 4.75 | NaN | 8.8 |
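To match datascience's pivot() output, which shows 0 rather than NaN for empty groups, pivot_table() also accepts a fill_value parameter. A sketch rebuilding cones_df (aggfunc is passed as the string "sum" here, which newer pandas versions accept in place of np.sum):

```python
import pandas as pd

# rebuilding the cones dataframe from earlier in the notebook
cones_df = pd.DataFrame({"Flavor": ['strawberry', 'chocolate', 'chocolate',
                                    'strawberry', 'chocolate', 'bubblegum'],
                         "Color": ['pink', 'light brown', 'dark brown',
                                   'pink', 'dark brown', 'pink'],
                         "Price": [3.55, 4.75, 5.25, 5.25, 5.25, 4.75]})

# fill_value replaces the NaNs in empty groups with 0
pivot = pd.pivot_table(cones_df, columns=["Flavor"], index=["Color"],
                       values="Price", aggfunc="sum", fill_value=0)
print(pivot)
```

The resulting table has the same sums as above (e.g. 8.8 for pink strawberry) but zeros where no cone of that color/flavor combination exists.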
Joining/Merging
In datascience, we used join() to join two tables based on shared values in columns. We specify the column name in the first table to match on, the name of the second table, and the column name in the second table to match on.
# creating new table
ratings_tbl = Table().with_columns(
    'Kind', make_array('strawberry', 'chocolate', 'vanilla'),
    'Stars', make_array(2.5, 3.5, 4)
)
ratings_tbl

Kind | Stars |
---|---|
strawberry | 2.5 |
chocolate | 3.5 |
vanilla | 4 |
# joining cones and ratings
cones_tbl.join("Flavor", ratings_tbl, "Kind")

Flavor | Color | Price | Stars |
---|---|---|---|
chocolate | light brown | 4.75 | 3.5 |
chocolate | dark brown | 5.25 | 3.5 |
chocolate | dark brown | 5.25 | 3.5 |
strawberry | pink | 3.55 | 2.5 |
strawberry | pink | 5.25 | 2.5 |
In Pandas, we can use the merge() function to join two tables together. The first parameter is the dataframe to join with; the parameters left_on and right_on specify the columns to match on in the left and right tables respectively. There are more parameters, such as how, which specifies the kind of join to perform (inner (default), outer, left or right). For more information, refer to this Kaggle Tutorial.
# creating new ratings df
ratings_df = pd.DataFrame({"Kind" : ['strawberry', 'chocolate', 'vanilla'],
                           "Stars" : [2.5, 3.5, 4]})
ratings_df

 | Kind | Stars |
---|---|---|
0 | strawberry | 2.5 |
1 | chocolate | 3.5 |
2 | vanilla | 4.0 |
# merging cones and ratings
cones_df.merge(ratings_df, left_on = "Flavor", right_on = "Kind")

 | Flavor | Color | Price | Kind | Stars |
---|---|---|---|---|---|
0 | strawberry | pink | 3.55 | strawberry | 2.5 |
1 | strawberry | pink | 5.25 | strawberry | 2.5 |
2 | chocolate | light brown | 4.75 | chocolate | 3.5 |
3 | chocolate | dark brown | 5.25 | chocolate | 3.5 |
4 | chocolate | dark brown | 5.25 | chocolate | 3.5 |
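Notice that the default inner join drops bubblegum (no rating) and vanilla (no cone). A hedged sketch with trimmed, hypothetical versions of the two dataframes, showing how="outer" keeping unmatched rows and indicator=True reporting each row's origin:

```python
import pandas as pd

# trimmed versions of the cones and ratings dataframes, for illustration
cones_df = pd.DataFrame({"Flavor": ['strawberry', 'chocolate', 'bubblegum'],
                         "Price": [3.55, 4.75, 4.75]})
ratings_df = pd.DataFrame({"Kind": ['strawberry', 'chocolate', 'vanilla'],
                           "Stars": [2.5, 3.5, 4]})

# an outer join keeps rows with no match on either side (NaN fills the gaps);
# indicator=True adds a '_merge' column showing where each row came from
merged = cones_df.merge(ratings_df, left_on="Flavor", right_on="Kind",
                        how="outer", indicator=True)
print(merged)
```

Here bubblegum appears as left_only and vanilla as right_only, which makes it easy to audit which rows an inner join would have silently discarded.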
In datascience, we learned to plot data using histograms, line plots, scatter plots and bar plots. The corresponding functions were hist(), plot(), scatter() and barh(). Plotting methods in Pandas are nearly identical to datascience's, since both build on the matplotlib library. In this section we will go through examples of such plots in Pandas.

In datascience, we used hist() to create a histogram. In this example, we will be using data from baby.csv. Recall that the baby data set contains data on a random sample of 1,174 mothers and their newborn babies. The column Birth.Weight contains the birth weight of the baby, in ounces; Gestational.Days is the number of gestational days, that is, the number of days the baby was in the womb. There is also data on maternal age, maternal height, maternal pregnancy weight, and whether or not the mother was a smoker.
# importing matplotlib for plotting
import matplotlib
%matplotlib inline
# reading in the data
datascience_baby = Table.read_table('data/baby.csv')
datascience_baby

Unnamed: 0 | Birth.Weight | Gestational.Days | Maternal.Age | Maternal.Height | Maternal.Pregnancy.Weight | Maternal.Smoker |
---|---|---|---|---|---|---|
1 | 120 | 284 | 27 | 62 | 100 | False |
2 | 113 | 282 | 33 | 64 | 135 | False |
3 | 128 | 279 | 28 | 64 | 115 | True |
4 | 108 | 282 | 23 | 67 | 125 | True |
5 | 136 | 286 | 25 | 62 | 93 | False |
6 | 138 | 244 | 33 | 62 | 178 | False |
7 | 132 | 245 | 23 | 65 | 140 | False |
8 | 120 | 289 | 25 | 62 | 125 | False |
9 | 143 | 299 | 30 | 66 | 136 | True |
10 | 140 | 351 | 27 | 68 | 120 | False |

... (1164 rows omitted)
# creating a histogram
datascience_baby.hist('Birth.Weight')
In Pandas, we use hist() to create histograms, just like datascience. Refer to the documentation for a full list of parameters.
# creating a histogram
baby.hist('Birth.Weight')

array([[<Axes: title={'center': 'Birth.Weight'}>]], dtype=object)
# line plot in datascience
census_tbl = Table.read_table("https://raw.githubusercontent.com/data-8/materials-x18/master/lec/x18/1/census.csv").select(['SEX', 'AGE', 'POPESTIMATE2014'])
children_tbl = census_tbl.where('SEX', are.equal_to(0)).where('AGE', are.below(19)).drop('SEX')
children_tbl.plot('AGE')
In Pandas, we can use plot.line() to create line plots. For a full list of parameters, refer to the documentation.
# Pandas
census_df = pd.read_csv("https://raw.githubusercontent.com/data-8/materials-x18/master/lec/x18/1/census.csv")[["SEX", "AGE", "POPESTIMATE2014"]]
children_df = census_df[(census_df.SEX == 0) & (census_df.AGE < 19)].drop("SEX", axis=1)
children_df.plot.line(x="AGE", y="POPESTIMATE2014")

<Axes: xlabel='AGE'>