Coding for Data Analysis with Python

Introduction to Data Analysis with Python - lecture materials by Peter Duronelly and Ádám Vig with Ágoston Reguly (Georgia Tech) and Gábor Békés (CEU, KRTK, CEPR)

This course material is a supplement to Data Analysis for Business, Economics, and Policy by Gábor Békés (CEU) and Gábor Kézdi (U. Michigan), Cambridge University Press, 2021.

Textbook information: see the textbook's website gabors-data-analysis.com or visit Cambridge University Press

To get a copy: Inspection copy for instructors or buy from Amazon or order online around the globe

Acknowledgments

We would like to thank Ágoston Reguly, who created the template for the Coding for Data Analysis series. We followed his steps in writing the Python version of the teaching material.

We thank the CEU Department of Economics and Business for financial support.

Status

This is version 0.1, as of 13 September, 2022.

Comments are really welcome -- just add a GitHub issue.

Overview

The course is an introduction to the Python programming language and its software environment, and to data exploration, data transformation, visualization, and more advanced data analysis. The idea is that people learn to work with Python while learning to carry out data analysis.

The material primarily consists of Jupyter notebooks and is sometimes supplemented with additional data. In most cases, however, we used the textbook's datasets to bring the course as close to the original textbook as possible.

Lectures 0 to 9 mostly complement Part I: Data Exploration (Chapters 1-6). Lectures 0 to 6 are general introductions to Python and its concepts. These notebooks focus on coding principles and Python's main building blocks, and introduce the data analyst's most important data structure: Pandas dataframes. Lecture 7 shows how to use Python for data exploration. Lectures 8 and 9 expand the toolkit for advanced data analytics techniques.

Lectures 10 to 17 complement Part II: Regression Analysis (Chapters 7-12) and cover everything you need to know about linear regression in Python at an introductory level. We start with simple linear regression on cross-sectional data, then explore binary models and multiple linear regression. Finally, we discuss basic time-series regression models and spatial data visualization.

Lectures 18 to 25 complement Part III: Prediction (Chapters 13-18). These lectures are not intended to be part of an introductory Python course, but rather a more advanced seminar that supports Data Analysis with machine learning tools for prediction. In this seminar-style course, students cover topics such as model selection with cross-validation; LASSO, Ridge, and Elastic Net regularization; regression trees with CART; random forest; and boosting. These methods are applied to cross-sectional data, mainly with continuous outcomes, and also to binary outcomes to model probabilities and handle classification problems. Time series forecasting over the long run and the short run, including ARIMA and VAR models, is also covered. To understand this material properly, you should first complete coding lectures 1 to 17.

Philosophy and how to use

We tried to put together a benchmark course to supplement the Data Analysis textbook and to help anyone, students and instructors alike, follow the book's material. Anyone is free to use the notebooks in their current or in any modified form, with proper reference to the original material.

While we teach the basics of Python, this is not classical coding course material. The notebooks take the reader through the data analysis workflow of the first 12 chapters of the textbook, providing assistance in Python along the way. You will gradually learn what is needed to carry out analytical steps, from loading data to running regressions. We will suggest additional resources to learn more coding tools and enhance your skills.

It is possible to learn the very basics of Python using these notebooks, but simply completing the exercises won't make anyone a programmer. Using the codebase and the textbook together, however, does help in understanding statistical and data analytics concepts and in seeing the theory in practice.

The lectures are pre-written so that an educated reader can follow and understand them. Nevertheless, instructors may want to modify and tailor the code to their own teaching habits and philosophy. Homework assignments are not part of the codebase, which leaves instructors another task for the practical coding sessions of their data analytics courses.

The material focuses on the manipulation and analysis of tabular data. Pandas dataframes provide most of the tools for these manipulation exercises, and we use the statsmodels package for running linear regressions. For data visualization, we added a basic intro to matplotlib, the most popular package, but rely heavily on a new favourite: plotnine, the Python implementation of R's ggplot.
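As a rough illustration of this workflow, the sketch below builds a small dataframe, applies a few typical Pandas steps, and fits a simple linear regression with statsmodels. It is not taken from the notebooks: the data are synthetic, and the names `hotels`, `distance`, `price`, and `log_price` are made up for the example.

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Synthetic data (an assumption for illustration, not a textbook dataset):
# price falls by about 20 units per unit of distance, plus random noise.
rng = np.random.default_rng(seed=42)
distance = rng.uniform(0, 5, size=200)
price = 150 - 20 * distance + rng.normal(0, 5, size=200)
hotels = pd.DataFrame({"distance": distance, "price": price})

# Typical tabular manipulation: filter rows, add a new variable, sort.
close_hotels = (
    hotels[hotels["distance"] < 4]
    .assign(log_price=lambda d: np.log(d["price"]))
    .sort_values("distance")
)

# Simple linear regression (OLS) of price on distance.
model = smf.ols("price ~ distance", data=close_hotels).fit()
print(model.params)
```

The formula interface of statsmodels is used here because it reads much like regression notation; whether the notebooks use the same interface is not stated in this README.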

Course content

Lecture	Learning outcomes	Case study	Dataset
lecture00-intro	basic terminology; Jupyter notebooks; how to set up the environment	-	-
lecture01-coding-basics	coding basics; basic variable types	-	-
lecture02-basic-structures	basic data structures: lists, tuples, sets, dictionaries; working with modules	-	-
lecture03-data-IO	how to read and write files; navigating the file system	-	-
lecture04-pandas-basics	pandas dataframe basics; how to manipulate tabular data with Pandas: conversion, filtering, replacing values, adding new variables, sorting; using pipes	Ch03: Finding a good deal among hotels: data exploration	hotels-vienna
lecture05-graphs-basics	matplotlib basics; deep dive into plotnine	Ch03: Finding a good deal among hotels: data exploration	stocks-sp500; hotels-vienna
lecture06-conditionals	conditional statements; for loop, while loop, list comprehension	-	-
lecture07-data-exploration	descriptive statistics, customized plots, hypothesis testing, correlation & association	CH01B Comparing online and offline prices	billion-prices
lecture08-functions	user-defined functions & lambda functions	-	-
lecture09-exception-handling	try-except	-	-
lecture10-intro-to-regression	binary means, non-parametric regression (lowess), simple linear regression (OLS), analysis of results, log and non-linear regressions	CH07A Finding a good deal among hotels with simple regression, CH08A Finding a good deal among hotels with non-linear function	hotels-vienna
lecture11-feature-engineering	creating new variables from existing ones, creating ordered variables, factors/dummy variables, imputing, randomizing, log transformation, winsorizing	Ch04A Management quality and firm size; CH17A Predicting firm exit	wms-management-survey; bisnode-firms
lecture12-simple-linear-regression	EDA, how to decide on functional form, models with logs and polynomials, piecewise linear spline, weighted OLS, residual analysis	CH08B How is life expectancy related to the average income of a country?	worldbank-lifeexpectancy
lecture13-advanced-linear-regression	multiple linear regression, how to choose from models, confidence interval, prediction interval, robustness & external validity, training and test sample	Ch03B Comparing hotel prices in Europe: Vienna vs London, CH09B How stable is the hotel price–distance to center relationship?	hotels-europe
lecture14-binary-models	saturated probability models, multiple regression with binary outcomes, logit & probit models, comparison of non-linear models	CH11A Does smoking pose a health risk?	share-health
lecture15-datetime	handling date and time, time zone awareness, datetime in Pandas	-	-
lecture16-timeseries-regression	manipulation of time-series data, autocorrelation, Newey-West standard errors, lagged variables	CH12B Electricity consumption and temperature	arizona-electricity
lecture17-basic-spatial-viz	Introduction to spatial visualization. How to create a world map and show life expectancy, or color average hotel prices for London boroughs or Vienna districts. Handling maps via `geom_polygon` and setting the scaling, colors, etc.	Ch08B: Life expectancy*, Ch03B: Compare hotel prices Vienna vs London*	worldbank-lifeexpectancy, hotels-europe
lecture18-cross-validation	Model comparison introduced by BIC and RMSE. Limitations of these comparisons. Cross-validation: using different samples to tackle overfitting.	Ch13A Predicting used car value with linear regressions and Ch14A Predicting used car value: log prices	used-cars
lecture19-lasso	Feature engineering for LASSO: interactions and polynomials. Cross-validation in detail. LASSO (and RIDGE, Elastic Net). Training-test samples and the holdout sample to evaluate predictions. LASSO diagnostics.	Ch14B Predicting AirBnB apartment prices: selecting a regression model	airbnb
lecture20-regression-tree	Estimating regression tree. Understanding regression trees and comparing them to linear regressions. Tuning and setup of CART. Tree and variable importance plots.	CH15A Predicting used car value with regression trees	used-cars
lecture21-random-forest	Data cleaning and feature engineering specifics for random forest (RF). Estimate RFs. Examine the results of RFs with variable importance plots, and partial dependence plots, and check the quality of predictions in (important) subgroups. Extreme Gradient Boosting Method (GBM) via `xgboost` package. Prediction comparisons (prediction horse-race) for OLS, LASSO, CART, RF, and XGBM.	Ch16A Predicting apartment prices with random forest	airbnb
lecture22-classification	Predicting probabilities and classification with machine learning tools. Cross validated logit models. LASSO with logit, CART, and Random Forest. Classification of probabilities, ROC curve, and AUC. Confusion Matrix. Model comparison via RMSE or AUC. User-defined loss function to weight false-positive and false-negative rate. Optimizing threshold value for classification to get best loss function value.	CH17A Predicting firm exit: probability and classification	bisnode-firms
lecture23-long-term-time-series	Forecasting time series data on the long run. Feature engineering with time series, deciding transformations for stationarity. Cross-validation options with time series. Modeling with deterministic trend, seasonality and other dummy variables for long term horizon. Evaluation of model and forecast precision. `prophet` as machine learning tool for time series data.	Ch18A Forecasting daily ticket sales for a swimming pool	swim-transactions
lecture24-short-term-time-series	Forecasting time series data on the short run. Feature engineering with time series, deciding transformations for stationarity. Cross-validation options with time series. ARIMA and VAR models for short term forecasting. Evaluation of forecasts on short run: performance on hold out set, fan-chart to assess risks and stability of forecasting performance on an extended time period.	CH18B Forecasting a house price index	case-shiller-la
lecture25-matplotlib-vs-plotnine	Same graphs in two separate notebooks, to show that exactly the same graphs can be created with `matplotlib` (and with its high-level interface, `seaborn`) and with `plotnine`.	CH08B How is life expectancy related to the average income of a country?	worldbank-lifeexpectancy

Technical Note: environment

Most data science courses use the Anaconda environment for Python. However, we use pip and pipenv, and run Jupyter notebooks from the course's environment. Anaconda is a great tool for data analysis and data science, but once someone goes beyond ad-hoc data analysis and needs to develop and deploy advanced data solutions in a production environment in Python, pip is going to be the way to go.
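If you are unsure whether a notebook is really running from the course's environment, a quick check from inside Python can help. The following is a minimal sketch that uses only the standard library; the helper names `in_virtualenv` and `installed_version` are hypothetical, and the package list is an assumption based on the tools named in this README rather than a copy of the Pipfile.

```python
import sys
from importlib import metadata


def in_virtualenv() -> bool:
    """Return True if the interpreter runs inside a virtual environment."""
    return sys.prefix != getattr(sys, "base_prefix", sys.prefix)


def installed_version(package: str):
    """Return the installed version of a package, or None if it is missing."""
    try:
        return metadata.version(package)
    except metadata.PackageNotFoundError:
        return None


print("virtual environment active:", in_virtualenv())
# Assumed package list for illustration; adjust it to the course's Pipfile.
for name in ("pandas", "statsmodels", "plotnine"):
    print(name, installed_version(name) or "not installed")
```

Running this in the first cell of a notebook shows whether the kernel uses a virtual environment and which of the listed packages it can see.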
