Introduction to Data Analysis with Python - lecture materials by Peter Duronelly and Ádám Vig with Ágoston Reguly (Georgia Tech) and Gábor Békés (CEU, KRTK, CEPR)
This course material is a supplement to Data Analysis for Business, Economics, and Policy by Gábor Békés (CEU) and Gábor Kézdi (U. Michigan), Cambridge University Press, 2021.
Textbook information: see the textbook's website gabors-data-analysis.com or visit Cambridge University Press
To get a copy: request an inspection copy (for instructors), buy from Amazon, or order online around the globe
We thank Ágoston Reguly, who created the template for the Coding for Data Analysis series. We followed his steps in writing the Python version of the teaching material.
We thank the CEU Department of Economics and Business for financial support.
This is version 0.1, as of 13 September, 2022.
Comments are very welcome -- just open a GitHub issue.
The course is an introduction to the Python programming language and its software environment, as well as to data exploration, data transformation, visualization, and more advanced data analysis. The idea is that readers learn to work with Python while learning to carry out data analysis.
The material primarily consists of Jupyter notebooks and is sometimes supplemented with additional data. In most cases, however, we used the textbook's datasets to bring the course as close to the original textbook as possible.
Lectures 0 to 9 mostly complement Part I: Data Exploration (Chapters 1-6). Lectures 0 to 6 are general introductions to Python and its concepts. These notebooks focus on coding principles and Python's main building blocks, and introduce the data analyst's most important data structure: the Pandas dataframe. Lecture 7 shows how to use Python for data exploration. Lectures 8 and 9 expand the toolkit with more advanced data analytics techniques.
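As a flavour of what these early lectures build up to, here is a minimal Pandas sketch. The data and column names are made-up placeholders for illustration, not the textbook's actual variables:

```python
import numpy as np
import pandas as pd

# Illustrative data; column names are placeholders, not the hotels-vienna variables
df = pd.DataFrame(
    {
        "hotel_id": [1, 2, 3, 4],
        "price": [81, 110, 145, 72],
        "distance": [1.7, 0.8, 2.5, 3.1],
    }
)

df["log_price"] = np.log(df["price"])     # new variable from an existing one
close_hotels = df[df["distance"] < 2]     # filtering with a boolean mask
print(close_hotels.sort_values("price"))  # sorting
print(df["price"].describe())             # basic descriptive statistics
```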
Lectures 10 to 17 complement Part II: Regression Analysis (Chapters 7-12) and cover everything you need to know about linear regression in Python at an introductory level. We start with simple linear regression on cross-sectional data, then explore binary models and multiple linear regression. Finally, we discuss basic time-series regression models and spatial data visualization.
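For illustration, here is a minimal level-log regression with statsmodels' formula API, in the spirit of these lectures. The data is simulated and the variable names are placeholders, not the textbook's datasets:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated data standing in for a country-level income / life expectancy table
rng = np.random.default_rng(42)
income = rng.uniform(1_000, 50_000, size=200)
lifeexp = 50 + 5 * np.log(income) + rng.normal(0, 2, size=200)
df = pd.DataFrame({"income": income, "lifeexp": lifeexp})

# Level-log specification: life expectancy regressed on log income
model = smf.ols("lifeexp ~ np.log(income)", data=df).fit()
print(model.summary())
```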
Lectures 18 to 25 complement Part III: Prediction (Chapters 13-18). These lectures are not intended to be part of an introductory Python course, but rather a more advanced seminar to support Data Analysis with machine learning tools for prediction. In this seminar-style course, students cover topics such as model selection with cross-validation, LASSO, RIDGE, and Elastic Net regularization, regression trees with CART, random forest, and boosting. These methods are applied to cross-sectional data with continuous outcomes, and to binary outcomes to model probabilities and handle classification problems. Long-run and short-run time series modeling via ARIMA and VAR models is also covered. To properly understand this material, complete coding lectures 1 to 17 first.
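As a taste of the Part III toolkit, here is a minimal cross-validated LASSO sketch. It assumes scikit-learn and simulated data, and is only an illustration of the technique, not the notebooks' actual setup:

```python
import numpy as np
from sklearn.linear_model import LassoCV
from sklearn.model_selection import train_test_split

# Simulated data: 20 candidate features, only two of which matter
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = 3 * X[:, 0] - 2 * X[:, 1] + rng.normal(size=500)

# Hold out a test set, then let 5-fold cross-validation pick the penalty
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
lasso = LassoCV(cv=5).fit(X_train, y_train)

print("chosen penalty (alpha):", lasso.alpha_)
print("test-set R-squared:", lasso.score(X_test, y_test))
```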
We tried to put together a benchmark course to supplement the Data Analysis textbook and to help anyone, students and instructors alike, follow the book's material. Anyone is free to use the notebooks in their current or in any modified form, with proper reference to the original material.
While we teach the basics of Python, this is not classical coding course material. The notebooks take the reader through the data analysis workflow of the first 12 chapters of the textbook, providing assistance in Python along the way. You will gradually learn what is needed to carry out analytical steps from loading data to running regressions. We also suggest additional resources to learn more coding tools and enhance your skills.
It is possible to learn the very basics of Python from these notebooks, but simply completing the exercises won't make anyone a programmer. Using the codebase and the textbook together, however, does help in understanding statistical and data analytics concepts and seeing the theory in practice.
The lectures are pre-written so that an educated reader can follow and understand them. Nevertheless, instructors may want to modify and tailor the code to their own teaching habits and philosophy. Homework assignments are not part of the codebase, leaving that task to instructors for the practical coding sessions of their data analytics courses.
The material's focus is the manipulation and analysis of tabular data. Pandas dataframes provide most of the tools for these manipulation exercises, and we use the statsmodels package for running linear regressions. As for data visualization, we added a basic intro to the most popular matplotlib package but rely heavily on a new favourite: plotnine, the Python implementation of R's ggplot, for visualization and graphical representation.
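A minimal plotnine example in the ggplot style used throughout the notebooks (simulated data; the lectures build similar figures from the textbook's datasets):

```python
import numpy as np
import pandas as pd
from plotnine import aes, geom_point, geom_smooth, ggplot, labs

# Simulated price-distance data for illustration
rng = np.random.default_rng(1)
df = pd.DataFrame({"distance": rng.uniform(0, 10, 100)})
df["price"] = 200 - 12 * df["distance"] + rng.normal(0, 20, 100)

plot = (
    ggplot(df, aes(x="distance", y="price"))
    + geom_point(alpha=0.5)
    + geom_smooth(method="lm")   # linear fit layered on the scatter
    + labs(x="Distance to city center", y="Price")
)
print(plot)  # in a Jupyter notebook, the plot object renders on its own
```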
Lecture | Learning outcomes | Case study | Dataset |
---|---|---|---|
lecture00-intro | basic terminology; Jupyter notebooks; how to set up the environment | - | - |
lecture01-coding-basics | coding basics; basic variable types | - | - |
lecture02-basic-structures | basic data structures: lists, tuples, sets, dictionaries; working with modules | - | - |
lecture03-data-IO | how to read and write files; navigating the file system | - | - |
lecture04-pandas-basics | pandas dataframe basics; how to manipulate tabular data with Pandas: conversion, filtering, replacing values, adding new variables, sorting; using pipes | Ch03: Finding a good deal among hotels: data exploration | hotels-vienna |
lecture05-graphs-basics | matplotlib basics; deep dive into plotnine | Ch03: Finding a good deal among hotels: data exploration | stocks-sp500; hotels-vienna |
lecture06-conditionals | conditional statements; for loop, while loop, list comprehension | - | - |
lecture07-data-exploration | descriptive statistics, customized plots, hypothesis testing, correlation & association | CH01B Comparing online and offline prices | billion-prices |
lecture08-functions | user-defined functions & lambda functions | - | - |
lecture09-exception-handling | try-except | - | - |
lecture10-intro-to-regression | binary means, non-parametric regression (lowess), simple linear regression (OLS), analysis of results, log and non-linear regressions | CH07A Finding a good deal among hotels with simple regression, CH08A Finding a good deal among hotels with non-linear function | hotels-vienna |
lecture11-feature-engineering | creating new variables from existing ones, creating ordered variables, factors/dummy variables, imputing, randomizing, log transformation, winsorizing | Ch04A Management quality and firm size; CH17A Predicting firm exit | wms-management-survey; bisnode-firms |
lecture12-simple-linear-regression | EDA, how to decide on functional form, models with logs and polynomials, piecewise linear spline, weighted OLS, residual analysis | CH08B How is life expectancy related to the average income of a country? | worldbank-lifeexpectancy |
lecture13-advanced-linear-regression | multiple linear regression, how to choose from models, confidence interval, prediction interval, robustness & external validity, training and test sample | Ch03B Comparing hotel prices in Europe: Vienna vs London, CH09B How stable is the hotel price–distance to center relationship? | hotels-europe |
lecture14-binary-models | saturated probability models, multiple regression with binary outcomes, logit & probit models, comparison of non-linear models (see the sketch after the table) | CH11A Does smoking pose a health risk? | share-health |
lecture15-datetime | handling date and time, time zone awareness, datetime in Pandas | - | - |
lecture16-timeseries-regression | manipulation of time-series data, autocorrelation, Newey-West standard errors, lagged variables | CH12B Electricity consumption and temperature | arizona-electricity |
lecture17-basic-spatial-viz | Introduction to spatial visualization: creating a world map of life expectancy and coloring average hotel prices for London boroughs and Vienna districts. Handling maps via geom_polygon and setting scaling, colors, etc. | Ch08B: Life expectancy*, Ch03B: Compare hotel prices Vienna vs London* | worldbank-lifeexpectancy, hotels-europe |
lecture18-cross-validation | Model comparison with BIC and RMSE and the limitations of these comparisons. Cross-validation: using different samples to tackle overfitting. | Ch13A Predicting used car value with linear regressions and Ch14A Predicting used car value: log prices | used-cars |
lecture19-lasso | Feature engineering for LASSO: interactions and polynomials. Cross-validation in detail. LASSO (and RIDGE, Elastic Net). Training-test samples and the holdout sample to evaluate predictions. LASSO diagnostics. | Ch14B Predicting AirBnB apartment prices: selecting a regression model | airbnb |
lecture20-regression-tree | Estimating regression tree. Understanding regression trees and comparing them to linear regressions. Tuning and setup of CART. Tree and variable importance plots. | CH15A Predicting used car value with regression trees | used-cars |
lecture21-random-forest | Data cleaning and feature engineering specifics for random forest (RF). Estimating RFs. Examining the results of RFs with variable importance plots and partial dependence plots, and checking the quality of predictions in (important) subgroups. Extreme Gradient Boosting (GBM) via the xgboost package. Prediction comparisons (prediction horse race) for OLS, LASSO, CART, RF, and XGBM. | Ch16A Predicting apartment prices with random forest | airbnb |
lecture22-classification | Predicting probabilities and classification with machine learning tools. Cross validated logit models. LASSO with logit, CART, and Random Forest. Classification of probabilities, ROC curve, and AUC. Confusion Matrix. Model comparison via RMSE or AUC. User-defined loss function to weight false-positive and false-negative rate. Optimizing threshold value for classification to get best loss function value. | CH17A Predicting firm exit: probability and classification | bisnode-firms |
lecture23-long-term-time-series | Forecasting time series data in the long run. Feature engineering with time series, deciding on transformations for stationarity. Cross-validation options with time series. Modeling with deterministic trend, seasonality, and other dummy variables for a long-term horizon. Evaluation of model and forecast precision. prophet as a machine learning tool for time series data. | Ch18A Forecasting daily ticket sales for a swimming pool | swim-transactions |
lecture24-short-term-time-series | Forecasting time series data in the short run. Feature engineering with time series, deciding on transformations for stationarity. Cross-validation options with time series. ARIMA and VAR models for short-term forecasting. Evaluation of forecasts in the short run: performance on the holdout set, fan charts to assess risks and the stability of forecasting performance over an extended time period. | CH18B Forecasting a house price index | case-shiller-la |
lecture25-matplotlib-vs-plotnine | Same graphs in two separate notebooks, to show that exactly the same graphs can be created with matplotlib (and with its high-level interface, seaborn) and with plotnine. | CH08B How is life expectancy related to the average income of a country? | worldbank-lifeexpectancy |
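As referenced in the lecture14-binary-models row above, here is a minimal logit sketch with statsmodels. The data and variable names are simulated placeholders, not the share-health dataset:

```python
import numpy as np
import pandas as pd
import statsmodels.formula.api as smf

# Simulated binary-outcome data; variable names are placeholders
rng = np.random.default_rng(7)
n = 1_000
df = pd.DataFrame({"smoker": rng.integers(0, 2, n), "age": rng.uniform(50, 80, n)})
latent = -4 + 1.2 * df["smoker"] + 0.05 * df["age"]
df["ill"] = (rng.uniform(size=n) < 1 / (1 + np.exp(-latent))).astype(int)

logit = smf.logit("ill ~ smoker + age", data=df).fit()  # maximum-likelihood logit
print(logit.summary())
print(logit.get_margeff().summary())                    # average marginal effects
```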
Most data science courses use the Anaconda environment for Python. However, we use pip and pipenv, and run Jupyter notebooks from the course's environment. Anaconda is a great tool for data analysis and data science, but once someone goes beyond ad-hoc data analysis and needs to develop and deploy advanced data solutions in a production environment in Python, pip is going to be the way to go.