Nick Hagerty, Montana State University
Except where otherwise noted, this work is licensed under Creative Commons BY-NC-SA 4.0.
Fall 2024
- About R
- Operators
- Objects and functions
- Data frames
- Vectors
- Indexing
- If/else statements
- For-loops
- Functions
- Vectorization
- Parallelization
Topic 3: Productivity Tools
- Philosophy of tidy data
- Wrangling data with dplyr
- Joining data with dplyr
- Tidying data with tidyr
- Importing data with readr
- Join safety
- Keys and relational data
- String cleaning
- Number storage
- Data Cleaning Checklist (pdf version)
- Where data comes from
- Webscraping
- Using APIs
Topic 7: Best Practices of Coding and Workflows
- The perils of bad data cleaning
- Reproducibility and transparency
- Best practices (code organization, file organization, version control, abstraction, commenting, unit tests)
Topic 8: Distinguishing Goals of Data Analysis
- The Data Generating Process
- Potential outcomes, counterfactuals, and causal inference
- Descriptive, Predictive, or Causal Analysis?
Topic 9: Exploratory Analysis
- Part 1: Understanding variables
- Summaries, frequency tables and crosstabs in R
- Characterizing distributions
- Handling extreme values
- Handling variable transformations
- Handling missing data
- Part 2: Understanding relationships
- Characterizing relationships
- Binscatter
- The Conditional Expectation Function
- Adjusting for other variables
- Bin smoothing and local regression
- Basic regression in R
- Indicator and interaction terms
- Econometrics packages in R
- Modeling nonlinear relationships
- Using regression models for prediction
- Basics of ggplot2
- Plotting examples
- Colors and themes
- Principles of data visualization
- Case studies
Topic 12: Spatial Analysis
- Intro to Geospatial Data
- Part 1
- Spatial data and quick mapping
- Reference systems and projections
- Part 2
- Spatial queries (measurement, relationships)
- Spatial subsetting
- Geometry operations
- Spatial joins
Topic 13: Machine Learning Fundamentals
- Overview: Statistical learning
- Assessing model accuracy
- Cross-validation
Topic 14: Prediction Methods
- Part 1: Shrinkage methods
- Shrinkage methods
- Ridge regression
- Lasso (and elasticnet)
- Part 2: Learning with tidymodels
- Setup and splitting
- Recipes
- Workflows
- Tuning
- Prediction
- Dependence
Topic 15: Machine Learning in Economics
- Predicting outcomes
- Constructing new data
- Selecting covariates
- Predicting causal effects
Topic 16: Databases and Big Data
- Tools for big data
- Databases in R
- Writing SQL queries
- Getting started with BigQuery
ML Methods for Classification Tasks
- Part 1: Methods
- Classification
- Logistic regression
- k-nearest neighbors
- Model assessment
- Decision trees
- Part 2: Examples
- Logistic regression and KNN
- Cross-validation
- Decision trees
- Teach your laptop to read
By Laura Sikoski
- Lab 0: Getting started with R
- Lab 1: Intro
- Lab 2: Cleaning
- Lab 3: Merges
- Lab 4: Pivots
- Lab 5: Regressions
- Lab 6: If/else statements
- Lab 7: For loops
- Lab 8: Functions
- Lab 9: Intro to ggplot2
This is a list of further resources that you may find helpful throughout (and after!) this course. Start with the course materials above, but check these out for alternative explanations or if you want to take a deeper dive into a particular topic. If one isn't speaking to you, try another.
- Introduction to Data Science Ch. 3: R Basics (Rafael A. Irizarry). Data types, data frames, vectors, indexing, basic plots.
- Modern Data Science with R Appendix B: Introduction to R and RStudio (Baumer, Kaplan, and Horton). Installation, help, objects, vectors, indexing, operators, lists, matrices, data frames, attributes and classes, functions, packages.
- R for Social Scientists (The Carpentries).
- Big Data in Economics Lecture 4: R Language Basics (Grant McDermott). Slides. Logic, evaluation, assignment, help, objects, names, indexing, lists.
- Cheat Sheet: Base R (RStudio).
- DataCamp tutorials:
- Introduction to R (free to everyone)
- Intermediate R (free for 6 months for enrolled students)
- Introduction to Data Science Ch. 4: Programming Basics (Rafael A. Irizarry). If/else, writing functions, for-loops, vectorization, functionals.
- Intermediate R (Montana State University R Workshops Team). Relational operators, logicals, conditional statements, loops, functions.
- R for Data Science Ch. 17-21 (Hadley Wickham). Pipes, functions, vectors, iteration.
- Big Data in Economics (Grant McDermott):
- Lecture 10: Functions in R: (1) Introductory Concepts. Function syntax, control flow, iteration, vectorization. The section on functional programming uses the tidyverse, which we're covering later in the course.
- Lecture 11: Functions in R: (2) Advanced Concepts. Debugging, catching user errors, caching.
- Lecture 12: Parallel programming.
- DataCamp tutorial (free for 6 months for enrolled students): Introduction to Writing Functions in R.
- Intro to R Markdown (Danielle Navarro). Nice overview of why we're using R Markdown, and examples of how to use it.
- Using R Markdown for Class Assignments (Nate Taback). A pretty quick overview.
- Cheat Sheet: R Markdown (RStudio). Most of what you want to know on 2 pages.
- R Markdown Reference Guide (RStudio). A bit more comprehensive than the cheat sheet.
- R for Data Science Ch. 27: R Markdown (Hadley Wickham). Comprehensive introduction.
- GitHub Guides: Hello World. Branches and merges in GitHub.
- GitHub Guides: Forking Projects. Forking and pull requests in GitHub.
- Happy Git and GitHub for the useR (Jenny Bryan). How to take advantage of RStudio's convenient built-in features that integrate Git and GitHub.
- Big Data in Economics Lecture 2: Version Control with Git(Hub) (Grant McDermott). Getting started with Git at the command line.
- Data Science for Economists Ch. 3: Using Git and GitHub.com (Tyler Ransom). More on Git at the command line.
- Wrangling, Analyzing and Exporting Data with the Tidyverse (Montana State University R Workshops Team). Interactive tutorial.
- Data Wrangling and Manipulation in R (UC Berkeley D-Lab). Slides with coding examples. Functions, for-loops, if/else, Monte Carlo simulations.
- Modern Data Science with R Chapters 4-6 (Baumer, Kaplan, and Horton). Chapter 4: Data wrangling with dplyr. Chapter 5: Joins. Chapter 6: Tidy data and tidyr.
- ModernDive Chapter 3: Data Wrangling (Ismay & Kim). Data wrangling with dplyr.
- RStudio Cheat Sheets:
- DataCamp tutorials (free for 6 months for enrolled students):
- Tidyverse Skills for Data Science Chapter 3 (Carrie Wright, Shannon E. Ellis, Stephanie C. Hicks and Roger D. Peng). Covers the basics of working with factors, dates and times, strings, and text as data.
- R for Data Science (Hadley Wickham):
- RStudio Cheat Sheets:
- Quartz Guide to Bad Data.
- How to Find Data: Tips for Finding Data (Davidson College Library).
- Data Sets for Quantitative Research: Public Use Datasets (University of Missouri Libraries).
- Data Science for Economists Lecture 6: Webscraping: (1) Server-side and CSS (Grant McDermott).
- Data Science for Economists Lecture 7: Webscraping: (2) Client-side and APIs (Grant McDermott).
- An Introduction to APIs (Zapier).
- Evidence on Research Transparency in Economics (Edward Miguel, Journal of Economic Perspectives 2021).
- Best Practices for Scientific Computing (Wilson et al. 2014).
- Code and Data for the Social Sciences: A Practitioner's Guide (Gentzkow & Shapiro 2014).
- Coding for Economists: A Language-Agnostic Guide to Programming for Economists (Ljubica "LJ" Ristovska 2019).
- The tidyverse style guide (Hadley Wickham).
- Thinking Clearly with Data (Ethan Bueno de Mesquita and Anthony Fowler 2021, Princeton University Press).
- The Effect: An Introduction to Research Design and Causality Chapter 5: Identification (Nick Huntington-Klein).
- Causal Inference: The Mixtape Chapter 4: Potential Outcomes Causal Model (Scott Cunningham).
- Introduction to Data Science Chapter 8: Visualizing data distributions (Rafael A. Irizarry). Histograms, density plots, stratification.
- Introduction to Data Science Chapter 28: Smoothing (Rafael A. Irizarry). Bin smoothing, kernels, and local regression.
- DataCamp tutorial (free for 6 months for enrolled students): Exploratory Data Analysis in R.
- Big Data in Economics Lecture 8: Regression Analysis in R (Grant McDermott).
- Prediction and Machine Learning Lab 4: Regression with R (Ed Rubin).
- ISLR Ch. 7: Moving Beyond Linearity (James, Witten, Hastie, Tibshirani). Polynomial regressions, step functions, splines.
- DataCamp tutorials (free for 6 months for enrolled students):
- Introduction to Data Science Chapters 6-10: Data Visualization (Rafael A. Irizarry).
- Modern Data Science with R Chapters 2-3 (Baumer, Kaplan, and Horton). Chapter 2: Principles of data visualization. Chapter 3: Plotting with ggplot2.
- Data Visualization: A practical introduction (Kieran Healy). Online book for both principles and methods/examples.
- From Data to Viz (Yan Holtz & Conor Healy). "Leads you to the most appropriate graph for your data. It links to the code to build it and lists common caveats you should avoid."
- Cheat Sheet: Data visualization with ggplot2 (RStudio).
- An Economist's Guide to Visualizing Data (Jonathan Schwabish, Journal of Economic Perspectives 2014).
- DataCamp tutorials (free for 6 months for enrolled students):
- R Geospatial Fundamentals (UC Berkeley D-Lab). Great tutorial. Core concepts, vector data, spatial analysis, raster data.
- Introduction to Geospatial Raster and Vector Data with R (The Carpentries). Another great tutorial, though focused more on ecology than social science.
- Modern Data Science with R Chapters 17-18 (Baumer, Kaplan, and Horton). Chapter 17: Working with geospatial data. Chapter 18: Geospatial computations.
- Geocomputation with R (Lovelace, Nowosad, and Muenchow). Comprehensive treatment of GIS tools in R.
- Raster Analysis with terra (Aaron Maxwell). Up-to-date introduction to working with raster data in R.
- ISLR Ch. 2: Statistical Learning (James, Witten, Hastie, Tibshirani). Statistical learning, assessing model accuracy.
- ISLR Ch. 5: Resampling Methods (James, Witten, Hastie, Tibshirani). Cross-validation, the bootstrap.
- Introduction to Data Science Chapters 27 & 29 (Rafael A. Irizarry). Chapter 27: Introduction to machine learning. Chapter 29: Cross validation.
- Prediction and Machine Learning Lectures 0-3 (Ed Rubin).
- Lecture 000: Overview (Why predict?)
- Lecture 001: Statistical learning foundations
- Lecture 002: Model accuracy
- Lecture 003: Resampling Methods
- ISLR Ch. 6: Linear Model Selection & Regularization (James, Witten, Hastie, Tibshirani). Subset selection, shrinkage (ridge, lasso), dimension reduction.
- Prediction and Machine Learning Lecture 5: Shrinkage methods (Ed Rubin).
- ISLR Ch. 4: Classification (James, Witten, Hastie, Tibshirani). Logistic regression, discriminant analysis, naive Bayes.
- Prediction and Machine Learning Lecture 6: Classification (Ed Rubin).
- Modern Data Science with R Chapters 10-11 (Baumer, Kaplan, and Horton). Chapter 10: Predictive modeling. Chapter 11: Supervised learning.
- Introduction to Data Science Chapters 31-32 (Rafael A. Irizarry). Chapter 31: Examples of algorithms. Chapter 32: Machine learning in practice.
- ISLR tidymodels Labs (Emil Hvitfeldt). All labs from ISLR written using the tidymodels library.
- Prediction and Machine Learning Labs (Ed Rubin and Stephen Reed).
- Kaggle notebooks on "tidymodels-ing"
- Labs 3-5
- ISLR Ch. 12: Unsupervised Learning (James, Witten, Hastie, Tibshirani).
- Modern Data Science with R Chapter 12: Unsupervised Learning (Baumer, Kaplan, and Horton).
- Introduction to Data Science Chapter 34: Clustering (Rafael A. Irizarry).
- ISLR (James, Witten, Hastie, Tibshirani).
- Ch. 8: Tree-Based Methods
- Ch. 9: Support Vector Machines
- Ch. 10: Deep Learning
- Prediction and Machine Learning Lectures (Ed Rubin).
- Lecture 007: Decision Trees
- Lecture 008: Ensemble Methods
- Lecture 009: Support Vector Machines
- "Machine Learning: An Applied Econometric Approach" (Mullainathan & Spiess, Journal of Economic Perspectives 2017).
- "Beyond prediction: Using big data for policy problems" (Susan Athey, Science 2017).
- "The Impact of Machine Learning on Economics" (Susan Athey, 2019).
- "Machinistas meet randomistas: useful ML tools for empirical researchers" (Esther Duflo, NBER Summer Institute Master Lecture 2018).
- Slides on Machine Learning (Colin Cameron).
- "Machine Learning for Economists" (Dario Sansone). Long list of resources, applications, and citations.
- Data Science for Economists Lecture 16: Databases (Grant McDermott).
- SQL for R Users (UC Berkeley D-Lab).
- Data Management with SQL for Social Scientists (The Carpentries).
- Modern Data Science with R Chapter 15: Database querying using SQL (Baumer, Kaplan, and Horton).
- Data Science for Economists (Grant McDermott):
- Modern Data Science with R Chapter 21: Epilogue: Towards "big data" (Baumer, Kaplan, and Horton).