Part 0: Introduction
**Data science articulated, data science examples, history and context, technology landscape**
Readings
- (example) Yong-Yeol Ahn, Sebastian E. Ahnert, James P. Bagrow, Albert-László Barabási, Flavor network and the principles of food pairing, Scientific Reports 1, Article number: 196 doi:10.1038/srep00196
- (example) Acerbi A, Lampos V, Garnett P, Bentley RA (2013) The Expression of Emotions in 20th Century Books. PLoS ONE 8(3): e59030. doi:10.1371/journal.pone.0059030
- (example) Google Flu Trends (plus: David Wagner, Google Flu Trends Wildly Overestimated This Year's Flu Outbreak, Atlantic Wire, February 13, 2013)
- Eigenfactor, and publications
- (example) L'Aquila quake: Italy scientists guilty of manslaughter, BBC
- Drew Conway's Venn Diagram
- Mike Loukides, What is data science?, O'Reilly Radar, 2010
- Mike Driscoll, "The Seven Secrets of Successful Data Scientists", Dataspora
- Origins of "Volume, Velocity, Variety"
- eScience: The Fourth Paradigm (Foreword and Introduction, pages xi-xxxi; Gray's Laws, pages 5-12)
- Chris Anderson, "The End of Theory: The Data Deluge Makes the Scientific Method Obsolete", Wired magazine, 2008
- Responses to Chris Anderson, 2008
Part 1: Data Manipulation, at Scale
Databases and the relational algebra
Readings
- How Vertica Was the Star of the Obama Campaign, and Other Revelations
- E. F. Codd, 1981 Turing Award Lecture, "Relational Database: A Practical Foundation for Productivity", 1981 (Think about which arguments from this short piece are still relevant today.)
- [Advanced] Cohen et al. "MAD Skills: New Analysis Practices for Big Data", 2009
- [Advanced] Erik Meijer, Gavin Bierman, "A co-Relational Model of Data for Large Shared Data Banks", Communications of the ACM, 2011
**MapReduce, Hadoop, relationship to databases, algorithms, extensions, language; key-value stores and NoSQL; tradeoffs of SQL and NoSQL**
Readings
- Ullman, Rajaraman, Mining of Massive Datasets, Chapter 2
- Stonebraker et al., "MapReduce and Parallel DBMSs: Friends or Foes?", Communications of the ACM, January 2010.
- Dean and Ghemawat, "MapReduce: A Flexible Data Processing Tool", Communications of the ACM, January 2010.
- Rick Cattell, "Scalable SQL and NoSQL Data Stores", SIGMOD Record, December 2010 (39:4)
- Michael Armbrust, Reynold S. Xin, Cheng Lian, Yin Huai, Davies Liu, Joseph K. Bradley, Xiangrui Meng, Tomer Kaftan, Michael J. Franklin, Ali Ghodsi, and Matei Zaharia. 2015. Spark SQL: Relational Data Processing in Spark. SIGMOD '15
Data cleaning, entity resolution, data integration, information extraction
Readings
- Elmagarmid et al., "Duplicate Record Detection: A Survey"
- Koudas et al., "Record Linkage: Similarity Measures and Algorithms"
Part 2: Analytics
Basic statistical modeling, experiment design
Readings
- Chapter 3 of A Handbook of Statistical Analyses Using R
- Gregory Park on overfitting to the leaderboard in a Kaggle Competition
Introduction to Machine Learning, supervised learning, decision trees/forests, simple nearest neighbor
Readings
- Xindong Wu et al., Top 10 Algorithms in Data Mining, Knowledge and Information Systems, 14(2008), 1: 1-37. (read C4.5)
- Ullman, Rajaraman, Mining of Massive Datasets, Chapter 1
- Pedro Domingos, A Few Useful Things to Know about Machine Learning, CACM 55(10), 2012
Unsupervised learning: k-means, multi-dimensional scaling
Readings
- Xindong Wu et al., Top 10 Algorithms in Data Mining, Knowledge and Information Systems, 14(2008), 1: 1-37. (read k-means)
Part 3: Interpreting and Communicating Results
**Visualization, visual data analytics**
Readings (well, watchings)
- Hans Rosling, The Joy of Stats
- Pat Hanrahan, Tools for Data Enthusiasts
- Jeffrey Heer, Michael Bostock, Vadim Ogievetsky, A Tour through the Visualization Zoo, Communications of the ACM, Volume 53 Issue 6, June 2010
Ethics, privacy
- Howard Wen, "Big Ethics for Big Data", O'Reilly Media
- John Markoff, New York Times, Unreported Side Effects of Drugs Are Found Using Internet Search Data, March 13, 2013
- Mike Loukides, Data Skepticism, O'Reilly Media, April 2013
Part 4: Special Topics
- Graph Analytics: PageRank, community detection, recursive queries, iterative processing
- Guest Lecture: Datameer
- Guest Lecture: Wibidata
Readings
- Dan McKinley, Whom the Gods Would Destroy, They First Give Real-Time Analytics
- Joseph Wilk, Latent Semantic Analysis in Python