\documentclass[11pt]{article}
\pagestyle{empty}
\setlength{\textheight}{8.5in}
\setlength{\topmargin}{0.5in}
\setlength{\headheight}{0in}
\setlength{\headsep}{0in}
% %% \setlength{\footheight}{0in}
\setlength{\oddsidemargin}{0in}
\setlength{\textwidth}{6.5in}
\usepackage{times}
\usepackage{url}
\usepackage{algorithm}
\usepackage{mathtools}
\usepackage{mathptmx}
\usepackage{amssymb}
\usepackage{color}
\begin{document}
\sloppy
\begin{center}
\LARGE CAS CS 591\\
\Large Computational Tools for Data Science\\
\Large\rm Fall 2016\\~\\
\end{center}
\noindent{\large\bf Meeting Place:} SCI 117\\[\baselineskip]
\noindent{\large\bf Meeting Time:} TR 11-12:30
\\[\baselineskip]
\noindent{\large\bf Instructor:} Prof.\ Mark Crovella\\[0.75\baselineskip]
\begin{minipage}[t]{0.60\textwidth}
\begin{itemize}
\item {\bf Office:} MCS-140E
\item {\bf Office Hours:} {\small M 2-3:30, R 3-4:30}
\item {\bf Email:} [email protected]
\end{itemize}
\end{minipage}
~\\~\\~\\~\\
\noindent{\large\bf Teaching Fellow:} Ms.\ Katherine Missimer\\[0.75\baselineskip]
\begin{minipage}[t]{0.60\textwidth}
\begin{itemize}
\item {\bf Office Hours:} {\small W 4-5:30, F 5-6:30}
\item {\bf Office Hours Location:} Undergrad Lab, EMA 302
\item {\bf Lab Tutoring Hours:} {\small F 3-5.}
\item {\bf Email:} [email protected]
\end{itemize}
\end{minipage}
\section*{Overview of the Course}
This course is targeted at students who require a basic level of
proficiency in working with and analyzing data. The course emphasizes
practical skills in working with data, while introducing students to a
wide range of techniques that are commonly used in the analysis of data,
such as clustering, classification, regression, and network analysis.
The goal of the class is to provide to students a hands-on understanding
of classical data analysis techniques and to develop proficiency in
applying these techniques in a modern programming language (Python).

Broadly speaking, the course breaks down into three main components,
which we will take in order of increasing complexity: (a)
unsupervised methods; (b) supervised methods; and (c) methods for
structured data.

Lectures will present the fundamentals of each technique; the focus is not
on the theoretical underpinnings of the methods, but rather on helping
students understand the practical settings in which these methods are
useful. Class discussion will study use cases and will go over relevant
Python packages that will enable the students to perform hands-on
experiments with their data.

Prerequisites: Students taking this class must have some prior familiarity with
programming, at the level of CS 105, 108, or 111, or equivalent. CS
132 or equivalent (MA 242, MA 442) is required. CS 112 is also helpful.
\section*{Learning Outcomes}
Students who successfully complete this course will be proficient in
data acquisition, manipulation, and analysis. They will have good
working knowledge of the most commonly used methods of clustering,
classification, and regression. They will also understand the
efficiency issues and systems issues related to working on very large
datasets.
\section*{Readings}
There is no text for the course. Lecture notes will be posted online.
Some of the lectures are based on \emph{Introduction to Data Mining,} by
Tan, Steinbach and Kumar. This is a good place to go for more detail if
something is not clear.

Some other recommended texts are:
\begin{enumerate}
\item Python for Data Analysis
(\url{http://shop.oreilly.com/product/0636920023784.do}). This is the
definitive text for \emph{Pandas}, which we will use quite a bit.
\item Programming Collective Intelligence (\url{http://shop.oreilly.com/product/9780596529321.do})
\end{enumerate}
\section*{Web Resources}
The slides I use are actually executable Python scripts, using the
\texttt{jupyter notebook}. You can
download and execute the lectures on your own computer, and you can
modify them any way you'd like, play around with them, experiment, etc.

The slides I use in lecture are published on \texttt{github}. The
repository is
\url{https://github.com/mcrovella/CS505-Data-Science-in-Python}. If you want
to access the repository using \texttt{git}, please feel free. If you
find a bug, feel free to submit a pull request.
\section*{Homeworks and Project}
\begin{enumerate}
\item There will be nine homework assignments. In a typical
assignment you will
analyze one or more datasets using the tools and techniques presented in
class.

Homeworks will be submitted via \texttt{github}. For this, we
need your github account (create one if you don't already have it).
After you have created it, fill out the form at
\url{https://goo.gl/forms/8W0SOdvMn07UKdip2} to let us know what it is.

You are expected to work individually on homeworks.
\item In addition, there will be a final project. For the project you
will extract some
knowledge or conclusions from the analysis of a dataset of your choice. The analysis
will be done using a subset of the methods we described in class. The
final project will require a proposal, two progress reports, and a final
presentation in poster form.

The project will have three essential components: 1) a data collection
piece (which may involve crawling or calls to an API, combining data
from different sources, etc.), 2) a data analysis piece (which will
involve applying different techniques we described in class for the
analysis), and 3) a conclusion component (where conclusions will be drawn
from the results of the data analysis). The students will submit a 5-page report
clearly explaining all three components of their project. Finally, a
poster presentation will be required, where the students will be prepared
to present their effort and results in front of their poster.

As an example, you may choose to collect data from Twitter related
to a specific topic (e.g., Ebola virus) and then measure the intensity
of posts about that topic in different areas of the world. Other
examples of projects may include (but are not limited to): analysis of
MBTA data, analysis of NYC data, crawling of YouTube (or other social
media data) and analysis of social behavior like trolling, bullying,
etc.

The project is due by the last day of class (December 8). The project presentations will be
given in the form of a final poster explaining components 1, 2 and 3 of
the project.

You are expected to work in teams of two on the final project. I will
leave it up to you to form teams on your own, but everyone must work in
a team.
\end{enumerate}
\section*{Piazza}
We will be using Piazza for class discussion. The system is really well
tuned to getting you help fast and efficiently from classmates, Ms.\ Missimer,
and myself. Rather than emailing questions to the teaching staff,
I encourage you to post your questions on Piazza. Our class Piazza
page is at: \url{https://piazza.com/bu/fall2016/cs505/home}.
We will also use Piazza for distributing materials
such as homeworks and solutions.

When someone posts a question on Piazza, if you know the answer, please
go ahead and post it. However, please \emph{don't} provide answers to homework
questions on Piazza. It's OK to tell people \emph{where to look} to
get answers, or to correct mistakes; just don't provide actual solutions
to homeworks.
\section*{Programming Environment}
We will use \texttt{python} as the language for teaching and for
assignments that require coding. Instructions for installing and
using Python are on Piazza.
\section*{Course and Grading Administration}
Homeworks are due at 7pm on Fridays.

Assignments will be submitted using \texttt{github}. Ms.\ Missimer will
explain how to submit assignments.

\emph{NOTE: IMPORTANT:} Late assignments \textbf{WILL NOT} be accepted.
However, you may submit \textbf{one} homework up to 3 days late. You
\textbf{must} email Ms.\ Missimer before the deadline if you intend to
submit a homework late.

Final grades will be computed based on the following:
\begin{description}
\item[50\%] Homework assignments.
\item[50\%] Final project.
\end{description}
The exact cutoffs for final grades will be determined after the class is
complete.
\newpage
\section*{Academic Honesty}
You may discuss homework assignments with classmates, but you are
solely responsible for what you turn in. Collaboration in the form of
discussion is allowed, but all forms of cheating (copying parts of a
classmate's assignment, plagiarism from books or old posted solutions)
are NOT allowed. We -- both teaching staff and students -- are expected
to abide by the guidelines and rules of the Academic Code of Conduct
(which is at
\url{http://www.bu.edu/academics/policies/academic-conduct-code/}).
Graduate students must also be aware of and abide by the GRS Academic
Conduct code at \url{http://www.bu.edu/cas/students/graduate/forms-policies-procedures/academic-discipline-procedures/}.

You can probably, if you try hard enough, find solutions for homework
problems online. Given the nature of the Internet, this is
inevitable. Let me make a couple of comments about that:
\begin{enumerate}
\item If you are looking online for an answer because you don't know how
to start thinking about a problem, talk to Ms.\ Missimer or myself, who may be
able to give you pointers to get you started. Piazza is great for
this -- you can usually get an answer in an hour if not a few minutes.
\item If you are looking online for an answer because you want to see if
your solution is correct, ask yourself if there is some way to verify
the solution yourself. Usually, there is. You will understand what you have done
\emph{much} better if you do that.

So ... it would be better to simply submit what you have at the deadline
(without going online to cheat) and plan to allocate more time for
homeworks in the future.
\end{enumerate}
\newpage
\section*{Course Schedule}
\small
\begin{centering}
\begin{tabular}{||l|p{3in}|l|l|l||}
\hline\hline
Date & Topics & Reading & Assigned & Due \\
\hline\hline
% complete basics of python
9/6 & Introduction to Python & & HW 0 & \\
% more sophisticated; prep for homework 0
% get pandas and more from
% http://twiecki.github.io/blog/2014/11/18/python-for-data-science/
% also cover APIs - can use some material from CS211 ?
% also cover basic visualization
9/8 & Essential Tools (Git, Jupyter Notebook, Pandas) & & & \\
\hline
% doesnt exist yet; look at textbooks
9/13 & Probability and Statistics Refresher & & & HW 0 \\
% doesnt exist yet; also should develop a homework for this material
% h/w could use APIs and do simple statistics on the retrieved data.
9/15 & Linear Algebra Refresher & & & \\
\hline
% good material but need to integrate PDF and notebook
% needs some additional examples too
9/20 & Numpy, Scikit-learn, Distance and Similarity Functions & & & \\
% & Data Exploration & & & \\
% current slides not really usable b/c they focus on dynamic time
% warping. some OK points, but needs a re-do based on books
9/22 & Intro to Timeseries & & HW 1.1& \\
% Homework 1 is OK here. - distance functions, APIs, timeseries.
\hline
% this is a set of slides in PDF. good basics of clustering. good
% coverage of k-means and k-means++ - just need to convert to notebook
9/27 & Clustering I: k-means & & & \\
% this one is excellent - set in context of sklearn - practical and
% includes examples
9/29 & Clustering II: In practice & &HW 1.2 & \\
9/30 &&&& HW 1.1 \\
\hline
% pretty good PDF just need to convert to notebook
10/4 & Clustering III: Hierarchical Clustering & & & \\
% EM and GMM; short but good and its already started as a notebook
10/6 & Clustering IV: GMM and Expectation Maximization& & HW 2.1, 2.2 & \\
% Homework 2 depends on k-means, hierarchical, and GMM clustering
10/7 &&&& HW 1.2 \\
\hline
10/11 & NO CLASS; Monday Schedule & & & \\
% basically in OK shape; already a notebook, needs a smoothing pass
10/13 & Singular Value Decomposition I : Low Rank Approximation & & & \\
10/14 &&&& HW 2.1 \\
\hline
% OK but key material is in PDF. needs conversion to notebook
% existing notebook is mainly a sketch of next lecture
10/18 & SVD II: Dimensionality Reduction &&&\\
% there is a lot of good basic material here -- TF/IDF and SVD
% but needs LOTS of expansion -- explain ideas behind SVD (not currently
% there)
%
% HW 3.1 is scraping, 3.2 is regression
10/20 & SVD III: Anomaly Detection & & HW 3.1 & \\
10/21 &&&& HW 2.2 \\
\hline
% there is a guest lecture here by Davide on web scraping
% (this is good stuff and
% could be much earlier in course) - Larissa gave lecture - got Davide's
% code - but still needs update to python 3 / BS 4
10/25 & Web Scraping & & & \\
% intro to classification - chapter 4 of tan/steinbach/kumar
10/27 & Classification I: Decision Trees & & & \\
10/28 &&&& \\
\hline
% PDF plus jupyter notebook
11/1 & Classification II: k-Nearest Neighbors & & &\\
% PDF plus jupyter notebook (skip naive bayes?)
11/3 & Classification III: Naive Bayes, SVM & & &\\
11/4 &&&& HW 3.1, Proj Proposal\\
\hline
11/8 & Regression I: Linear Regression && HW 3.2 &\\
% homework 3 OK here; scraping and logistic regression
11/10 & Regression II: Logistic Regression & & & \\
% progr report here is "how much have you scraped?"
11/11 &&&& Prog Report 1\\
\hline
% need to cover mapreduce somewhere! Lecture 21 in spring16
% also spark in lecture 23
11/15 & Regression III: More Linear Regression & & &\\
11/17 & Recommender Systems & & HW 4 & \\
% HW 3.2 is regression
11/18 &&&& HW 3.2\\
\hline
% This prog report is the same as the single one last time
% for HDFS/MR
% also consider https://prezi.com/u0ukvqzpyh5p/apache-hadoop-petabytes-and-terawatts/#
11/22 & Map Reduce & & &Prog Report 2\\
11/24 & NO CLASS; Thanksgiving Break & & & \\
\hline
% HW 5 is on map reduce; requires triangle counting
11/29 & Network Analysis I & & & \\
12/1 & Network Analysis II &&HW 5 &\\
12/2 &&&& HW 4\\
\hline
12/6 & Graph Clustering, Text Analysis & & & \\
12/8 & Poster Session & & &\\
\hline
12/12 &&&& HW 5\\
\hline\hline
\end{tabular}\\
\end{centering}
\end{document}