-
Notifications
You must be signed in to change notification settings - Fork 0
/
Copy pathAssignment4.tex
222 lines (171 loc) · 13.5 KB
/
Assignment4.tex
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
\documentclass[a4paper]{report}
\usepackage{amsthm}
\usepackage{anysize}
\usepackage[english]{babel}
\usepackage{comment}
\usepackage[colorlinks=true,linkcolor=blue,urlcolor=blue]{hyperref}
\usepackage{paralist}
\marginsize{2cm}{2cm}{2cm}{2cm}
\setlength{\parindent}{0pt}
\theoremstyle{definition}
\newtheorem{exercise}{Exercise}
\newcommand{\blankline}{\par\vspace{5mm}}
\newcommand{\tab}{\hspace*{2em}}
\newcommand{\doublequote}{\texttt{"}}
\newcommand{\singlequote}{\char13}
\newcommand{\pipe}{$\vert$}
\begin{document}
\begin{center}
\textsc{\Large Project Big Data}
\blankline
\textbf{\large Assignment 4 (Report \& Presentation)}
\end{center}
You will continue working with data from the ``HUE bedtime procrastination study'' for which you will need to write a report (35\% of your final grade) and give a presentation (20\% of your final grade). A cleaned version of the hue data is available on Canvas (\texttt{\small hue\_week\_4\_2017.csv}), as well as another file that contains data from the post‐study questionnaire that participants filled
out at the end of the study (\texttt{\small hue\_questionnaire.csv}). This file contains the following information:
\blankline
\begin{tabular}{| l | p{0.66\textwidth} |}
\hline
\texttt{gender} & 1 = male, 2 = female \\
\texttt{age} & \\
\texttt{chronotype} & Single item (7-‐point scale), do you consider yourself more of a morning (1) or an evening person? (7) \\
\texttt{bp\_scale} & Dutch version of the Bedtime Procrastination Scale \href{http://selfregulationlab.nl/wp-content/uploads/2012/04/J-Health-Psychol-2016-Kroese-853-62.pdf}{[Kroese et al., 2014]} \\
\texttt{motivation} & Questions pertaining to personality traits related to procrastination. Single item (7-‐point scale), how motivated were you to go to bed on time each night? (1 = not motivated, 7 = very motivated) \\
\texttt{daytime\_sleepiness} & Dutch translation of the Epworth Sleepiness Scale (4-‐point scale from 0-‐3; 8 questions, values summed) \href{http://www.mwjohns.com/wp-content/uploads/2009/murray\_papers/a\_new\_method\_for\_measure\_daytime\_sleepiness\_the\_epworth\_sleepiness\_scale.pdf}{ESS; Johns, 1991}. \\
\texttt{self\_reported\_effectiveness} & Single item (7-‐point scale), do you feel more rested since the intervention \\
\hline
\end{tabular}
\blankline \noindent For the final assignment, you will use Python to examine this post‐questionnaire data in relation to the HUE data file, visualize trends and relationships, look for correlations between factors, test for significant differences between groups and build a regression model to predict bedtime delay. You are required to hand in your Python code to show that all transformations, visualizations and analyses have been done in Python. Your Python code will not be graded, however. You will bundle your findings in a neatly formatted final report of at most 6 pages (cover page excluded). This report should not contain any python code. The structure, content and grading criteria are explained further in this document. The report should be written in English. In order to perform the analyses, a number of transformations on the data still need to be done. To help you along, a Python template will be made available with a recommended structure for your Python code. The following steps must be implemented.
\begin{itemize}
\item Read the hue data file and the questionnaire data file into two
separate pandas DataFrames.
\item Create a new DataFrame that contains the following Series:
\begin{tabular}{| l | p{0.78\textwidth} |}
\hline
\texttt{ID} & Participant ID \\
\texttt{group} & Participant group (1 for experimental, 0 for control) \\
\texttt{delay\_nights} & The number of nights a participant delayed their bedtime (range: 0-12) \\
\texttt{delay\_time} & The mean time in seconds a participant delayed their bedtime (total delay in seconds, divided by the number of observations measured for each individual, rounded to nearest second). \\
\hline
\end{tabular}
Set the index of this new DataFrame to 'ID'. Note that there should only be a single row per participant ID.
\item Fill this new DataFrame by transforming data from the DataFrame about participants' bedtimes (from the hue data file).
\item Merge this new DataFrame with the post-‐questionnaire data and store the resulting DataFrame in a new variable. Perform this joining
operation of the two DataFrames in such a way that the resulting DataFrame only contains IDs that were present in both datasets.
\item Use the \href{http://docs.scipy.org/doc/scipy/reference/stats.html}{scipy.stats} package to calculate correlations between the following sets of determinants:
\begin{itemize}
\item bedtime procrastination scale (\texttt{bp\_scale}, a personality trait) and mean time spent delaying bedtime. Use the ``Pearson correlation tests'' to calculate the correlation.
\item age and mean time spent delaying bedtime. Use the ``Kendall rank correlation test'' to calculate the correlation.
\item mean time spent delaying bedtime and daytime sleepiness. Use the ``Pearson correlation test'' to calculate the correlation.
\end{itemize}
\item Use the \href{http://docs.scipy.org/doc/scipy/reference/stats.html}{scipy.stats} package to determine whether there are significant differences between the experimental group and the control group in terms of:
\begin{itemize}
\item the number of nights participants delayed their bedtime
\item the time participants spent in bed each night
\item the mean time participants spent delaying their bedtime
\end{itemize}
Use knowledge gained in the course `Statistics' to determine which statistical test is appropriate: the t-test or the Wilcoxon rank-sum test. Explain your choice in the Data analyses section of your report.
\item For bonus points: When interpreting the findings from the comparisons, adjust the significance level (alpha) to account for an inflation of Type I errors due to performing multiple comparisons. For example, you could use a Bonferroni correction (feel free to consult a reference work on how to adjust the significance level to counteract the inflation of Type I errors).
\item Formulate and concisely argue for a hypothesis about which factor or factors (max. three) you believe would best predict delay\_time. Write your hypothesis down in the Data analyses section of your report. Note that you should theorize about why you think these factors might be good predictors, and not use stepwise regression to identify the strongest predictors.
\item Use \texttt{\small statsmodels.api} to build a regression model that uses your three hypothesized determinants to predict \texttt{delay\_time} (see page 1 of this document). Make sure that \texttt{delay\_time} is not included in the list of predictors. Again, do not use stepwise regression, but use the regression model to test your hypothesis.
\item Create three distinct, meaningful, well-crafted visualizations that either provide insight into the data, or help support your conclusions. This means creating three different kinds of plots (not three boxplots, or three scatterplots for example).
\item Interpret and discuss your findings in the Discussion section of your report.
\item Write a succinct conclusion.
\end{itemize}
Report your findings in a separate document. This document should contain
the following sections:
\begin{description}
\item[Introduction (100 words)]
Describe concisely what your report is about. State the research
question
or research questions with which the report is concerned.
\item[Data description and exploration (500 words)]
Describe the variables in your final, merged DataFrame: what does each
value represent?
For each variable, report appropriate descriptives (e.g., mean and
standard deviation, or frequencies, etc.) Mention strange values, and
possible outliers.
This is a good place to include one or more visualizations that help the
reader to better understand the data.
\item[Data analyses (400 words)]
Describe the statistical analyses that you performed, and explain why
those were the appropriate tests for the data. Report your findings,
but don't discuss them yet. This is another place where visualizations can provide insight into the findings.
\item[Discussion (400 words)]
Discuss your findings. Are correlations high or low? Does that make sense? Are differences significant? How robust do you think your findings are?
\item[Conclusion (100 words)]
What conclusions could be drawn from your analyses? Prefer nuanced statements to bold, sweeping claims.
\end{description}
The final assignment should be done in the same groups as in the previous weeks, and must be handed in via Canvas on June 29 by 23:59 hours. There will be two separate assignments on Canvas, one for your Python code and one for your report. You must submit one file to each assignment (one python file and one PDF document, respectively).
\blankline
The report should be formatted as a PDF document and must have the structure as described above. The word count per section is the maximum
number of words allowed, excluding tables, captions and figures (note that this word count is strict). A table of contents is not necessary
and should be left out. The report contains page numbers on each page in the bottom right corner. The report must have a front page containing
the names and student numbers of the authors, the title of the report, the name of the course, and the date of submission.
\blankline
Each group should hand in only one solution. If two Python files or two PDF files are submitted, only the first submission will be graded.
\section*{Criteria for final report and presentation}
\paragraph{Criteria for final report}
\begin{enumerate}
\item Quality of written text (10\%)
\begin{compactitem}
\item The text is free from grammatical errors and spelling mistakes
\item The text has a clear and logical structure
\end{compactitem}
\item Quality of data analyses (40\%)
\begin{compactitem}
\item Python code is complete and correct (see individual assignments for the credits that can be earned per component)
\item The reported descriptive statistics are complete (for example, not just the mean, but also the standard deviation)
\item Assumptions for statistical tests have been checked
\item The appropriate statistical tests have been applied. When there is doubt, the choice is motivated.
\end{compactitem}
\item Quality of the visualizations (20\%)
\begin{compactitem}
\item All graphs have descriptive labels on their axes
\item The values on the axes have units
\item The intervals of the values on the axes are suitable
\item The graph is clear and legible, and uses an appropriate font size
\item When appropriate, graphs contain a legible and clear legend
\item The use of color in the graphs is helpful for understanding the graphs
\end{compactitem}
\item Quality of the interpretations of the analyses (20\%)
\begin{compactitem}
\item Interpretations are appropriately nuanced
\item Interpretations are adequately motivated
\item Alternative interpretations are discussed
\end{compactitem}
\item Formal requirements (10\%)
\begin{compactitem}
\item The report should be handed in as a PDF document
\item Both the Python file and the final report have the correct filenames
\item The report contains a cover page that includes the names and student numbers of the authors, the name of the course, and the date on which the report was handed in
\item The report contains page numbers in the bottom right corner
\end{compactitem}
\end{enumerate}
\newpage
\paragraph{Criteria for the presentation}
\begin{enumerate}
\item Quality of the content (60\%)
\begin{compactitem}
\item The presentation contains a clear research statement.
\item The presentation is focused and to the point. It is tailored towards its target audience (students Business Analytics before they take Project Big Data).
\item Through the presentation, it becomes clear what the motivation for the research statement is.
\item The presentation contains visualizations of the data that support the main narrative of the presentation. These visualizations should be made with the presentation (projector, etc.) in mind.
\item The presentation has a clear and transparent structure.
\item The presentation offers relevant points for discussion on the basis of the performed analyses.
\end{compactitem}
\item Presentation skills (30\%)
\begin{compactitem}
\item The speaker speaks clearly, audibly, and with good pace.
\item The speaker keeps everyone in the audience engaged (eye contact, etc.).
\item The speaker uses his or her hands for non-verbal communication.
\item The speaker uses body language to convey confidence.
\item The speaker stays within the allotted time.
\item The speaker responds to questions adequately.
\end{compactitem}
\item Formal requirements (10\%)
\begin{compactitem}
\item The presentation is accompanied with a slide deck. The first slide contains the title of the presentation, the speakers' names, the date, and the title of the course.
\item The slides contain sources where appropriate (e.g., citations, borrowed figures, etc.)
\end{compactitem}
\end{enumerate}
\end{document}