forked from rdpeng/RepData_PeerAssessment1
-
Notifications
You must be signed in to change notification settings - Fork 0
/
PA1_template.Rmd
182 lines (150 loc) · 6.58 KB
/
PA1_template.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
# Reproducible Research: Peer Assessment 1
## Preparation
The rest of the document assumes that the activity data is available in CSV format.
The CSV may be in a ZIP file, in which case it will be extracted from there.
The following variables need to be set to give the proper location and file names.
*Make changes to these variables if your locations or names differ.*
```{r Parameters}
activity.data.dir<-"."
activity.data.csv<-"activity.csv"
activity.data.zip<-"activity.zip"
```
At this point we'll define some utility functions that we'll use later.
```{r Functions}
# Format a number with coma separators and at most one decimal digit
fmt <- function(n) format(n,big.mark=',',digits=ceiling(log10(n))+1)
# Convert an interval given as an HHMM decimal number to minutes.
interval2Minute <- function(i) i%/%100*60+i%%60
```
We'll make use of the Lattice plotting system below, hence we load it here.
```{r Load-libaries}
library(lattice)
```
## Loading and preprocessing the data
First define variables with full file names for ease of use:
```{r Full-file-names}
activity.file<-paste(activity.data.dir,activity.data.csv,sep="/")
activity.zip <-paste(activity.data.dir,activity.data.zip,sep="/")
```
```{r DataSource}
source <- ifelse(file.exists(activity.file),
activity.file,
sprintf("%s[%s]",activity.zip,activity.data.csv))
```
If the data has already been unzipped just load it, otherwise get it from the zip file.
```{r Load-the-data}
if (file.exists(activity.file)) {
data <- read.csv(activity.file)
} else {
data <- read.csv(unz(activity.zip,activity.data.csv))
}
```
```{r Load-outcome}
load.status=ifelse(length(data)>0,"successfully loaded","failed to load")
```
**We `r load.status` the data from `r source`**.
The data looks like this:
```{r Show-data-shape}
str(data)
```
## What is mean total number of steps taken per day?
First let's aggregate the steps taken broken down by day. Note that this version of
aggregate ignores missing values by default.
```{r Aggregate-steps}
steps.day<-aggregate(steps ~ date,data,sum)
```
We can calculate the mean and median of the totals:
```{r Calculate-mean}
steps.mean <-mean(steps.day$steps)
steps.median<-median(steps.day$steps)
```
Now lets draw a histogram of # of steps taken on a day by how many days we took that many steps.
The dashed line is the mean number of steps taken on a day.
```{r Fig1-Steps-per-day}
hist(steps.day$steps,breaks=20,main="",ylab="# of days",xlab="Total steps / day",col="red")
lines(c(steps.mean,steps.mean),c(-5,15),lty=3,lwd=3)
```
The **mean** of the steps taken per day is **`r fmt(steps.mean)`**
while the **median** is **`r fmt(steps.median)`**.
## What is the average daily activity pattern?
Let's compute the average number of steps taken accross all days in a given
5 minute interval. Also construct a vector of unique interval identifiers; we'll
have occasion to use this variable again below.
```{r Average-by-interval}
iMean<-function(i) round(mean(data[data$interval==i,'steps'],na.rm=T))
intervals <- unique(data$interval)
meansByInterval<-data.frame(interval=intervals,avg=sapply(intervals,iMean))
```
All we need to do now is plot it; Lattice will be used.
```{r Fig2-Daily-Activity-Pattern}
xyplot(avg ~ interval,meansByInterval,type='l',ylab='Number of steps',xlab='Interval')
```
The maxium average across all intervals is now computed as well as the set of intervals
in which it occurs:
```{r Max-average}
maxAverage<-max(meansByInterval$avg)
maxAvgIntervals<-meansByInterval[meansByInterval$avg==maxAverage,'interval']
```
The maximum average number of steps in any one interval is **`r maxAverage`** and
it occurs in the interval(s) **`r maxAvgIntervals`.**
## Imputing missing values
We notice that there are many observations with no data for steps. How many?
```{r Count-NAs}
missing.values=sum(is.na(data$steps))
missing.pct=missing.values/length(data$steps)*100
```
There are in fact **`r fmt(missing.values)`** missing values, or
**`r fmt(missing.pct)`%** of the observations.
This might have a non-negligible effect on our computations so let's try to
replace the missing measurements with some reasonable value. An easy choice is
to take the mean for the given interval accross all days.
I don't have time right now to think of a more elegant way to do this, so I'll
do it with a loop:
```{r Replace-NAs}
I.data <- data
for (intervalID in intervals) {
I.data[is.na(data$steps) & data$interval==intervalID,]$steps <-
mean(data[data$interval==intervalID,'steps'],na.rm=T)
}
```
Let's repeat what we did before for total number of steps with the imputed data.
```{r Imputed-data-agregation-means-and -histogram}
I.steps.day<-aggregate(steps ~ date,I.data,sum)
I.steps.mean <-mean(I.steps.day$steps)
I.steps.median<-median(I.steps.day$steps)
```
```{r Fig3-Steps-per-day-Imputed-Values}
hist(I.steps.day$steps,breaks=20,main="",ylab="# of days",xlab="Total steps / day",col="red")
lines(c(I.steps.mean,I.steps.mean),c(-5,25),lty=3,lwd=3)
```
Now using the modfied data the **mean** of the steps taken per day is **`r fmt(I.steps.mean)`**
while the **median** is **`r fmt(I.steps.median)`**.
What we observe is that the mean and median changed only slightly comapred to computations
on the raw data (this is no surprise since we replaced missing values with the mean value
accross all other days). However the *distribution* has changed dramatically as is
confirmed by jus a glance at the two histograms.
## Are there differences in activity patterns between weekdays and weekends?
For some more analysis let's add a variable to the data to diffrentiate between
weekdays and weekends.
```{r Insert-day-types}
I.data$day<-factor(weekdays(as.Date(I.data$date)) %in% c("Saturday","Sunday"),
labels=c("weekday","weekend"))
```
The we'll build a data frame which has 3 columns:
1. The interval
2. A day type: weekday or weekend
3. The average number of steps for that interval accross all days of the given type.
We compute this in two steps: first for weekdays, then for weekends.
```{r Build-typed-means}
typedMean<-function(i,d) round(mean(I.data[I.data$interval==i & I.data$day==d,'steps']))
minutes<-interval2Minute(intervals)
typedMeansByInterval<-rbind(
data.frame(interval=intervals,day='weekday',avg=sapply(intervals,typedMean,'weekday')),
data.frame(interval=intervals,day='weekend',avg=sapply(intervals,typedMean,'weekend')))
```
OK ... so lets do some visialization. We'll use Lattice for this, because it
makes it so easy to get panels.
```{r Fig4-Averages-by-day-of-week}
xyplot(avg ~ interval | day,typedMeansByInterval,
layout=c(1,2),type='l',ylab='Number of steps',xlab='Interval')
```