# Manipulating Vectors and Matrices {#sec-rmatrices}
```{r}
#| include: false
#| message: false
#| warning: false
library(dplyr)
library(readr)
library(haven)
library(ggplot2)
```
::: {.callout .callout-note}
Module originally written by Shiro Kuriwaki and Yon Soo Park
:::
### Motivation {.unnumbered}
[Nunn and Wantchekon (2011)](https://dash.harvard.edu/bitstream/handle/1/11986331/nunn-slave-trade.pdf) -- "The Slave Trade and the Origins of Mistrust in Africa"[^12_matricies-manipulation-1] -- argue that, across African countries, the distrust of co-ethnics fueled by the slave trade has had long-lasting effects on modern-day trust in these territories. They argue that the slave trade created distrust in these societies in part because some African groups were employed by European traders to capture their neighbors and bring them to the slave ships.
[^12_matricies-manipulation-1]: [Nunn, Nathan, and Leonard Wantchekon. 2011. “The Slave Trade and the Origins of Mistrust in Africa.” American Economic Review 101(7): 3221–52.](https://dash.harvard.edu/bitstream/handle/1/11986331/nunn-slave-trade.pdf)
Nunn and Wantchekon use a variety of statistical tools to make their case (adding controls, ordered logit, instrumental variables, falsification tests, causal mechanisms), many of which will be covered in future courses. In this module we will only touch on their first set of analyses, which use Ordinary Least Squares (OLS). OLS is likely the most common application of linear algebra in the social sciences. We will cover some linear algebra, matrix manipulation, and vector manipulation using this data.
### Where are we? Where are we headed? {.unnumbered}
Up till now, you should have covered:
- R basic programming
- Data Import
- Statistical Summaries.
Today we'll cover
- Matrices & Dataframes in R
- Manipulating variables
- And other `R` tips
## Read Data
```{r}
library(haven)
nunn_full <- read_dta("data/input/Nunn_Wantchekon_AER_2011.dta")
```
Nunn and Wantchekon's main dataset has more than 20,000 observations. Each observation is a respondent from the Afrobarometer survey.
```{r}
head(nunn_full)
colnames(nunn_full)
```
First, let's consider a small subset of this dataset.
```{r}
#| include: false
#| eval: false
set.seed(02138)
sample <- sample_n(nunn_full, 10)
sample <- select(sample, trust_neighbors, exports, ln_exports, export_area, ln_export_area)
write_dta(sample, "data/input/Nunn_Wantchekon_sample.dta")
```
```{r}
nunn <- read_dta("data/input/Nunn_Wantchekon_sample.dta")
```
```{r}
nunn
```
## data.frame vs. matrices
This is a `data.frame` object.
```{r}
class(nunn)
```
But it can also be considered a matrix in the linear algebra sense. What are the dimensions of this matrix?
```{r}
nrow(nunn)
ncol(nunn)
```
`data.frame`s and matrices have much overlap in `R`, but to explicitly treat an object as a matrix, you'd need to coerce its class. Let's call this matrix `X`.
```{r}
X <- as.matrix(nunn)
```
What is the difference between a `data.frame` and a matrix? A `data.frame` can have columns that are of different types, whereas --- in a matrix --- all columns must be of the same type (usually either "numeric" or "character").
You can think of data frames maybe as matrices-plus, because a column can take on characters as well as numbers. As we just saw, this is often useful for real data analyses.
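To make the type difference concrete, here is a minimal sketch using a made-up data frame (`toy_df` and its columns are hypothetical and not part of the Nunn and Wantchekon data). It suggests that coercing a mixed-type data frame with `as.matrix()` pushes every column to one common type.

```{r}
# Hypothetical toy data: one integer column and one character column
toy_df <- data.frame(id = 1:3, group = c("a", "b", "c"))

# In the data frame, each column keeps its own type
sapply(toy_df, class)

# As a matrix, all entries must share a type, so the numbers become characters
toy_mat <- as.matrix(toy_df)
typeof(toy_mat)
dim(toy_mat)
```

This is why the all-numeric `nunn` sample converts cleanly to a numeric matrix, while a data frame with text columns would not.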
Another way to think about a data frame is as a type of list. Try the `str()` code below and notice how it is organized in slots. Each slot is a vector, which can hold numbers or characters.
```{r}
#| eval: false
# enter this on your console
# (cen10 is read in under "Matrix Basics" below, so load it first)
str(cen10)
```
## Handling matrices in `R`
You can easily transpose a matrix.
```{r}
X
t(X)
```
What are the values of all rows in the first column?
```{r}
X[, 1]
```
What are all the values of "exports"? (i.e. return the whole "exports" column)
```{r}
X[, "exports"]
```
What is the first observation (i.e. first row)?
```{r}
X[1, ]
```
What is the value of the first variable of the first observation?
```{r}
X[1, 1]
```
Pause and consider the following problem on your own. What is the following code doing?
```{r}
X[X[, "trust_neighbors"] == 0, "export_area"]
```
Why does it give the same output as the following?
```{r}
X[which(X[, "trust_neighbors"] == 0), "export_area"]
```
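The comparison between logical indexing and `which()` can be sketched on a small made-up matrix (`toy_flags` and its column names are hypothetical). The comparison `== 0` produces a vector of `TRUE`/`FALSE` values, and `which()` turns that vector into the positions of the `TRUE` entries. Either form can be used inside `[ ]`.

```{r}
# Hypothetical 4 x 2 matrix with named columns
toy_flags <- cbind(flag = c(0, 1, 0, 2), value = c(10, 20, 30, 40))

keep <- toy_flags[, "flag"] == 0  # logical vector, one entry per row
keep
which(keep)                       # integer positions of the TRUE entries

# Both index styles select the same rows
identical(toy_flags[keep, "value"], toy_flags[which(keep), "value"])
```

One caveat, as far as base R indexing goes: if the comparison column contains `NA`, logical indexing returns `NA` entries for those rows, whereas `which()` silently drops them.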
Some more manipulation:
```{r}
X + X
```
```{r}
X - X
```
```{r}
t(X) %*% X
```
```{r}
cbind(X, 1:10)
```
```{r}
cbind(X, 1)
```
```{r}
colnames(X)
```
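To see what these operations do to dimensions, here is a small sketch on a made-up matrix (`toy_m` is hypothetical). It assumes nothing beyond base R: `+` and `-` work element by element, `%*%` is the matrix product, and `cbind()` recycles a single value down a new column.

```{r}
# Hypothetical 3 x 2 matrix, filled column by column
toy_m <- matrix(1:6, nrow = 3, ncol = 2)

toy_m + toy_m            # element-wise sum, still 3 x 2
dim(t(toy_m) %*% toy_m)  # (2 x 3) times (3 x 2) gives 2 x 2
dim(cbind(toy_m, 1))     # the single 1 is recycled down a new third column
```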
## Variable Transformations
`exports` is the total number of slaves that were taken from the individual's ethnic group across Africa's four slave trades between 1400 and 1900.
What is `ln_exports`? The article describes this as the natural log of one plus `exports`. This is a transformation of one column by a particular function.
```{r}
log(1 + X[, "exports"])
```
Question for you: why add the 1?
Verify that this is the same as `X[, "ln_exports"]`.
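One possible pattern for this kind of check, sketched on made-up numbers (`toy_exports` is hypothetical), is to compare two numeric vectors with `all.equal()`, which tolerates tiny floating-point differences that `==` would flag. Here the hand-written transformation is compared against base R's `log1p()`.

```{r}
# Hypothetical export counts, including a zero
toy_exports <- c(0, 10, 250.5)

by_hand  <- log(1 + toy_exports)  # the transformation written out
built_in <- log1p(toy_exports)    # base R shortcut for log(1 + x)

# TRUE when the two vectors agree up to a small numerical tolerance
all.equal(by_hand, built_in)
```

For the real data you would put the stored column on one side of the comparison. Any names attached to the two vectors may need to match, or be removed with `unname()`, for the check to return `TRUE`.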
## Linear Combinations
In Table 1 we see "OLS Estimates". These are estimates of OLS coefficients and standard errors. You do not need to know what these are for now, but it doesn't hurt to get used to seeing them.
![](images/nunn_wantchekon_table1.png)
A very crude way to describe regression is through linear combinations. The simplest linear combination is a one-to-one transformation.
Take the first number in Table 1, which is -0.00068. Now, multiply this by `exports`.
```{r}
-0.00068 * X[, "exports"]
```
Now, just one more step. Make a new matrix with just a column of 1s and `exports`.
```{r}
X2 <- cbind(1, X[, "exports"])
```
Name the new column of 1s "intercept".
```{r}
colnames(X2)
```
```{r}
colnames(X2) <- c("intercept", "exports")
```
What are the dimensions of the matrix `X2`?
```{r}
dim(X2)
```
Now consider a new matrix, called `B`.
```{r}
B <- matrix(c(1.62, -0.00068))
```
What are the dimensions of `B`?
```{r}
dim(B)
```
What is the product of `X2` and `B`? From the dimensions, can you tell if it will be conformable?
```{r}
X2 %*% B
```
What is this multiplication doing in terms of equations?
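One way to write this product out, assuming the 10-row sample and the two values stored in `B` above, is the following. The subscript on $\text{exports}$ indexes respondents and is notation introduced here for illustration.

$$
\mathbf{X}_2 \mathbf{B} =
\left[\begin{array}{rr}
1 & \text{exports}_1\\
\vdots & \vdots\\
1 & \text{exports}_{10}
\end{array}\right]
\left[\begin{array}{r}
1.62\\
-0.00068
\end{array}\right]
=
\left[\begin{array}{c}
1.62 - 0.00068 \, \text{exports}_1\\
\vdots\\
1.62 - 0.00068 \, \text{exports}_{10}
\end{array}\right]
$$

Each row of the result is the first entry of `B` (the intercept) plus the second entry times that respondent's `exports` value, so the product is a 10 by 1 matrix.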
```{r}
#| echo: false
#| eval: false
## FYI regression in Table 1 (without cluster SEs)
form <- "trust_neighbors ~ exports + age + age2 + male + urban_dum + factor(education) + factor(occupation) + factor(religion) + factor(living_conditions) + district_ethnic_frac + frac_ethnicity_in_district + isocode"
lm_1_1 <- lm(as.formula(form), nunn_full)
summary(lm_1_1)
```
## Matrix Basics
Let's take a look at matrices in the context of `R`.
```{r}
#| message: false
cen10 <- read_csv("data/input/usc2010_001percent.csv")
head(cen10)
```
What is the dimension of this dataframe? What does the number of rows represent? What does the number of columns represent?
```{r}
#| message: false
dim(cen10)
nrow(cen10)
ncol(cen10)
```
What variables does this dataset hold? What kind of information does it have?
```{r}
#| message: false
colnames(cen10)
```
We can access column vectors (vectors that contain the values of a variable) by using the \$ sign.
```{r}
#| message: false
head(cen10$state)
head(cen10$race)
```
We can look at a unique set of variable values by calling the unique function
```{r}
#| message: false
unique(cen10$state)
```
How many different states are represented (this dataset includes DC as a state)?
```{r}
#| message: false
length(unique(cen10$state))
```
In the mathematical sense, matrices are rectangular arrays of numbers. (An `R` matrix object can also hold characters, but then every entry must be a character.)
A cross-tab can be considered a matrix:
```{r}
table(cen10$race, cen10$sex)
```
```{r}
cross_tab <- table(cen10$race, cen10$sex)
dim(cross_tab)
cross_tab[6, 2]
```
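As a small illustration of indexing a cross-tab like a matrix, here is a sketch on made-up vectors (`toy_race` and `toy_sex` are hypothetical and far smaller than `cen10`). A table can be indexed by position, as above, or by the row and column labels, which is often easier to read.

```{r}
# Hypothetical categorical vectors of equal length
toy_race <- c("White", "Black", "White", "Asian")
toy_sex  <- c("Female", "Male", "Male", "Female")

toy_tab <- table(toy_race, toy_sex)
toy_tab

dim(toy_tab)              # categories of the first variable by categories of the second
toy_tab["White", "Male"]  # index by label instead of by position
```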
But a subset of your data -- individual values -- can be considered a matrix too.
```{r}
#| warning: false
# First 20 rows of the entire data
# Below two lines of code do the same thing
cen10[1:20, ]
cen10 |> slice(1:20)
# Of the first 20 rows of the entire data, look at values of just race and age
# Below two lines of code do the same thing
cen10[1:20, c("race", "age")]
cen10 |>
  slice(1:20) |>
  select(race, age)
```
A vector is a special type of matrix with only one column or only one row.
```{r}
# One column
cen10[1:10, c("age")]
cen10 |>
  slice(1:10) |>
  select(c("age"))
# One row
cen10[2, ]
cen10 |> slice(2)
```
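A one-column selection is not always a plain vector in `R`. The sketch below uses a made-up base `data.frame` (`toy_people` is hypothetical) to contrast a one-column data frame with the vector stored inside it. Tibbles, which is what `read_csv()` returns, keep single-column `[ , ]` selections as tibbles by default, so `[[ ]]` or `$` is the usual way to pull out the vector itself.

```{r}
# Hypothetical data frame with one numeric and one character column
toy_people <- data.frame(age = c(23, 41, 67), race = c("A", "B", "C"))

one_col <- toy_people[, "age", drop = FALSE]  # still a data frame: 3 rows, 1 column
age_vec <- toy_people[["age"]]                # a plain numeric vector

dim(one_col)
is.data.frame(one_col)
length(age_vec)
is.numeric(age_vec)
```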
What if we want a special subset of the data? For example, what if I only want the records of individuals in California? What if I just want the age and race of individuals in California?
```{r}
# subset for CA rows
ca_subset <- cen10[cen10$state == "California", ]
ca_subset_tidy <- cen10 |> filter(state == "California")
# Note: all_equal() is deprecated as of dplyr 1.1.0; it still runs but may warn
all_equal(ca_subset, ca_subset_tidy)
# subset for CA rows and select age and race
ca_subset_age_race <- cen10[cen10$state == "California", c("age", "race")]
ca_subset_age_race_tidy <- cen10 |>
  filter(state == "California") |>
  select(age, race)
all_equal(ca_subset_age_race, ca_subset_age_race_tidy)
```
Below are some common operators that can be used to filter or to build a condition. Remember, you can use the `unique()` function to look at the set of all values a variable holds in the dataset.
```{r}
# all individuals older than 30 and younger than 70
s1 <- cen10[cen10$age > 30 & cen10$age < 70, ]
s2 <- cen10 |> filter(age > 30 & age < 70)
all_equal(s1, s2)
# all individuals in either New York or California
s3 <- cen10[cen10$state == "New York" | cen10$state == "California", ]
s4 <- cen10 |> filter(state == "New York" | state == "California")
all_equal(s3, s4)
# all individuals in any of the following states: California, Ohio, Nevada, Michigan
s5 <- cen10[cen10$state %in% c("California", "Ohio", "Nevada", "Michigan"), ]
s6 <- cen10 |> filter(state %in% c("California", "Ohio", "Nevada", "Michigan"))
all_equal(s5, s6)
# all individuals NOT in any of the following states: California, Ohio, Nevada, Michigan
s7 <- cen10[!(cen10$state %in% c("California", "Ohio", "Nevada", "Michigan")), ]
s8 <- cen10 |> filter(!state %in% c("California", "Ohio", "Nevada", "Michigan"))
all_equal(s7, s8)
```
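The conditions above all boil down to logical vectors. A minimal sketch on a made-up character vector (`toy_state` and `drop_states` are hypothetical) shows what `%in%` returns and that negating it works with or without the extra parentheses, since `%in%` is evaluated before `!`.

```{r}
# Hypothetical state names and a set of states to exclude
toy_state   <- c("Ohio", "Texas", "Nevada", "Maine")
drop_states <- c("Ohio", "Nevada")

toy_state %in% drop_states  # one TRUE/FALSE per element of toy_state

# The two negated forms give the same logical vector
identical(!toy_state %in% drop_states, !(toy_state %in% drop_states))

# Use the logical vector to keep only the states not in the set
toy_state[!(toy_state %in% drop_states)]
```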
## Checkpoint {.unnumbered}
#### 1 {.unnumbered}
Get the subset of cen10 for non-white individuals (Hint: look at the set of values for the race variable by using the unique function)
```{r}
# Enter here
```
#### 2 {.unnumbered}
Get the subset of cen10 for females over the age of 40
```{r}
# Enter here
```
#### 3 {.unnumbered}
Get all the serial numbers for black, male individuals who don't live in Ohio or Nevada.
```{r}
# Enter here
```
## Exercises {.unnumbered}
#### 1 {.unnumbered}
Let
$$
\mathbf{A} = \left[\begin{array}{rr}
0.6 & 0.2\\
0.4 & 0.8
\end{array}\right]
$$
Use R to write code that will create the matrix $\mathbf{A}$, and then repeatedly multiply $\mathbf{A}$ by itself to compute $\mathbf{A}^{4}$, the product of four copies of $\mathbf{A}$. What is the value of $\mathbf{A}^{4}$?
```{r}
## Enter yourself
```
Note that R notation for matrices differs from the math notation. Simply trying `X^n`, where `X` is a matrix, only raises each element to the power `n`. Instead, this problem asks you to perform matrix multiplication.
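The difference can be seen on a small made-up matrix (`toy_sq` is hypothetical and deliberately not the matrix from the exercise): `^` squares each entry, while `%*%` carries out the row-by-column matrix product.

```{r}
# Hypothetical 2 x 2 matrix, filled column by column: rows are (1, 0) and (1, 1)
toy_sq <- matrix(c(1, 1, 0, 1), nrow = 2)

toy_sq^2           # element-wise: each entry squared
toy_sq %*% toy_sq  # matrix product: the lower-left entry becomes 2

# The two results differ
identical(toy_sq^2, toy_sq %*% toy_sq)
```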
#### 2 {.unnumbered}
Let's apply what we learned about subsetting or filtering/selecting. Use the `nunn_full` dataset you have already loaded
a) First, show all observations (rows) that have a `"male"` variable higher than 0.5
```{r}
## Enter yourself
```
b) Next, create a matrix / dataframe with only two columns: `"trust_neighbors"` and `"age"`
```{r}
## Enter yourself
```
c) Lastly, show all values of `"trust_neighbors"` and `"age"` for observations (rows) that have a `"male"` variable higher than 0.5
```{r}
## Enter yourself
```
#### 3 {.unnumbered}
Find a way to generate a vector of "column averages" of the matrix `X` from the Nunn and Wantchekon data in one line of code. Each entry in the vector should contain the sample average of the values in the column. So a 100 by 4 matrix should generate a length-4 vector.
#### 4 {.unnumbered}
Similarly, generate a vector of "column medians".
#### 5 {.unnumbered}
Consider the regression that was run to generate Table 1:
```{r}
form <- "trust_neighbors ~ exports + age + age2 + male + urban_dum + factor(education) + factor(occupation) + factor(religion) + factor(living_conditions) + district_ethnic_frac + frac_ethnicity_in_district + isocode"
lm_1_1 <- lm(as.formula(form), nunn_full)
# The coef() function below returns a vector of OLS coefficients
coef(lm_1_1)
```
First, get a small subset of the `nunn_full` dataset. This time, sample 20 rows and select the variables `exports`, `age`, `age2`, `male`, and `urban_dum`. To this small subset, add a column of 1's (using `bind_cols()` in the tidyverse or `cbind()` in base R); this represents the intercept. If you need some guidance, look at how we sampled 10 rows and selected a different set of variables in the lecture portion above.
```{r}
# Enter here
```
Next, let's try calculating predicted values of levels of trust in neighbors by multiplying the coefficients for the intercept, `exports`, `age`, `age2`, `male`, and `urban_dum` by the actual observed values of those variables in the small subset you've just created.
```{r}
# Hint: You can get just selected elements from the vector returned by coef(lm_1_1)
# For example, the below code gives you the first 3 elements of the original vector
coef(lm_1_1)[1:3]
# Also, the below code gives you the coefficient elements for intercept and male
coef(lm_1_1)[c("(Intercept)", "male")]
```