-
Notifications
You must be signed in to change notification settings - Fork 1
/
Copy pathindex.Rmd
257 lines (172 loc) · 23.4 KB
/
index.Rmd
1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
---
title: "Lab notes for Statistics for Social Sciences II: Multivariate Techniques"
subtitle: "BSc in International Studies and BSc in International Studies & Political Science, Carlos III University of Madrid"
author: "Eduardo García Portugués"
date: "`r Sys.Date()`, v12.3"
knit: "bookdown::render_book"
documentclass: book
bibliography: SSS2-UC3M.bib
biblio-style: apalike
link-citations: yes
site: bookdown::bookdown_site
---
# Introduction {#intro}
<!--
(Move up once decided what options)
cover-image: no
description: "Lab notes for Statistics for Social Sciences II: Multivariate Techniques"
github-repo: egarpor/SSS2-UC3M
apple-touch-icon: "touch-icon.png"
apple-touch-icon-size: 120
favicon: "favicon.ico"
-->
```{block2, type = 'rmdcaution'}
The **animations** of these notes will not be displayed the first time they are browsed^[The reason is because they are hosted at `https` websites with auto-signed SSL certificates.]. See for example Figure \@ref(fig:leastsquares). **To see them**, click on the caption's link *"Application also available [here](https://ec2-35-177-34-200.eu-west-2.compute.amazonaws.com/least-squares/)"*. You will get a warning from your browser saying that *"Your connection is not private"*. Click in *"Advanced"* and **allow an exception** in your browser (I guarantee you I will not do anything evil!). The next time the animation will show up correctly within the notes.
```
Welcome to the lab notes for *Statistics for Social Sciences II: Multivariate Techniques*. Along these notes we will see how to effectively implement the statistical methods presented in the lectures. The exposition we will follow is based on learning by analyzing datasets and **real-case studies**, always with the help of **statistical software**. While doing so, we will illustrate the key insights of some multivariate techniques and the adequate use of advanced statistical software.
Be advised that these notes are *neither an exhaustive, rigorous nor comprehensive treatment* of the broad statistical branch know as *Multivariate Analysis*. They are just a helpful resource to implement the specific topics covered in this limited course.
## Some course logistics {#intro-logistics}
- Lessons.
- International Studies, group 55. **Mondays 16:15-17:45 at INF-15.S.06**.
- International Studies, group 56. **Tuesdays 18:00-19:30 at INF-10.0.29**.
- International Studies & Political Science, group 45. **Mondays 18:00-19:30 at INF-15.S.04**.
- Office hours. **Tuesdays 11:00-13:00 at office 10.0.10** (access through 10.0.7 in *Campomanes* building). If they are incompatible with your schedule, send me an email to get an alternative appointment (preferable) or just drop by my office to see if I am available (I will remove this option if I get overwhelmed with queries).
- Grading.
- Continuous evaluation is 60% of the final grade. Scored by 2 partial exams and 1 group project, each accounting for a 20%.
- Final exam is 40%, covers contents mostly from the theoretical lectures.
- Partials. Cover contents from labs and lectures. The first will (likely) cover Topics 1-2. The second, Topics 3-5.
- Group project. The students must team up in groups of 4 ($\pm1$) in order to produce a report. This report must apply the methodology presented to a dataset of their choice. As a rule of thumb, all students in a group will be graded evenly and according to the ratio "quality project"/"group size". Specific details in Appendix \@ref(appendix-project).
## Software employed {#intro-software}
The software we will employ in this course is available in all UC3M computer labs, including INF-15.S.04, INF-15.S.06 and INF-10.0.29 (the first week only in INF-15.S.04). We will use two pieces of software:
- [`R`](https://www.r-project.org/). A free open-source software environment for statistical computing and graphics. Virtually all the statistical methods you can think of are available in `R`. Currently, it is *the* dominant statistical software (at least among statisticians).
- [`R Commander`](http://socserv.mcmaster.ca/jfox/Misc/Rcmdr/). A Graphical User Interface (GUI) designed to make `R` accessible to non-specialists through friendly menus. Essentially, it translates simple menu instructions into `R` code.
The only thing you need to do to run `R Commander` in any UC3M computer is:
1. Run `'Start' -> 'R3.3.1' -> 'R3.3.1 (consola)'`. A black console will open. Do not panic!
2. Type inside
```{r, eval = FALSE}
library(Rcmdr)
```
Congratulations on your first piece of `R` code, you have just loaded a package!
3. If `Rcmdr` is installed, then `R Commander` will open automatically and you are ready to go. In case you accidentally close `R Commander`, type
```{r, eval = FALSE}
Commander()
```
If `Rcmdr` is not installed, then type
```{r, eval = FALSE}
install.packages("Rcmdr", dep = TRUE)
```
and say `'Yes'` to the next pop-ups regarding the installation of the personal library. This will download and install `Rcmdr` and all the related packages from a CRAN mirror (`'Spain (A Coruña) [https]'`, usually works fine -- try a different one if you experience problems). Wait for the downloading and installation of the packages. When it is done, just type
```{r, eval = FALSE}
library(Rcmdr)
```
and you are ready to go.
An important warning about UC3M computer labs:
```{block, type = 'rmdcaution'}
Every file you save locally (including installed packages) will be wiped out after you close your session. So be sure to save your valuable files at the end of the lesson.
The **exception** is the folder `'C:/TEMP'`, where all the files you save will be accessible for *everyone* that logs in the computer!
```
```{block2, type = 'rmdcaution'}
In UC3M computers, `R` and `R Commander` are only available in Spanish. To have them in English, you need to do a workaround:
1. Create a shortcut to `R` in your desktop. To do so, go to `'C:/Archivos de programa/R/R-3.3.1/bin/', right-click in `'R.exe'` and choose `'Enviar a' -> 'Escritorio (crear acceso directo)'`.
2. Modify the properties of the shortcut. Right-click on the shortcut and choose `'Propiedades'`. Then **append** to the `'Destino'` field the text `Language=en` (separated by a space, see Figure \@ref(fig:lang)). Click `'Aplicar'` and then `'OK'`.
3. Run that shortcut and then type `library(Rcmdr)`.
Tip: if you save the modified shortcut in `'C:/TEMP'`, it could be available the next time you log in.
```
Alternatively, you can bring your own laptop and save all your files in it, see Section \@ref(intro-installation).
## Why this software? {#intro-whysoftware}
There are many advanced commercial statistical software, such as `SPSS`, `Excel` (with commercial add-ons), `Minitab`, `Stata`, `SAS`, etc. We will rely on the combo `R` [@R-base] + `R Commander` [@R-Rcmdr] due to some noteworthy advantages:
1. **Free and open-source**. (Free as in beer, free as in speech.) No software licenses are needed. This means that you can readily use it outside UC3M computer labs, without limitations on the period or purpose of use.
2. **Scalable complexity and extensibility**. `R Commander` creates `R` code that you can see, and eventually understand. Once you begin to get a feeling of it, you will realize that is faster to type the right commands than to navigate through menus. In addition, `R Commander` has 39 high-quality plug-ins (September, 2016), so the procedures available through menus will not fall short easily.
3. **`R` is the leading computer language in statistics**. Any statistical analysis that you can imagine is already available in `R` through its almost 9000 free packages (September, 2016). Some of them contain a good number of ready-to-use datasets or methods for data acquisition from accredited sources.
4. `R Commander` produces **high-quality graphs easily**. `R Commander`, through the plug-in `KMggplot2`, interfaces the `ggplot2` library, which delivers high-quality, publication-level graphs ([sample gallery](http://www.r-graph-gallery.com/portfolio/ggplot2-package/)). It is considered as one of the best and more elegant graphing packages nowadays.
5. **Great report generation**. `R Commander` integrates `R Markdown`, which is a framework able to create `.html`, `.pdf` and `.docx` reports directly from the outputs of `R`. That means you can deliver high-quality, reproducible and beautiful reports with a little amount of effort. For example, these notes have been created with an extension of `R Markdown`.
In summary, `R Commander` eases the learning curve of `R` and provides a powerful way of creating and reporting statistical analyses. An intermediate knowledge in `R Commander` + `R` will improve notably your quantitative skills, therefore making an **important distinction in your graduate profile** (it is a fact that many social scientists tend to lack a proper quantitative formation). So I encourage you to take full advantage of this great opportunity!
## Installation in your own computer {#intro-installation}
You are allowed to bring your own laptop to the labs. This may have a series of benefits, such as admin privileges, saving all your files locally and a deeper familiarization with the software. But keep in mind:
```{block, type = 'rmdcaution'}
If you plan to use your personal laptop, **you** are responsible for the right setup of the software (and laptop) *prior* to the lab lesson.
```
Regardless of your choice, at some point you will probably need to run the software outside UC3M computer labs. This is what you have to do in order to install `R` + `R Commander` in your own computer:
1. In Mac OS X, download and install first [`XQuartz`](https://www.xquartz.org/) and log out and back on your Mac OS X account (this is an **important** step). Be sure that your Mac OS X system is up-to-date.
2. Download the latest version of `R` for [Windows](https://cran.r-project.org/bin/windows/base/release.html) or [Mac OS X](https://cran.r-project.org/bin/macosx/R-3.3.2.pkg).
3. Install `R`. In Windows, be sure to select the `'Startup options'` and then choose `'SDI'` in the `'Display Mode'` options. Leave the rest of installation options as default.
4. Open `R` (`'R x64 X.X.X'` in 64-bit Windows, `'R i386 X.X.X'` in 32-bit Windows and `'R.app'` in Mac OS X) and type:
```{r, eval = FALSE}
install.packages(c("Rcmdr", "RcmdrMisc", "RcmdrPlugin.TeachingDemos",
"RcmdrPlugin.FactoMineR", "RcmdrPlugin.KMggplot2"),
dep = TRUE)
```
Say `'Yes'` to the pop-ups regarding the installation of the personal library and choose the CRAN mirror (the server from which you will download packages). `'Spain (A Coruña) [https]'`, usually works fine -- try a different one if you experience problems.
5. To launch the `R Commander`, run `R` and then
```{r, eval = FALSE}
library(Rcmdr)
```
```{block, type = 'rmdcaution'}
**Mac OS X users**. To prevent an occasional freezing of `R` and `R Commander` by the OS, go to `'Tools' -> 'Manage Mac OS X app nap for R.app...'` and select `'off (recommended)'`.
```
If there is any Linux user, kindly follow the corresponding instructions [here](https://cran.r-project.org/) and [here](http://socserv.mcmaster.ca/jfox/Misc/Rcmdr/installation-notes.html).
By default `R` and `R Commander` will have menus and messages in the language of your OS. If you want them in English, a simple option is to change the OS language to English and reboot. If you want to stick with your OS language, another options are:
(ref:langtitle) Modification of the `R` shortcut properties in Windows.
- **Windows**. Create a shortcut in your desktop, either to `'R x64 X.X.X'` or to `'R i386 X.X.X'`. Add a distinctive descriptor to its name, for example `'R x64 X.X.X ENGLISH'`. Then `'Right mouse click'` on it, select `'Properties'` and **append** to the `'Target'` field the text `Language=en` (separated by a space, see Figure \@ref(fig:lang)). Analogously, use `Language=es` for Spanish, `Language=it` for Italian, etc. Use this shortcut to launch `R` (and then `R Commander`) in the chosen language. Click `'Apply'` and then `'OK'`.
```{r, lang, echo = FALSE, out.width = '45%', fig.align = 'center', fig.pos = 'h!', fig.cap = '(ref:langtitle)', cache = TRUE}
knitr::include_graphics("images/screenshots/lang.png")
```
- **Mac OS X**. Open `R.app` and simply run
```{r, eval = FALSE}
system("defaults write org.R-project.R force.LANG en_GB.UTF-8")
```
Then close `'R.app'` and relaunch it. Analogously, replace `en_GB` above by `es_ES` or `it_IT` if you want to switch back to Spanish or Italian, for example.
## `R Commander` basics {#intro-RCommander}
(ref:ScreenshotRcmdrtitle) Main window of `R Commander`, with the plug-ins `RcmdrPlugin.FactoMineR`, `RcmdrPlugin.KMggplot2` and `RcmdrPlugin.Demos` loaded.
```{r, ScreenshotRcmdr, echo = FALSE, out.width = '90%', fig.align = 'center', fig.pos = 'h!', fig.cap = '(ref:ScreenshotRcmdrtitle)', cache = TRUE}
knitr::include_graphics("images/screenshots/Rcmdr2.png")
```
When you start `R Commander` you will see a window similar to Figure \@ref(fig:ScreenshotRcmdr). This GUI has the following items:
1. **Drop-down menus**. They are pretty self-explanatory. A quick summary:
- `'File'`. Saving options for different files (`.R`, `.Rmd`, `.txt` output and `.RData`). The latter corresponds to saving the *workspace* of the session.
- `'Edit'`. Basic edition within the GUI text boxes (`'Copy'`, `'Paste'`, `'Undo'`, ...).
- `'Data'`. Import, manage and manipulate datasets.
- `'Statistics'`. Perform statistical analyses, such as `'Summaries'` and `'Fit models'`.
- `'Graphs'`. Compute the available graphs (depending on the kind of variables) for the active dataset.
- `'Models'`. Graphical, numerical and inferential analyses for the active model.
- `'Distributions'`. Operations for continuous/discrete distributions: sampling, computation of probabilities and quantiles, and plotting of density and distribution functions.
- `'Tools'`. Options for `R Commander`. Here you can `'Load Rcmdr plug-in(s)...'`, which enables to expand the number of menus available via plug-ins<!--^[For example, `'FactoMineR'` (collection of multivariate techniques), `'KMggplot2'` (produces elegant graphs using `ggplot2`) and `'Demos'` (set of pedagogical demos; `'Simple linear regression'` is specially relevant for the next chapter).]-->. `R Commander` will need to restart prior to loading a plug-in. (A minor inconvenience is that the text boxes in the GUI will be wiped out after restarting. The workspace is kept after restarting, so the models and datasets will be available -- but not selected as active.)
- `'Help'`. Several help resources.
2. **Dataset manipulation**. Select the active dataset among the list of loaded datasets. Edit (very basic) and view the active dataset.
3. **Model selector**. Select an active model among the available ones to work with.
4. **Switch to script or report mode**. Switch between the generated `R` code and the associated `R Markdown` code.
5. **Input panel**. The `R` code generated by the drop-down menus appears here and is passed to the output console. You can type code here and run it without using the drop-down menus.
6. **Submit/Generate report buttom**. Allows to pass and run *selected* `R` code to the output console (keyboard shortcut: `'Control' + 'R'`).
7. **Output console**. Shows the commands that were run (red) and their visible result (blue), if any.
8. **Messages**. Displays possible error messages, warnings and notes.
When you close `R Commander` you will be asked on what files to save: `'script file'` and `'R Markdown file'` (contents in the two tabs of panel 5, respectively), and `'output file'` (panel 7). If you want to save the workspace, you have to do it through the `'File'` menu.
```{block, type = 'rmdtip'}
Focus on **understanding the purpose of each element in the GUI**. When performing the real-case studies we will take care of explaining the many features of `R Commander` step by step.
```
## Datasets for the course {#intro-datasets}
This is a handy list with a small description and download link for all the relevant datasets used in the course. To download them, simply **save the link as a file** in your browser.
- `pisa.csv` ([download](https://raw.githubusercontent.com/egarpor/SSS2-UC3M/master/datasets/pisa.csv)). Contains 65 rows corresponding to the countries that took part on the PISA study. Each row has the variables `Country`, `MeanMath`,` MathShareLow`, `MathShareTop`, `ReadingMean`, `ScienceMean`, `GDPp`, `logGDPp` and `HighIncome`. The `logGDPp` is the logarithm of the `GDPp`, which is taken in order to avoid scale distortions.
- `US_apportionment.xls` ([download](https://raw.githubusercontent.com/egarpor/SSS2-UC3M/master/datasets/US_apportionment.xls)). Contains the 50 US states entitled to representation in the US House of Representatives. The recorded variables are `State`, `Population2010` and `Seats2013–2023`.
- `EU_apportionment.txt` ([download](https://raw.githubusercontent.com/egarpor/SSS2-UC3M/master/datasets/EU_apportionment.txt)). Contains 28 rows with the member states for the EU (`Country`), the number of seats assigned under different years (`Seats2011`, `Seats2014`), the Cambridge Compromise apportionment (`CamCom2011`), and the states population (`Population2010`, `Population2013`).
- `least-squares.RData` ([download](https://raw.githubusercontent.com/egarpor/SSS2-UC3M/master/datasets/least-squares.RData)). Contains a single `data.frame`, named `leastSquares`, with 50 observations of the variables `x`, `yLin`, `yQua` and `yExp`. These are generated as $X\sim\mathcal{N}(0,1)$, $Y_\mathrm{lin}=-0.5+1.5X+\varepsilon$, $Y_\mathrm{qua}=-0.5+1.5X^2+\varepsilon$ and $Y_\mathrm{exp}=-0.5+1.5\cdot2^X+\varepsilon$, with $\varepsilon\sim\mathcal{N}(0,0.5^2)$. The purpose of the dataset is to illustrate the least squares fitting.
- `assumptions.RData` ([download](https://raw.githubusercontent.com/egarpor/SSS2-UC3M/master/datasets/assumptions.RData)). Contains the data frame `assumptions` with 200 observations of the variables `x1`, ..., `x9` and `y1`, ..., `y9`. The purpose of the dataset is to identify which regression `y1 ~ x1`, ..., `y9 ~ x9` fulfills the assumptions of the linear model. The dataset `moreAssumptions.RData` ([download](https://raw.githubusercontent.com/egarpor/SSS2-UC3M/master/datasets/moreAssumptions.RData)) has the same structure.
- `cpus.txt` ([download](https://raw.githubusercontent.com/egarpor/SSS2-UC3M/master/datasets/cpus.txt)) and `gpus.txt` ([download](https://raw.githubusercontent.com/egarpor/SSS2-UC3M/master/datasets/gpus.txt)). The datasets contain 102 and 35 rows, respectively, of commercial CPUs and GPUs appeared since the first models up to nowadays. The variables in the datasets are `Processor`, `Transistor count`, `Date of introduction`, `Manufacturer`, `Process` and `Area`.
- `hap.txt` ([download](https://raw.githubusercontent.com/egarpor/SSS2-UC3M/master/datasets/hap.txt)). Contains data for 20 advanced economies in the time period 1946–2009, measured for 31 variables. Among those, the variable `dRGDP` represents the real GDP growth (as a percentage) and `debtgdp` represents the percentage of public debt with respect to the GDP.
- `wine.csv` ([download](https://raw.githubusercontent.com/egarpor/SSS2-UC3M/master/datasets/wine.csv)). The dataset is formed by the auction `Price` of 27 red Bordeaux vintages, five vintage descriptors (`WinterRain`, `AGST`, `HarvestRain`, `Age`, `Year`) and the population of France in the year of the vintage, `FrancePop`.
- `Boston.xlsx` ([download](https://raw.githubusercontent.com/egarpor/SSS2-UC3M/master/datasets/Boston.xlsx)). The dataset contains 14 variables describing 506 suburbs in Boston. Among those variables, `medv` is the median house value, `rm` is the average number of rooms per house and `crim` is the per capita crime rate. The full description is available in `?Boston`.
- `assumptions3D.RData` ([download](https://raw.githubusercontent.com/egarpor/SSS2-UC3M/master/datasets/assumptions3D.RData)). Contains the data frame `assumptions3D` with 200 observations of the variables `x1.1`, ..., `x1.8`, `x2.1`, ..., `x2.8` and `y.1`, ..., `y.8`. The purpose of the dataset is to identify which regression `y.1 ~ x1.1 + x2.1`, ..., `y.8 ~ x1.8 + x2.8` fulfills the assumptions of the linear model.
- `challenger.txt` ([download](https://raw.githubusercontent.com/egarpor/SSS2-UC3M/master/datasets/challenger.txt)). Contains data for 23 Space-Shuttle launches. The data consists of 23 shuttle flights. There are 8 variables. Among them: `temp`, the temperature in Celsius degrees at the time of launch, and `fail.field` and `fail.nozzle`, indicators of whether there were an incidents in the O-rings of the field joints and nozzles of the solid rocket boosters.
- `eurojob.txt` ([download](https://raw.githubusercontent.com/egarpor/SSS2-UC3M/master/datasets/eurojob.txt)). Contains data for employment in 26 European countries. There are 9 variables, giving the percentage of employments in 9 sectors: `Agr` (Agriculture), `Min` (Mining), `Man` (Manufacture), `Pow` (Power), `Con` (Construction), `Ser` (Sevices), `Fin` (Finance), `Soc` (Social) and `Tra` (Transport).
- `Chile.txt` ([download](https://raw.githubusercontent.com/egarpor/SSS2-UC3M/master/datasets/Chile.txt)). Contains data for 2700 respondents on a survey for the voting intentions in the 1988 Chilean national plebiscite. There are 8 variables: `region`, `population`, `sex`, `age`, `education`, `income`, `statusquo` (scale of support for the status quo) and `vote`. `vote` is a factor with levels `A` (abstention), `N` (against Pinochet), `U` (undecided), `Y` (for Pinochet). Available in `R` through the package `car` and `data(Chile)`.
- `USArrests.txt` ([download](https://raw.githubusercontent.com/egarpor/SSS2-UC3M/master/datasets/USArrests.txt)). Arrest statistics for `Assault`, `Murder` and `Rape` in each of the 50 US states in 1973. The percent of the population living in urban areas, `UrbanPop`, is also given. Available in `R` through `data(USArrests)`.
- `USJudgeRatings.txt` ([download](https://raw.githubusercontent.com/egarpor/SSS2-UC3M/master/datasets/USJudgeRatings.txt)). Lawyers' ratings of state judges in the US Superior Court. The dataset contains 43 observations of 12 variables measuring the performance of the judge when conducting a trial. Available in `R` through `data(USJudgeRatings)`.
- `la-liga-2015-2016.xlsx` ([download](https://raw.githubusercontent.com/egarpor/SSS2-UC3M/master/datasets/la-liga-2015-2016.xlsx)). Contains 19 performance metrics for the 20 football teams in La Liga 2015/2016.
- `pisaUS2009.csv` ([download](https://raw.githubusercontent.com/egarpor/SSS2-UC3M/master/datasets/pisaUS2009.csv)). Reading score of 3663 US students in the PISA test, with 23 variables informing about the student profile and family background.
## Main references and credits {#intro-credits}
The following great reference books have been used extensively for preparing these notes:
- @James2013 (linear regression, logistic regression, PCA, clustering),
- @Pena2002 (linear regression, logistic regression, PCA, clustering),
- @Bartholomew2008 (PCA).
The icons used in the notes were designed by [madebyoliver](http://www.flaticon.com/authors/madebyoliver), [freepik](http://www.flaticon.com/authors/freepik), and [roundicons](http://www.flaticon.com/authors/roundicons) from [Flaticon](http://www.flaticon.com/).
In addition, these notes are possible due to the existence of these incredible pieces of software: @R-bookdown, @R-knitr, @R-rmarkdown, and @R-base.
All material in these notes is licensed under [CC BY-NC-SA 4.0](https://creativecommons.org/licenses/by-nc-sa/4.0/).