---
title: "Structural topic models for enriching quantitative text analysis<br><small style=margin-top:1cm;>material available at https://github.com/cbpuschmann/stm_ic2s2</small>"
author: "Cornelius Puschmann & Carsten Schwemmer"
output:
  revealjs::revealjs_presentation:
    mathjax: null
    center: yes
    fig_height: 3.5
    fig_width: 7
    reveal_options:
      previewLinks: yes
      slideNumber: yes
    theme: default
    transition: fade
  html_document:
    highlight: tango
    code_folding: show
date: "July 17, 2019"
---
*A huge thanks to Brandon Stewart for maintaining the STM package and for providing some of the content which we use in this tutorial.*
# Setup
## Prepare your R environment
- Download and extract the tutorial content from our [github repository](https://github.com/cbpuschmann/stm_ic2s2)
- Open the R project file `stm_ic2s2.Rproj` in RStudio
## How R Markdown files work
- Code is placed inside code chunks; documentation is placed outside of them.
- Create new code chunks with CTRL/CMD + ALT + I
- Use CTRL/CMD + SHIFT + ENTER to run an entire code chunk
- Use CTRL/CMD + ENTER to run selected code
```{r}
# this is a comment inside a code cell
2+3
5+2
```
## Install packages
- Please install the packages that we will need for the tutorial:
```{r, eval = FALSE}
install.packages(c('tidyverse', 'stm', 'stminsights',
'quanteda', 'rmarkdown'),
dependencies = TRUE)
```
## Loading data and packages
```{r , message=FALSE, warning=FALSE, results = 'hide'}
library(tidyverse)
library(stm)
library(stminsights)
library(quanteda)
library(lubridate)
theme_set(theme_light())
df <- read_csv('data/donors_choose_sample.csv')
```
## The data we use for the workshop
- We will be using a sample from a [Kaggle](https://www.kaggle.com/c/donorschoose-application-screening) Data Science for Good challenge
- [DonorsChoose.org](https://DonorsChoose.org) provided the data and hosts an online platform where teachers can post requests for resources and people can make donations to these projects
- The goal of the original challenge was to match previous donors with campaigns that would most likely inspire additional donations
- The dataset includes texts and context information which might help answer various questions in social science. A description of variables is available [here](https://www.kaggle.com/donorschoose/io/discussion/56030)
## What could we learn from this data?
Examples of questions we might ask:
- How has classroom technology use changed over time? How does it differ by geographic location and the age of students?
- How do the requests of schools in urban areas compare to those in rural areas?
- What predicts whether a project will be funded?
- How do the predictors of funding vary by geographic location? Or by economic status of the students?
- Do male and female teachers ask for different resources? Are there differences in the way that they ask for those resources?
# Formal background of topic models
## Formal background of topic models
See our slides
# Preprocessing and feature selection
## Preprocessing and feature selection
- Due to time constraints, we will not explain the following code for preprocessing and feature selection in detail.
- You can explore it during the open coding session or after the tutorial.
- Use `load("data/stm_donor.RData")` to load the already processed R objects that we need for the tutorial.
## Inspecting the data structure
```{r}
glimpse(df)
```
## Text example for one donation request
```{r}
cat(df$project_essay[1])
```
## Preparing texts
We use a [regular expression](https://en.wikipedia.org/wiki/Regular_expression) to clean up the donation texts:
```{r}
df$project_essay <- str_replace_all(df$project_essay,
'<!--DONOTREMOVEESSAYDIVIDER-->', '\n\n')
```
As we will incorporate contextual information of documents in our STM models, we also need to preprocess other variables.
## Working with time stamps
First, we convert the time strings to a date format and then create a numerical variable, where the earliest date corresponds to 0. We do so because estimating STM effects doesn't play nicely with `date` variables.
```{r}
# CTRL/CMD + SHIFT + M for the pipe operator
df$date <- ymd(df$project_posted_date)
min_date <- min(df$date) %>%
as.numeric()
df$date_num <- as.numeric(df$date) - min_date
date_table <- df %>% arrange(date_num) %>% select(date, date_num)
head(date_table, 2)
```
## Example for recoding variables
We can generate a proxy for the gender of teachers by working with their name prefixes.
```{r}
df %>% count(teacher_prefix)
```
## Example for recoding variables
```{r}
df <- df %>% mutate(gender = case_when(
teacher_prefix %in% c('Mrs.', 'Ms.') ~ 'Female',
teacher_prefix == 'Mr.' ~ 'Male',
TRUE ~ 'Other/Non-binary')) # TRUE -> everything else
df %>% count(gender)
```
## Other interesting variables: metro type
```{r}
df %>% count(school_metro_type)
```
## Other interesting variables: resource type
```{r}
df %>% count(project_resource_category, sort = TRUE)
```
## Other interesting variables: children eligible for free lunch
```{r}
df %>% ggplot(aes(x = school_percentage_free_lunch)) +
geom_histogram(bins = 20)
```
## Text analysis using quanteda
![](https://avatars2.githubusercontent.com/u/34347233?s=200&v=4){width=150px}
- A variety of R packages supports quantitative text analysis. We will focus on [quanteda](https://quanteda.io/), created and maintained by the social scientists behind the Quanteda Initiative
- Besides offering a huge number of methods for preprocessing and analysis, it also includes a function to prepare our textual data for structural topic modeling
## Quanteda corpus object
- Using `corpus()`, you can create a quanteda corpus from a character vector or a data frame, which automatically includes meta data as document variables
```{r}
donor_corp <- corpus(df, text_field = 'project_essay',
docid_field = 'project_id')
docvars(donor_corp)$text <- df$project_essay # we need unprocessed texts later
ndoc(donor_corp) # no. of documents
```
## KWIC
- Before tokenization, corpus objects can be used to discover [keywords in context](https://en.wikipedia.org/wiki/Key_Word_in_Context) (KWIC):
```{r}
kwic_donor <- kwic(donor_corp, pattern = c("ipad"),
window = 5) # context window
head(kwic_donor, 3)
```
## Tokenization
- Tokens can be created from a corpus or character vector. The documentation (`?tokens()`) illustrates several options, e.g. for the removal of punctuation
```{r}
donor_tokens <- tokens(donor_corp)
donor_tokens[[1]][1:20]
```
## Basic form of tokens
- After tokenization, some terms with similar semantic meaning might still be treated as different features (e.g. `love`, `loving`)
- One solution is the application of [stemming](https://en.wikipedia.org/wiki/Stemming), which tries to reduce words to their basic form:
```{r}
words <- c("love", "loving", "lovingly", "loved", "lover", "lovely")
char_wordstem(words, 'english')
```
## To stem or not to stem?
- In the context of topic modeling, a [recent study](https://www.transacl.org/ojs/index.php/tacl/article/view/868/196) suggests that stemmers produce no meaningful improvement (for the English language)
- Ultimately, whether stemming generates useful features or not varies by use case
- An alternative that we won't cover in this course is [lemmatization](https://en.wikipedia.org/wiki/Lemmatisation), available via packages like [spacyr](https://cran.r-project.org/web/packages/spacyr/index.html) and [udpipe](https://cran.r-project.org/web/packages/udpipe/index.html) (a rough sketch follows below)
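As a rough sketch (not run in this tutorial), lemmatization with spacyr could look like the following, assuming spaCy and an English language model are installed (e.g. via `spacy_install()`):
```{r, eval = FALSE}
library(spacyr)
spacy_initialize(model = "en_core_web_sm")
# lemmas collapse inflected forms to a common dictionary form
spacy_parse(c("love", "loving", "loved", "lovely"),
            lemma = TRUE, pos = FALSE, entity = FALSE)
spacy_finalize()
```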
## More preprocessing
- Multiple preprocessing steps can be chained via the pipe operator, e.g. normalizing to lowercase and removing common English stopwords:
```{r}
donor_tokens <- donor_tokens %>%
tokens_tolower() %>%
tokens_remove(stopwords('english'), padding = TRUE)
donor_tokens[[1]][1:10]
```
## Detecting collocations
- Collocations (phrases) are sequences of tokens that carry a shared semantic meaning, e.g. `United States`
- Quanteda can detect collocations with log-linear models. An important parameter is the minimum collocation frequency, which can be used to fine-tune results
```{r}
colls <- textstat_collocations(donor_tokens,
min_count = 200) # minimum frequency
donor_tokens <- tokens_compound(donor_tokens, colls, join = FALSE) %>%
tokens_remove('') # remove empty strings
donor_tokens[[1]][1:5]
```
## Document-Feature Matrix (DFM)
- Most models for automated text analysis require matrices as their input format
- A common variant which directly translates to the bag-of-words format is the [document term matrix](https://en.wikipedia.org/wiki/Document-term_matrix) (in quanteda: document-feature matrix):
doc_id   I   like   hate   currywurst
-------- --- ------ ------ -------------
1        1   1      0      1
2        1   0      1      1
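As a small illustration (the two example sentences below are made up, not part of the donor data), quanteda produces exactly this kind of matrix:
```{r}
# toy example: two short documents and their document-feature matrix
toy <- c(d1 = "I like currywurst", d2 = "I hate currywurst")
dfm(tokens(toy), tolower = FALSE)
```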
## Creating a Document-Feature Matrix (dfm)
- Problem: textual data is high-dimensional -> DFMs can grow to millions of rows & columns -> dense matrices for large text corpora don't fit in memory
- Features are not evenly distributed (see e.g. [Zipf's law](https://en.wikipedia.org/wiki/Zipf%27s_law)), so most cells contain zeroes
- Solution: a sparse data format, which does not store zero counts. Quanteda natively implements DFMs as [sparse matrices](https://en.wikipedia.org/wiki/Sparse_matrix)
## DFMs in quanteda
- Quanteda can create DFMs from character vectors, corpora and token objects
- Preprocessing that does not need to account for word order can also be done during or after the creation of DFMs (see documentation for `tokens()`)
```{r}
dfm_donor <- dfm(donor_tokens, remove_numbers = TRUE)
dim(dfm_donor)
```
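Related to the sparsity discussion above, we can check which share of the matrix cells is zero:
```{r}
# proportion of zero-valued cells in the sparse DFM
sparsity(dfm_donor)
```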
## More preprocessing - feature trimming
- As an alternative (or complement) to manually defining stopwords, terms occurring in a large proportion of documents can be removed automatically. Rationale: if almost every document includes a term, it is not a useful feature for categorization
- Very rare terms are often removed, as they are also not very helpful for categorization and can lead to overfitting
```{r}
dfm_donor <- dfm_donor %>%
  dfm_keep(min_nchar = 2) %>% # remove features with only one character
  dfm_trim(min_docfreq = 0.002, max_docfreq = 0.50, # 0.2% min, 50% max
           docfreq_type = 'prop') # proportions instead of counts
dim(dfm_donor)
```
## Prepare textual data for STM
- You can provide input data for the stm package in several ways:
  - via STM's own functions for text preprocessing
  - by directly passing quanteda DFMs
  - by using quanteda's `convert()` function to prepare DFMs (recommended option)
```{r}
out <- convert(dfm_donor, to = 'stm')
names(out)
```
# Introducing structural topic models and tuning parameters
## Introducing structural topic models and tuning parameters
See our slides
## STM - model fitting
- For our first model, we will choose 30 topics and include school metro type, teacher gender and a flexible [spline](https://en.wikipedia.org/wiki/Spline_(mathematics)) for date as prevalence covariates:
```{r, eval = FALSE}
stm_30 <- stm(documents = out$documents,
vocab = out$vocab,
data = out$meta,
K = 30,
prevalence = ~ school_metro_type + gender + s(date_num),
verbose = TRUE) # show progress
stm_effects30 <- estimateEffect(1:30 ~ school_metro_type +
gender + s(date_num),
stmobj = stm_30, metadata = out$meta)
```
## Saving and restoring models
- Depending on the number of documents and the vocabulary size, fitting STM models can require a lot of memory and computation time
- It can be useful to save model objects as R binaries and reload them as needed:
```{r, eval = FALSE }
save(out, stm_30, stm_effects30, file = "data/stm_donor.RData")
```
```{r}
load("data/stm_donor.RData") # reload data
```
# Model validation and interactively exploring STM models
## Interpreting structural topic models - topic proportions
- `plot.STM()` implements several options for model interpretation.
- *summary* plots show proportions and the most likely terms for each topic:
## Model interpretation - topic proportions
```{r fig.height=5, fig.width=9}
plot.STM(stm_30, type = 'summary', text.cex = 0.8)
```
## Model interpretation - probability terms
`labels` plots show terms for each topic, with (again) the most likely terms as the default:
## Model interpretation - probability terms
```{r fig.height=4, fig.width=7}
plot.STM(stm_30, type = 'labels', n = 8,
text.cex = 0.8, width = 100, topics = 1:5)
```
## Model interpretation - frex terms
One strength of STM is that it also offers other metrics for topic terms. `frex` terms are both frequent and exclusive to a topic.
## Model interpretation - frex terms
```{r fig.height=4, fig.width=7}
plot.STM(stm_30, type = 'labels', n = 8, text.cex = 0.8,
width = 100, topics = 1:5, labeltype = 'frex')
```
## Model interpretation - don't rely on terms only
- Assigning topic labels based only on the most likely terms is generally not a good idea
- Sometimes these terms contain domain-specific stop words; sometimes they are hard to make sense of on their own
- Recommendation (a `labelTopics()` example follows below):
  - use probability (most likely) terms
  - use frex terms
  - **qualitatively examine representative documents**
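A compact way to inspect probability and FREX terms (plus lift and score terms) side by side is `labelTopics()`; a small example:
```{r, eval = FALSE}
# probability, FREX, lift and score terms for the first three topics
labelTopics(stm_30, topics = 1:3, n = 8)
```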
## Model interpretation - representative documents
- STM can retrieve representative (unprocessed) documents for each topic with `findThoughts()`, which can then be plotted with `plotQuote()`:
```{r }
thoughts <- findThoughts(stm_30,
texts = out$meta$text, # unprocessed documents
topics = 1:3, n = 2) # topics and number of documents
```
## Model interpretation - representative documents
```{r fig.height=5, fig.width=9}
plotQuote(thoughts$docs[[3]][1], # topic 3
width = 80, text.cex = 0.75)
```
## Model interpretation - perspective plot
- It is also possible to visualize differences in word usage between two topics:
```{r fig.height=5, fig.width=8}
plot.STM(stm_30, type = 'perspectives', topics = c(2,3))
```
## Interactive model validation - stminsights
- You can interactively validate and explore structural topic models using the R package *stminsights*. What you need:
  - one or several stm models and corresponding effect estimates
  - the `out` object used to fit the models which includes documents, vocabulary and meta-data
- The example `stm_donor.RData` includes all required objects
```{r eval=FALSE, message=FALSE, warning=FALSE}
run_stminsights()
```
# Interpreting and visualizing prevalence and content effects
- You already estimated a model with prevalence effects. Now we'll see how to also estimate content effects and how to visualize prevalence and content effects
- There are several options for interpreting and visualizing effects:
  - using functions of the STM package
  - using the stminsights function `get_effects()`
  - using the stminsights interactive mode
## Prevalence effects (stm package)
## Options for visualizing prevalence effects
- Prevalence covariates affect topic proportions
- They can be visualized in three ways:
  - `pointestimate`: point estimates for categorical variables
  - `difference`: differences between topic proportions for two categories of one variable
  - `continuous`: line plots for continuous variables
- You can also visualize interaction effects if you integrated them in your STM model (see `?plot.estimateEffect()`); a sketch follows below
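As a rough sketch (not evaluated here), an interaction between a categorical moderator and a continuous covariate can be plotted via the `moderator` arguments of `plot.estimateEffect()`; this assumes effect estimates that include such an interaction, e.g. `stm_effects10` from the appendix:
```{r, eval = FALSE}
# hypothetical example: topic trend over time for female teachers only,
# assuming an effect estimate with a gender * s(date_num) interaction
plot.estimateEffect(stm_effects10, covariate = "date_num",
                    topics = 1, method = "continuous",
                    moderator = "gender", moderator.value = "Female")
```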
## Prevalence effects - pointestimate
```{r}
plot.estimateEffect(stm_effects30, topic = 3,
covariate = 'school_metro_type', method = 'pointestimate')
```
## Prevalence effects - difference
```{r}
plot.estimateEffect(stm_effects30, covariate = "gender",
topics = c(5:10), method = "difference",
model = stm_30, # to show labels alongside
cov.value1 = "Female", cov.value2 = "Male",
xlab = "Male <---> Female", xlim = c(-0.08, 0.08),
labeltype = "frex", n = 3,
width = 100, verbose.labels = FALSE)
```
## Prevalence effects - continuous
```{r}
plot.estimateEffect(stm_effects30, covariate = "date_num",
topics = c(9:10), method = "continuous")
```
## Prevalence effects with stminsights
- You can use `get_effects()` to store prevalence effects in a tidy data frame:
```{r}
gender_effects <- get_effects(estimates = stm_effects30,
variable = 'gender',
type = 'pointestimate')
date_effects <- get_effects(estimates = stm_effects30,
variable = 'date_num',
type = 'continuous')
```
- Afterwards, effects can for instance be visualized with `ggplot2`
## Prevalence effects with stminsights - categorical
```{r}
gender_effects %>% filter(topic == 3) %>%
ggplot(aes(x = value, y = proportion)) + geom_point() +
geom_errorbar(aes(ymin = lower, ymax = upper), width = 0.1) +
coord_flip() + labs(x = 'Gender', y = 'Topic Proportion')
```
## Prevalence effects with stminsights - continuous (date)
- STM's built-in plots do not work well for visualizing continuous date variables
- For visualization purposes, we can convert our numeric date identifier back to its original form:
```{r message=FALSE, warning=FALSE}
date_effects <- date_effects %>%
mutate(date_num = round(value, 0)) %>%
left_join(out$meta %>% select(date, date_num)) %>% distinct()
```
## Prevalence effects with stminsights - continuous (date)
```{r fig.height=3, fig.width=7}
date_effects %>% filter(topic %in% c(9,10)) %>%
ggplot(aes(x = date, y = proportion,
group = topic, color = topic, fill = topic)) +
geom_line() + scale_x_date(date_breaks = '3 months', date_labels = "%b/%y") +
geom_ribbon(aes(ymin = lower, ymax = upper), alpha = 0.2) +
theme(axis.text.x = element_text(angle = 45, hjust = 1))
```
## STM content effects
- Content effects allow covariates to affect word distributions **within a topic** (e.g. female teachers talk differently about sports in comparison to male teachers). Example model formula: ``content = ~ gender``
- This feature is powerful but comes with some disadvantages:
  - You can only use one discrete variable for content effects
  - Interpreting the model is more complicated (see `labelTopics()` and `sageLabels()`; a short example follows after the content model is loaded)
- We will focus on visualizing content effects with `perspective` plots
## Fitting content models
- Content effects can (but do not have to) be combined with prevalence effects. We fit a model with 20 topics and teacher gender as the content covariate
- Important note: this is a new model and may show different results, even compared to a model with the same number of topics
```{r, eval = FALSE}
stm_20_content <- stm(documents = out$documents,
vocab = out$vocab,
data = out$meta,
K = 20,
prevalence = ~ school_metro_type + gender + s(date_num),
content = ~ gender,
verbose = FALSE) # suppress progress output
stm_effects20 <- estimateEffect(1:20 ~ school_metro_type +
gender + s(date_num),
stmobj = stm_20_content, metadata = out$meta)
save(stm_20_content, stm_effects20,file = "data/stm_donor_content.RData")
```
## Load content model
```{r}
load("data/stm_donor_content.RData") # reload data
```
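Before plotting, it can help to inspect topic terms by covariate level. A minimal sketch using the functions mentioned above (output not shown):
```{r, eval = FALSE}
# terms per topic, marginally and by gender; sageLabels() is only
# available for models fitted with a content covariate
sageLabels(stm_20_content, n = 8)
```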
## Visualizing content effects
```{r fig.height=6, fig.width=8}
plot.STM(stm_20_content, topics = c(2), type = 'perspectives',
covarlevels = c('Female', 'Male'))
```
# Open coding session
## Open coding session - your turn
- For the open coding session, you can choose to either play around with data and models from the tutorial, or try fitting stm models on your own data
- We will be around to help you out and answer questions
# Appendix - code to play around with
## Appendix - more about comparing models
As for statistical diagnostics, the STM authors recommend inspecting semantic coherence and exclusivity (see the [STM vignette](https://github.com/bstewart/stm/blob/master/inst/doc/stmVignette.pdf?raw=true)):
- Semantic coherence is maximized when the most probable words in a given topic frequently co-occur together
- Exclusivity (FREX) is maximized when a topic includes many exclusive terms
- Coherence and exclusivity cannot be compared for models with content effects (see the sketch below for the underlying stm functions)
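Both metrics are also available as standalone functions in the stm package; a minimal sketch of computing per-topic values for the 30-topic model:
```{r, eval = FALSE}
# per-topic semantic coherence and exclusivity for stm_30
semanticCoherence(stm_30, documents = out$documents)
exclusivity(stm_30)
```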
## Appendix - fitting another model for comparisons
```{r, eval = FALSE}
stm_10 <- stm(documents = out$documents,
vocab = out$vocab,
data = out$meta,
K = 10,
prevalence = ~ school_metro_type * s(date_num), # interaction
verbose = FALSE) # suppress progress output
stm_effects10 <- estimateEffect(1:10 ~ school_metro_type +
gender * s(date_num),
stmobj = stm_10, metadata = out$meta)
save(stm_10, stm_effects10, file = "data/stm_donor_int.RData")
```
```{r}
load("data/stm_donor_int.RData") # reload data
```
## Appendix - calculating diagnostics (stminsights)
```{r}
diag <- get_diag(models = list(
model10 = stm_10, model30 = stm_30), out)
diag %>%
ggplot(aes(x = coherence, y = exclusivity, color = statistic)) +
geom_text(aes(label = name), nudge_x = 2) + geom_point() +
labs(x = 'Semantic Coherence', y = 'Exclusivity')
```
## Appendix - correlation networks (stminsights)
```{r, message=FALSE, warning=FALSE}
library(ggraph)
stm_corrs <- get_network(model = stm_30,
method = 'simple', # correlation criterion
cutoff = 0.05, # minimum correlation
labels = paste('T', 1:30),
cutiso = FALSE) # keep isolated nodes
```
## Appendix - correlation networks (stminsights)
```{r fig.height=5, fig.width=9, message=FALSE, warning=FALSE}
ggraph(stm_corrs, layout = 'fr') + geom_edge_link(
aes(edge_width = weight), label_colour = '#fc8d62',
edge_colour = '#377eb8') + geom_node_point(size = 4, colour = 'black') +
geom_node_label(aes(label = name, size = props),
colour = 'black', repel = TRUE, alpha = 0.85) +
scale_size(range = c(2, 10), labels = scales::percent) +
labs(size = 'Topic Proportion', edge_width = 'Topic Correlation') + theme_graph()
```