termco

Project Status: Active - The project has reached a stable, usable state and is being actively developed.

termco is a suite of functions used to count and find terms and substrings in strings. The tools can be used to build an expert-rules, regular-expression-based text classification model. The package wraps the data.table and stringi packages to create fast data frame counts of regular expression terms and substrings.

Functions

The main function of termco is term_count. It is used to extract regex term counts by grouping variable(s) as well as to generate classification models.

Most of the functions count, search, plot terms, and convert between output types, while a few remaining functions are used to train, test, and interpret models. Additionally, the probe_ family of functions generates lists of function calls or plots for given search terms. The table below lists each function, its use category, and its description:

| Function | Use Category | Description |
|---|---|---|
| `term_count` | count | Count regex term occurrence; modeling |
| `token_count` | count | Count fixed token occurrence; modeling |
| `frequent_terms`/`all_words` | count | Frequent terms |
| `important_terms` | count | Important terms |
| `hierarchical_coverage_term` | count | Unique coverage of a text vector by terms |
| `hierarchical_coverage_regex` | count | Unique coverage of a text vector by regex |
| `frequent_ngrams` | count | Weighted frequent ngram (2 & 3) collocations |
| `word_count` | count | Count words |
| `term_before`/`term_after` | count | Frequency of words before/after a regex term |
| `term_first` | count | Frequency of words at the beginning of strings |
| `colo` | search | Regex output to find term collocations |
| `search_term` | search | Search for regex terms |
| `match_word` | search | Extract words from a text matching a regular expression |
| `search_term_collocations` | search | Wrapper for search_term + frequent_terms |
| `classification_project` | modeling | Make a classification modeling project template |
| `classification_template` | modeling | Make a classification analysis script template |
| `as_dtm`/`as_tdm` | modeling | Coerce term_count object into tm::DocumentTermMatrix/tm::TermDocumentMatrix |
| `split_data` | modeling | Split data into train & test sets |
| `evaluate` | modeling | Check accuracy of model against human coder |
| `classify` | modeling | Assign n tags to text from a model |
| `get_text` | modeling | Get the original text for model tags |
| `coverage` | modeling | Coverage for term_count or search_term object |
| `uncovered`/`get_uncovered` | modeling | Get the uncovered text from a model |
| `mutate_counts` | modeling | Apply normalizing function to term count columns |
| `select_counts` | modeling | Select columns without stripping count classes |
| `tag_co_occurrence` | modeling | Explore co-occurrence of tags from a model |
| `validate_model`/`assign_validation_task` | modeling | Human validation of a term_count model |
| `read_term_list` | read/write | Read a term list from an external file |
| `write_term_list` | read/write | Write a term list to an external file |
| `term_list_template` | read/write | Write a term list template to an external file |
| `as_count` | convert | Strip pretty printing from term_count object |
| `as_terms` | convert | Convert a count matrix to list of term vectors |
| `as_term_list` | convert | Convert a vector of terms into a named term list |
| `weight` | convert | Weight a term_count object proportion/percent |
| `plot_ca` | plot | Plot term_count object as 3-D correspondence analysis map |
| `plot_counts` | plot | Horizontal bar plot of group counts |
| `plot_freq` | plot | Vertical bar plot of frequencies of counts |
| `plot_cum_percent` | plot | Plot frequent_terms object as cumulative percent |
| `probe_list` | probe | Generate list of search_term function calls |
| `probe_colo_list` | probe | Generate list of search_term_collocations function calls |
| `probe_colo_plot_list` | probe | Generate list of search_term_collocations + plot function calls |
| `probe_colo_plot` | probe | Plot probe_colo_plot_list directly |

Installation

To download the development version of termco:

Download the zip ball or tar ball, decompress and run R CMD INSTALL on it, or use the pacman package to install the development version:

if (!require("pacman")) install.packages("pacman")
pacman::p_load_gh(
    "trinker/gofastr",
    "trinker/termco"
)

Contact

You are welcome to:

Examples

The following examples demonstrate some of the functionality of termco.

Load the Tools/Data

if (!require("pacman")) install.packages("pacman")
pacman::p_load(dplyr, ggplot2, termco)

data(presidential_debates_2012)

Build Counts Dataframe

discoure_markers <- list(
    response_cries = c("\\boh", "\\bah", "aha", "ouch", "yuk"),
    back_channels = c("uh[- ]huh", "uhuh", "yeah"),
    summons = "hey",
    justification = "because"
)

counts <- presidential_debates_2012 %>%
    with(term_count(dialogue, grouping.var = list(person, time), discoure_markers))

counts

## Coverage: 100% 
## # A tibble: 10 x 7
##    person  time  n.words response_cries back_channels summons justification
##    <fct>   <fct>   <int> <chr>          <chr>         <chr>   <chr>        
##  1 OBAMA   time~    3599 3(.08%)        0             43(1.1~ 26(.72%)     
##  2 OBAMA   time~    7477 2(.03%)        0             42(.56~ 29(.39%)     
##  3 OBAMA   time~    7243 1(.01%)        1(.01%)       58(.80~ 33(.46%)     
##  4 ROMNEY  time~    4085 0              0             27(.66~ 8(.20%)      
##  5 ROMNEY  time~    7536 1(.01%)        3(.04%)       49(.65~ 20(.27%)     
##  6 ROMNEY  time~    8303 5(.06%)        0             84(1.0~ 19(.23%)     
##  7 CROWLEY time~    1672 2(.12%)        0             4(.24%) 12(.72%)     
##  8 LEHRER  time~     765 3(.39%)        3(.39%)       0       0            
##  9 QUESTI~ time~     583 2(.34%)        0             0       2(.34%)      
## 10 SCHIEF~ time~    1445 0              0             2(.14%) 6(.42%)
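
To make the idea behind the counts table concrete, here is a minimal base R sketch of counting regex matches per category and summing them by a grouping variable. It is not termco's implementation: the toy `text` and `group` vectors, the helper `count_matches`, and the choice to ignore case are all assumptions made for illustration.

```r
# Toy data standing in for a text column and a grouping column (hypothetical).
text  <- c("Oh, hey there", "because I said so", "hey hey", "Ah, because")
group <- c("A", "A", "B", "B")

# A named list of regexes, in the same shape as the term list above.
term_list <- list(
    response_cries = c("\\boh", "\\bah"),
    summons        = "hey",
    justification  = "because"
)

# Count all matches of one regex in each string (0 when there is no match).
# Ignoring case is an assumption of this sketch.
count_matches <- function(pattern, x) {
    m <- gregexpr(pattern, x, ignore.case = TRUE, perl = TRUE)
    vapply(m, function(hit) sum(hit > 0), integer(1))
}

# Sum the counts of every regex within a category: one column per category.
per_element <- sapply(term_list, function(patterns) {
    rowSums(sapply(patterns, count_matches, x = text))
})

# Aggregate the element level counts by group: one row per group.
toy_counts <- rowsum(per_element, group)
toy_counts
```

Each category column is the sum of matches over all of that category's regexes, which is why a single category can hold several patterns.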

Printing

print(counts, pretty = FALSE)

## Coverage: 100% 
## # A tibble: 10 x 7
##    person  time  n.words response_cries back_channels summons justification
##    <fct>   <fct>   <int>          <int>         <int>   <int>         <int>
##  1 OBAMA   time~    3599              3             0      43            26
##  2 OBAMA   time~    7477              2             0      42            29
##  3 OBAMA   time~    7243              1             1      58            33
##  4 ROMNEY  time~    4085              0             0      27             8
##  5 ROMNEY  time~    7536              1             3      49            20
##  6 ROMNEY  time~    8303              5             0      84            19
##  7 CROWLEY time~    1672              2             0       4            12
##  8 LEHRER  time~     765              3             3       0             0
##  9 QUESTI~ time~     583              2             0       0             2
## 10 SCHIEF~ time~    1445              0             0       2             6

print(counts, zero.replace = "_")

## Coverage: 100% 
## # A tibble: 10 x 7
##    person  time  n.words response_cries back_channels summons justification
##    <fct>   <fct>   <int> <chr>          <chr>         <chr>   <chr>        
##  1 OBAMA   time~    3599 3(.08%)        _             43(1.1~ 26(.72%)     
##  2 OBAMA   time~    7477 2(.03%)        _             42(.56~ 29(.39%)     
##  3 OBAMA   time~    7243 1(.01%)        1(.01%)       58(.80~ 33(.46%)     
##  4 ROMNEY  time~    4085 _              _             27(.66~ 8(.20%)      
##  5 ROMNEY  time~    7536 1(.01%)        3(.04%)       49(.65~ 20(.27%)     
##  6 ROMNEY  time~    8303 5(.06%)        _             84(1.0~ 19(.23%)     
##  7 CROWLEY time~    1672 2(.12%)        _             4(.24%) 12(.72%)     
##  8 LEHRER  time~     765 3(.39%)        3(.39%)       _       _            
##  9 QUESTI~ time~     583 2(.34%)        _             _       2(.34%)      
## 10 SCHIEF~ time~    1445 _              _             2(.14%) 6(.42%)

Plotting

plot(counts)

plot(counts, labels=TRUE)

plot_ca(counts, FALSE)

Ngram Collocations

termco wraps the quanteda package to examine important ngram collocations. quanteda’s collocation function provides measures of: "lambda", "z", and "frequency" to examine the strength of relationship between ngrams. termco adds stopword removal, min/max character filtering, and stemming to quanteda’s collocation as well as a generic plot method.

x <- presidential_debates_2012[["dialogue"]]

frequent_ngrams(x)

##            collocation length frequency count_nested    lambda         z
##  1:          make sure      2       127          127  7.554897 32.834995
##  2:    governor romney      2       105          104  9.271292 20.461487
##  3:         four years      2        63           63  7.338151 28.204976
##  4:   mister president      2        61           51  7.834853 19.748190
##  5:      united states      2        31           31  9.795398 17.356448
##  6:       middle class      2        30           30  8.777654 16.614018
##  7:          last four      2        27           27  6.115321 21.912251
##  8:    last four years      3        27            0 -1.566379 -1.028654
##  9:        health care      2        26           26  8.227977 20.429621
## 10:    american people      2        26           26  5.120440 19.048883
## 11:        middle east      2        26           26 10.742379  7.485044
## 12:   small businesses      2        22           22  7.762762 20.244536
## 13:        making sure      2        19           19  5.356647 17.260131
## 14:     million people      2        17           17  4.780434 15.493120
## 15: federal government      2        15           15  6.507298 17.346209
## 16:       young people      2        15           15  5.624489 14.208840
## 17:         dodd frank      2        15           15 14.718342  7.300509
## 18:     small business      2        13           13  7.102122 17.040580
## 19:      middle income      2        13           13  6.871943 15.504096
## 20:  governor romney's      2        13           13  8.786176  6.091802

frequent_ngrams(x, gram.length = 3)

## Warning in evalq(as.data.frame(list(collocation = c("make sure our", "the
## reason is", : restarting interrupted promise evaluation

##                 collocation length frequency count_nested     lambda
##  1:         last four years      3        27            0 -1.5663795
##  2:    twenty three million      3        11            0  4.1864145
##  3:   thousand nine hundred      3        11            0 -0.3498540
##  4:   middle class families      3        10            0 -4.2845959
##  5:   thousand five hundred      3         8            0 -1.1540324
##  6:    governor romney says      3         8            0 -4.0173427
##  7:    three million people      3         6            0 -0.1340563
##  8:         next four years      3         6            0 -0.8781626
##  9:    governor romney said      3         6            0 -3.3551066
## 10:  middle income families      3         6            0 -3.9239765
## 11:       five million jobs      3         5            0  2.1884488
## 12:         five point plan      3         5            0  2.7135106
## 13:   seven hundred sixteen      3         5            0  0.4756980
## 14: hundred sixteen billion      3         5            0  0.4265593
## 15:    dollar seven hundred      3         5            0 -0.6058730
## 16:    dollar five trillion      3         5            0 -0.8280113
## 17:       four years closer      3         5            0 -3.0979869
## 18:     forty seven million      3         4            0  0.7245824
## 19:   best education system      3         4            0  0.2615127
## 20:        rising take home      3         4            0 -0.1449834
##               z
##  1: -1.02865393
##  2:  1.68276560
##  3: -0.16798792
##  4: -2.63216754
##  5: -0.70519267
##  6: -2.26634962
##  7: -0.08481962
##  8: -0.42487275
##  9: -2.02007252
## 10: -2.43709086
## 11:  1.22412938
## 12:  1.22017842
## 13:  0.21031506
## 14:  0.16642917
## 15: -0.36134012
## 16: -0.39457112
## 17: -1.44984909
## 18:  0.32149800
## 19:  0.11938982
## 20: -0.06723986

frequent_ngrams(x, order.by = "lambda")

##                 collocation length frequency count_nested   lambda
##  1:              dodd frank      2        15           15 14.71834
##  2:         standard bearer      2         4            4 13.48186
##  3:   intellectual property      2         3            3 13.23057
##  4:            joint chiefs      2         3            3 13.23057
##  5:            apology tour      2         3            3 13.23057
##  6:           onest century      2         3            3 13.23057
##  7:             wall street      2         9            9 13.13031
##  8:              boca raton      2         2            2 12.89412
##  9:         abraham lincoln      2         2            2 12.89412
## 10:           raton florida      2         2            2 12.89412
## 11: unintended consequences      2         2            2 12.89412
## 12:         haqqani network      2         2            2 12.89412
## 13:      permanent resident      2         2            2 12.89412
## 14:      appleton wisconsin      2         2            2 12.89412
## 15:          prime minister      2         2            2 12.89412
## 16:             food stamps      2         9            9 12.61946
## 17:      planned parenthood      2         5            5 12.58386
## 18:        self deportation      2         4            4 12.38322
## 19:        cleveland clinic      2         3            3 12.13193
## 20:             rose garden      2         3            3 12.13193
##            z
##  1: 7.300509
##  2: 6.561118
##  3: 6.390952
##  4: 6.390952
##  5: 6.390952
##  6: 6.390952
##  7: 7.886454
##  8: 6.147013
##  9: 6.147013
## 10: 6.147013
## 11: 6.147013
## 12: 6.147013
## 13: 6.147013
## 14: 6.147013
## 15: 6.147013
## 16: 7.972817
## 17: 7.455987
## 18: 7.285615
## 19: 7.060603
## 20: 7.060603
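
The frequency column in the outputs above is, at its core, a count of adjacent word sequences. The base R sketch below shows that counting step only, for bigrams, on two made-up sentences. It does not compute lambda or z, and it is a simplification rather than what quanteda or termco do: the `stopwords` vector is hypothetical, and dropping stop words before pairing lets words on either side of a removed word form a bigram.

```r
# Two toy documents (hypothetical).
docs <- c("make sure we make sure", "governor romney said make sure")
stopwords <- c("we", "the", "a")  # hypothetical stop word list

# Tokenize, drop stop words, then pair each word with its successor.
bigrams <- unlist(lapply(strsplit(tolower(docs), "[^a-z']+"), function(w) {
    w <- w[nzchar(w) & !w %in% stopwords]
    if (length(w) < 2) return(character(0))
    paste(head(w, -1), tail(w, -1))
}))

# Frequency table of bigrams, most frequent first.
bigram_freq <- sort(table(bigrams), decreasing = TRUE)
bigram_freq
```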

Collocation Plotting

plot(frequent_ngrams(x))

plot(frequent_ngrams(x), drop.redundant.yaxis.text = FALSE)

plot(frequent_ngrams(x, gram.length = 3))

plot(frequent_ngrams(x, order.by = "lambda"))

Converting to Document Term Matrix

Regular expression counts can be useful features in machine learning models. The tm package’s DocumentTermMatrix is a popular data structure for machine learning in R. The as_dtm and as_tdm functions coerce the count data.table structure of a term_count object into a DocumentTermMatrix/TermDocumentMatrix. The result can be combined with token/word-only DocumentTermMatrix structures using cbind & rbind. The examples below assume that markers is a term_count object built in the same way as counts above.

as_dtm(markers)

## <<DocumentTermMatrix (documents: 10, terms: 4)>>
## Non-/sparse entries: 21/19
## Sparsity           : 48%
## Maximal term length: 14
## Weighting          : term frequency (tf)

cosine_distance <- function (x, ...) {
    x <- t(slam::as.simple_triplet_matrix(x))
    stats::as.dist(1 - slam::crossprod_simple_triplet_matrix(x)/(sqrt(slam::col_sums(x^2) %*% 
        t(slam::col_sums(x^2)))))
}


mod <- hclust(cosine_distance(as_dtm(markers)))
plot(mod)
rect.hclust(mod, k = 5, border = "red")

(clusters <- cutree(mod, 5))

##     OBAMA.time 1     OBAMA.time 2     OBAMA.time 3    ROMNEY.time 1 
##                1                1                1                1 
##    ROMNEY.time 2    ROMNEY.time 3   CROWLEY.time 2    LEHRER.time 1 
##                2                3                3                4 
##  QUESTION.time 2 SCHIEFFER.time 3 
##                5                1
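
For readers who want to see what the cosine distance used above computes, here is a dense-matrix sketch in base R. It follows the same formula (one minus the dot product of two document vectors divided by the product of their norms) but works on an ordinary matrix instead of a slam sparse matrix. The matrix `m` and the name `cosine_distance_dense` are made up for illustration.

```r
# Rows are documents, columns are term counts (toy values).
m <- rbind(
    doc_a = c(3, 0, 43, 26),
    doc_b = c(6, 0, 86, 52),   # doc_a doubled: same direction, so distance 0
    doc_c = c(0, 3, 0, 0)      # shares no terms with doc_a, so distance 1
)

cosine_distance_dense <- function(m) {
    norms <- sqrt(rowSums(m^2))                 # Euclidean norm of each row
    sim   <- tcrossprod(m) / (norms %o% norms)  # pairwise cosine similarity
    stats::as.dist(1 - sim)                     # convert to a dist object
}

d <- cosine_distance_dense(m)
round(as.matrix(d), 3)
```

Because cosine distance depends only on direction, documents with the same mix of tags cluster together regardless of how long they are.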

Building an Expert Rules, Regex Classifier Model

Machine learning classification models are great when you have known tags to train with because the model scales. Qualitative, expert-based human coding is terrific when you have no tagged data. However, when you have a larger, untagged data set, machine learning approaches have no outcome to learn from and the data is too large to classify by hand. One solution is an expert rules, regular expression approach that sits somewhere between machine learning and hand coding. Additionally, when each text element contains larger chunks of text, unsupervised clustering algorithms such as k-means, non-negative matrix factorization, hierarchical clustering, or topic modeling may be of use for creating clusters that could be interpreted and treated as categories.

This example section highlights the types of function combinations and their order in a typical expert rules classification. This task typically involves the combined use of available literature, close examination of term usage within text, and researcher experience. Building a classifier model requires the researcher to build a list of regular expressions that map to a category or tag. Below I outline a minimal workflow for classification.

Note that the user may want to begin with a classification model template that contains subdirectories and files for a classification project. The classification_project function generates this template with a pre-populated ‘classification.R’ script that can guide the user through the modeling process. The directory tree looks like the following:

template
    |
    |   .Rproj
    |   
    +---models
    |       categories.R
    |       
    +---data
    +---output
    +---plots
    +---reports
    \---scripts
            01_data_cleaning.R
            02_classification.R

Load the Tools/Data

if (!require("pacman")) install.packages("pacman")
pacman::p_load(dplyr, ggplot2, termco)

data(presidential_debates_2012)

Splitting Data

Many classification techniques require the data to be split into a training and test set to allow the researcher to observe how a model will perform on a new data set. Evaluating on held-out data also helps guard against over-fitting. The split_data function allows easy splitting of data.frame or vector data by integer or proportion. The function returns a named list containing a train and a test set. The printed view is a truncated version of the returned list, with |... indicating that there are additional observations.

set.seed(111)
(pres_deb_split <- split_data(presidential_debates_2012, .75))

## split_data:
## 
## train: n = 2184
## # A tibble: 6 x 5
##   person   tot    time  role    dialogue                                   
##   <fct>    <chr>  <fct> <fct>   <chr>                                      
## 1 CROWLEY  230.2  time~ modera~ Governor Romney?                           
## 2 SCHIEFF~ 48.1   time~ modera~ you're going to get a chance to respond to~
## 3 ROMNEY   98.15  time~ candid~ Let's have a flexible schedule so you can ~
## 4 ROMNEY   173.12 time~ candid~ But I find more troubling than this, that ~
## 5 OBAMA    102.6  time~ candid~ You know a major difference in this campai~
## 6 OBAMA    120.16 time~ candid~ Making sure that we are controlling our ow~
## |...
## 
## test: n = 728
## # A tibble: 6 x 5
##   person tot   time   role    dialogue                                     
##   <fct>  <chr> <fct>  <fct>   <chr>                                        
## 1 LEHRER 1.1   time 1 modera~ We'll talk about specifically about health c~
## 2 ROMNEY 2.2   time 1 candid~ And the president supports taking dollar sev~
## 3 ROMNEY 4.4   time 1 candid~ They get to choose and they'll have at least~
## 4 ROMNEY 4.5   time 1 candid~ So they don't have to pay additional money, ~
## 5 ROMNEY 4.7   time 1 candid~ They'll have at least two plans.             
## 6 ROMNEY 4.17  time 1 candid~ That's the plan that I've put forward.       
## |...

The training set can be accessed via pres_deb_split$train; likewise, the test set can be accessed by way of pres_deb_split$test.
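
As a rough picture of what a proportion split does, the base R sketch below samples 75% of the row indices for training and keeps the rest for testing. It is an illustration on a toy data frame, not the split_data source; in particular, how split_data rounds a non-integer training size is not shown in this document, so the `round()` call is an assumption.

```r
set.seed(111)

# Toy data frame standing in for the debate data (hypothetical).
toy <- data.frame(id = 1:8, text = letters[1:8])

# Number of training rows for a 0.75 proportion (rounding is an assumption).
n_train <- round(0.75 * nrow(toy))

# Sample row indices without replacement, then split into a named list.
train_rows <- sample(nrow(toy), n_train)
toy_split <- list(
    train = toy[train_rows, ],
    test  = toy[-train_rows, ]
)

sapply(toy_split, nrow)
```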

Here I show splitting by integer.

split_data(presidential_debates_2012, 100)

## split_data:
## 
## train: n = 100
## # A tibble: 6 x 5
##   person tot    time   role    dialogue                                    
##   <fct>  <chr>  <fct>  <fct>   <chr>                                       
## 1 OBAMA  102.4  time 2 candid~ Now, there are some other issues that have ~
## 2 ROMNEY 122.26 time 3 candid~ I've watched year in and year out as compan~
## 3 ROMNEY 166.16 time 3 candid~ The president's path will mean continuing d~
## 4 ROMNEY 162.18 time 3 candid~ Look, I love to I love teachers, and I'm ha~
## 5 OBAMA  20.3   time 2 candid~ We have increased oil production to the hig~
## 6 ROMNEY 59.12  time 1 candid~ Anybody can have deductions up to that amou~
## |...
## 
## test: n = 2812
## # A tibble: 6 x 5
##   person tot   time   role    dialogue                                     
##   <fct>  <chr> <fct>  <fct>   <chr>                                        
## 1 LEHRER 1.1   time 1 modera~ We'll talk about specifically about health c~
## 2 LEHRER 1.2   time 1 modera~ But what do you support the voucher system, ~
## 3 ROMNEY 2.1   time 1 candid~ What I support is no change for current reti~
## 4 ROMNEY 2.2   time 1 candid~ And the president supports taking dollar sev~
## 5 LEHRER 3.1   time 1 modera~ And what about the vouchers?                 
## 6 ROMNEY 4.1   time 1 candid~ So that's that's number one.                 
## |...

I could have trained on the training set and tested on the testing set in the following examples around modeling but have chosen not to for simplicity.

Understanding Term Use

In order to build the named list of regular expressions that map to a category/tag the researcher must understand the terms (particularly information salient terms) in context. Understanding term use helps the researcher begin to build a mental model of the topics being used, in a fashion similar to qualitative coding techniques. Broad categories will begin to coalesce as word use is elucidated. These categories form the initial names of the “named list of regular expressions”. Of course, building the regular expressions in the regex model building step will allow the researcher to see new ways in which terms are used as well as new important terms. This in turn will reshape, remove, and add names to the “named list of regular expressions”. This recursive process is captured in the model below.

[Figure: model of the recursive process]

View Most Used Words

A common task in building a model is to understand the most frequent words while excluding less information rich function words. The frequent_terms function produces an ordered data frame of counts. The researcher can exclude stop words and limit the terms to those whose character length falls between set thresholds. The output is ordered from most to least frequent for the top n terms but can be rearranged alphabetically.

presidential_debates_2012 %>%
    with(frequent_terms(dialogue))

##    term      frequency
## 1  going     271      
## 2  make      217      
## 3  people    214      
## 4  governor  204      
## 5  president 194      
## 6  said      178      
## 7  want      173      
## 8  sure      156      
## 9  just      134      
## 10 years     118      
## 11 jobs      116      
## 12 romney    110      
## 13 also      102      
## 14 know       97      
## 15 four       94      
## 16 world      92      
## 17 well       91      
## 18 right      88      
## 19 think      88      
## 20 america    87

presidential_debates_2012 %>%
    with(frequent_terms(dialogue, 40)) %>%
    plot()

A cumulative percent can give a different view of the term usage. The plot_cum_percent function converts a frequent_terms output into a cumulative percent plot. Additionally, frequent_ngrams + plot can give insight into the frequently occurring ngrams.

presidential_debates_2012 %>%
    with(frequent_terms(dialogue, 40)) %>%
    plot_cum_percent()
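
The arithmetic behind a cumulative percent view is a running sum of frequencies divided by a total. The sketch below uses the first five rows of the frequent_terms output shown earlier. Taking the total over only these five terms is an assumption made to keep the example small; the denominator plot_cum_percent actually uses is not stated in this document.

```r
# Top five term frequencies copied from the frequent_terms output above.
freq <- c(going = 271, make = 217, people = 214, governor = 204, president = 194)

# Running total divided by the overall total of the listed terms.
cum_percent <- cumsum(freq) / sum(freq)
round(cum_percent, 3)
```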

It may also be helpful to view the unique contribution of each term to the coverage, excluding all elements from the match vector that were previously matched by another term. The hierarchical_coverage_term function and its accompanying plot method allow for hierarchical exploration of the unique coverage of terms.

terms <- presidential_debates_2012 %>%
    with(frequent_terms(dialogue, 30)) %>%
    `[[`("term")

presidential_debates_2012 %>%
    with(hierarchical_coverage_term(dialogue, terms))

##          term       unique cumulative
## 1       going 0.0834478022  0.0834478
## 2        make 0.0576923077  0.1411401
## 3      people 0.0515109890  0.1926511
## 4    governor 0.0583791209  0.2510302
## 5   president 0.0480769231  0.2991071
## 6        said 0.0295329670  0.3286401
## 7        want 0.0305631868  0.3592033
## 8        sure 0.0058379121  0.3650412
## 9        just 0.0223214286  0.3873626
## 10      years 0.0240384615  0.4114011
## 11       jobs 0.0171703297  0.4285714
## 12     romney 0.0003434066  0.4289148
## 13       also 0.0140796703  0.4429945
## 14       know 0.0113324176  0.4543269
## 15       four 0.0054945055  0.4598214
## 16      world 0.0130494505  0.4728709
## 17       well 0.0147664835  0.4876374
## 18      right 0.0161401099  0.5037775
## 19      think 0.0113324176  0.5151099
## 20    america 0.0113324176  0.5264423
## 21     number 0.0109890110  0.5374313
## 22       back 0.0058379121  0.5432692
## 23       need 0.0089285714  0.5521978
## 24      first 0.0065247253  0.5587225
## 25     middle 0.0061813187  0.5649038
## 26   thousand 0.0085851648  0.5734890
## 27       time 0.0085851648  0.5820742
## 28    economy 0.0078983516  0.5899725
## 29 government 0.0082417582  0.5982143
## 30       work 0.0068681319  0.6050824

presidential_debates_2012 %>%
    with(hierarchical_coverage_term(dialogue, terms)) %>%
    plot(use.terms = TRUE)
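
The unique and cumulative columns above can be read as follows: each term is credited only with the elements that no earlier term already matched, and cumulative is the running sum of those credits. This is why a term like romney, which mostly co-occurs with governor, adds so little. The base R loop below sketches that logic on four toy strings; the fixed substring matching and the variable names are assumptions for illustration, not the package source.

```r
# Toy text vector and an ordered term vector (hypothetical).
text  <- c("we are going to make jobs", "people want jobs",
           "going home", "no match here")
terms <- c("going", "make", "people")

remaining  <- rep(TRUE, length(text))  # elements not yet covered by any term
unique_cov <- numeric(length(terms))

for (i in seq_along(terms)) {
    # Elements matched by this term that no earlier term has claimed.
    hit <- grepl(terms[i], text, fixed = TRUE) & remaining
    unique_cov[i] <- sum(hit) / length(text)
    remaining <- remaining & !hit
}

hier <- data.frame(term = terms, unique = unique_cov,
                   cumulative = cumsum(unique_cov))
hier
```

Here make gets a unique coverage of zero because its only match was already claimed by going, so term order matters.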

View Most Used Words in Context

Much of the exploration of terms in context, in an effort to build the named list of regular expressions that map to a category/tag, involves recursive views of frequent terms in context. The probe_ family of functions can generate lists of function calls (and copy them to the clipboard for easy transfer), allowing the user to cycle through term lists generated from other termco tools such as frequent_terms. This is meant to standardize and speed up the process.

The first probe_ tool makes a list of function calls for search_term using a term list. Here I show just 10 terms from frequent_terms. This can be pasted into a script and then run line by line to explore the frequent terms in context.

presidential_debates_2012 %>%
    with(frequent_terms(dialogue, 10)) %>%
    select(term) %>%
    unlist() %>%
    probe_list("presidential_debates_2012$dialogue") 

## search_term(presidential_debates_2012$dialogue, "going")
## search_term(presidential_debates_2012$dialogue, "make")
## search_term(presidential_debates_2012$dialogue, "people")
## search_term(presidential_debates_2012$dialogue, "governor")
## search_term(presidential_debates_2012$dialogue, "president")
## search_term(presidential_debates_2012$dialogue, "said")
## search_term(presidential_debates_2012$dialogue, "want")
## search_term(presidential_debates_2012$dialogue, "sure")
## search_term(presidential_debates_2012$dialogue, "just")
## search_term(presidential_debates_2012$dialogue, "years")
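
Generating such a list of calls is essentially string templating. A minimal base R sketch, assuming only that each call has the shape shown in the output above, might look like this (it does not copy anything to the clipboard, and the `terms` vector is typed in by hand rather than taken from frequent_terms):

```r
# Terms to probe and the expression that names the text vector.
terms    <- c("going", "make", "people")
text_var <- "presidential_debates_2012$dialogue"

# Build one search_term() call per term as a character string.
calls <- sprintf('search_term(%s, "%s")', text_var, terms)
cat(calls, sep = "\n")
```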

The next probe_ function generates a list of search_term_collocations function calls (search_term_collocations wraps search_term with frequent_terms and eliminates the search term from the output). This allows the user to systematically explore the words that frequently collocate with the original terms.

presidential_debates_2012 %>%
    with(frequent_terms(dialogue, 5)) %>%
    select(term) %>%
    unlist() %>%
    probe_colo_list("presidential_debates_2012$dialogue") 

## search_term_collocations(presidential_debates_2012$dialogue, "going")
## search_term_collocations(presidential_debates_2012$dialogue, "make")
## search_term_collocations(presidential_debates_2012$dialogue, "people")
## search_term_collocations(presidential_debates_2012$dialogue, "governor")
## search_term_collocations(presidential_debates_2012$dialogue, "president")

As search_term_collocations has a plot method the user may wish to generate function calls similar to probe_colo_list but wrapped with plot for a visual exploration of the data. The probe_colo_plot_list makes a list of such function calls, whereas the probe_colo_plot plots the output directly to a single external .pdf file.

presidential_debates_2012 %>%
    with(frequent_terms(dialogue, 5)) %>%
    select(term) %>%
    unlist() %>%
    probe_colo_plot_list("presidential_debates_2012$dialogue") 

## plot(search_term_collocations(presidential_debates_2012$dialogue, "going"))
## plot(search_term_collocations(presidential_debates_2012$dialogue, "make"))
## plot(search_term_collocations(presidential_debates_2012$dialogue, "people"))
## plot(search_term_collocations(presidential_debates_2012$dialogue, "governor"))
## plot(search_term_collocations(presidential_debates_2012$dialogue, "president"))

The plots can be generated externally with the probe_colo_plot function which makes multi-page .pdf of frequent terms bar plots; one plot for each term.

presidential_debates_2012 %>%
    with(frequent_terms(dialogue, 5)) %>%
    select(term) %>%
    unlist() %>%
    probe_colo_plot("presidential_debates_2012$dialogue") 

View Important Words

It may also be useful to view the top min-max scaled tf-idf weighted terms to allow the more information rich terms to bubble to the top. The important_terms function allows the user to do exactly this. The function works similarly to term_count but with an information weight.

presidential_debates_2012 %>%
    with(important_terms(dialogue, 10))

##         term    tf_idf
## 1      going 1.0000000
## 2       make 0.8570324
## 3     people 0.8482041
## 4   governor 0.8110754
## 5        get 0.7890439
## 6  president 0.7873159
## 7       said 0.7530954
## 8       want 0.7510015
## 9        one 0.6871579
## 10      sure 0.6852854
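
To show what "min-max scaled tf-idf" means, the sketch below computes a simple tf-idf score per term and rescales it to the 0 to 1 range, so the top term scores 1 as in the output above. The exact tf and idf variants that important_terms uses are not stated in this document; the corpus-wide term frequency and `log(N / df)` used here are assumptions, and the toy documents are made up.

```r
# Toy tokenized documents (hypothetical).
docs <- list(
    c("jobs", "jobs", "economy"),
    c("jobs", "taxes"),
    c("economy", "taxes", "taxes", "taxes"),
    c("budget")
)
vocab <- sort(unique(unlist(docs)))

# Term frequency across the whole corpus.
tf <- sapply(vocab, function(v) sum(unlist(docs) == v))

# Document frequency: number of documents containing the term.
df <- sapply(vocab, function(v) {
    sum(vapply(docs, function(d) v %in% d, logical(1)))
})

# Inverse document frequency and the combined weight.
idf    <- log(length(docs) / df)
tf_idf <- tf * idf

# Min-max scaling: lowest score maps to 0, highest to 1.
scaled <- (tf_idf - min(tf_idf)) / (max(tf_idf) - min(tf_idf))
sort(scaled, decreasing = TRUE)
```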

Building the Model

To build a model the researcher creates a named list of regular expressions that map to a category/tag. This is fed to the term_count function. term_count allows for aggregation by grouping variables, but for building the model we usually want observation level counts. Set grouping.var = TRUE to generate an id column of 1 through the number of observations, which gives the researcher the observation level counts.

discoure_markers <- list(
    response_cries = c("\\boh", "\\bah", "aha", "ouch", "yuk"),
    back_channels = c("uh[- ]huh", "uhuh", "yeah"),
    summons = "hey",
    justification = "because"
)

model <- presidential_debates_2012 %>%
    with(term_count(dialogue, grouping.var = TRUE, discoure_markers))

model

## Coverage: 13.02% 
## # A tibble: 2,912 x 6
##       id n.words response_cries back_channels summons justification
##    <int>   <int>          <int>         <int>   <int>         <int>
##  1     1      10              0             0       0             0
##  2     2       9              1             0       0             0
##  3     3      14              0             0       0             0
##  4     4      14              0             0       0             0
##  5     5       5              1             0       0             0
##  6     6       5              0             0       0             0
##  7     7      40              0             0       0             0
##  8     8       2              0             0       0             0
##  9     9      20              0             0       2             0
## 10    10      13              0             0       1             0
## # ... with 2,902 more rows

Testing the Model

In building a classifier the researcher is typically concerned with coverage, discrimination, and accuracy. The first two are easier to obtain while accuracy is not possible to compute without a comparison sample of expertly tagged data.

We want our model to be assigning tags to as many of the text elements as possible. The coverage function can provide an understanding of what percent of the data is tagged. Our model has relatively low coverage, indicating the regular expression model needs to be improved.

model %>%
    coverage()

## Coverage    : 13.0%
## Coverered   :   379
## Not Covered : 2,533
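
Coverage here is the share of text elements that received at least one tag. The base R sketch below computes that share on a small made-up count matrix and then reproduces the reported percentage from the covered and not covered figures above (379 and 2,533). The toy matrix is hypothetical; only the two reported figures come from the output.

```r
# Toy observation level counts: rows are text elements, columns are tags.
tag_counts <- rbind(
    c(0, 0, 0, 0),
    c(1, 0, 0, 0),
    c(0, 0, 2, 0),
    c(0, 0, 0, 0),
    c(0, 0, 1, 1)
)
colnames(tag_counts) <- c("response_cries", "back_channels",
                          "summons", "justification")

# An element is covered when any tag matched it at least once.
covered <- rowSums(tag_counts) > 0
coverage_rate <- mean(covered)
coverage_rate

# The same arithmetic applied to the figures reported above.
reported <- 379 / (379 + 2533)
round(100 * reported, 2)
```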

Understanding how well our model discriminates is important as well. We want the model to cover as close to 100% of the data as possible, but we likely want few tags assigned to each element. If the model assigns many tags to each element it is not discriminating well. The combination of as_terms + plot_freq provides a visual representation of the model’s ability to discriminate. The output is a bar plot showing the distribution of the number of tags at the element level. The goal is to have the largest density at 1 tag. Note that the plot also gives a view of coverage, as the zero bar shows the frequency of elements that could not be tagged. Our model has more elements with 1 tag than with > 1 tags, though the coverage is very poor. As the number of tags increases, the ability of the model to discriminate typically lessens. There is often a trade-off between model coverage and discrimination.

model %>%
    as_terms() %>%
    plot_freq(size=3) + xlab("Number of Tags")
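The distribution of the number of tags per element can also be tabulated directly. This is a base R sketch on a hypothetical count matrix, not the code behind plot_freq.

```r
# Hypothetical tag-count matrix: rows are text elements, columns are tags.
counts <- matrix(
    c(0, 0, 0,
      1, 0, 0,
      0, 2, 0,
      1, 0, 1),
    ncol = 3, byrow = TRUE,
    dimnames = list(NULL, c("response_cries", "summons", "justification"))
)

# Number of distinct tags assigned to each element (0 means not covered).
tags_per_element <- rowSums(counts > 0)

# Frequency of elements having 0, 1, 2, ... tags.
tag_distribution <- table(tags_per_element)
tag_distribution
```

A model that discriminates well concentrates its mass in the 1 tag cell, and the 0 cell doubles as a view of missing coverage.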

We may also want to see the distribution of the tags. The combination of as_terms + plot_counts gives this distribution. In our model the majority of tags are applied to the summons category.

model %>%
    as_terms() %>%
    plot_counts() + xlab("Tags")

Improving the Model

Improving Coverage

The model does not have very good coverage. To improve this the researcher will want to look at the data with no coverage to try to build additional regular expressions and categories. This requires understanding language, noticing additional features of the data with no coverage that may map to categories, and building regular expressions to model these features. This section will outline some of the tools that can be used to detect features and build regular expressions to model these language features.

We first want to view the untagged data. The uncovered function provides a logical vector that can be used to extract the text with no tags, while get_uncovered, used here, returns that text directly.

untagged <- get_uncovered(model)

head(untagged)

## [1] "We'll talk about specifically about health care in a moment."                                                                                                                                              
## [2] "What I support is no change for current retirees and near retirees to Medicare."                                                                                                                           
## [3] "And the president supports taking dollar seven hundred sixteen billion out of that program."                                                                                                               
## [4] "So that's that's number one."                                                                                                                                                                              
## [5] "Number two is for people coming along that are young, what I do to make sure that we can keep Medicare in place for them is to allow them either to choose the current Medicare program or a private plan."
## [6] "Their choice."

The frequent_terms function can be used again to understand common features of the untagged data.

untagged %>%
    frequent_terms()

##    term      frequency
## 1  going     211      
## 2  governor  177      
## 3  president 172      
## 4  people    169      
## 5  make      166      
## 6  said      149      
## 7  want      130      
## 8  sure      110      
## 9  just      107      
## 10 years     101      
## 11 jobs       96      
## 12 romney     95      
## 13 know       82      
## 14 four       81      
## 15 also       78      
## 16 america    77      
## 17 right      76      
## 18 well       74      
## 19 world      72      
## 20 think      66

We may see a common term such as the word right and want to see what other terms collocate with it. Using a regular expression that searches for multiple terms can improve a model’s accuracy and ability to discriminate. Using search_term in combination with frequent_terms can be a powerful way to see which words tend to collocate. Here I pass a regex for right (\\bright) to search_term. This pulls up the text that contains this term. I then use frequent_terms to see what words frequently occur with the word right. We notice the word people tends to occur with right.

untagged %>%
    search_term("\\bright") %>%
    frequent_terms(10, stopwords = "right")

##    term       frequency
## 1  that       32       
## 2  have       12       
## 3  people     10       
## 4  with        9       
## 5  this        8       
## 6  government  7       
## 7  course      6       
## 8  going       6       
## 9  it's        6       
## 10 president   6       
## 11 that's      6       
## 12 want        6       
## 13 you're      6

The search_term_collocations function provides a convenient wrapper for search_term + frequent_terms which also removes the search term from the output.

untagged %>%
    search_term_collocations("\\bright", n=10)

##    term       frequency
## 1  people     10       
## 2  government  7       
## 3  course      6       
## 4  going       6       
## 5  president   6       
## 6  want        6       
## 7  also        5       
## 8  governor    5       
## 9  jobs        5       
## 10 make        5
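The search-then-count idea can be sketched in base R as follows. This is only an approximation of the workflow, not the implementation of search_term_collocations; the example sentences are invented in the style of the debate data, and the stopword list is a tiny hypothetical one.

```r
# Invented sentences in the style of the debate data.
texts <- c(
    "That's not the right course for America.",
    "The right course for us is to help people.",
    "We need to create jobs here."
)

# Keep only the elements that match the search regex.
hits <- texts[grepl("\\bright", texts, ignore.case = TRUE, perl = TRUE)]

# Split into lower-case word tokens and drop empty strings.
words <- unlist(strsplit(tolower(hits), "[^a-z']+"))
words <- words[nzchar(words)]

# Drop the search term plus a small, hypothetical stopword list, then count.
drop <- c("right", "the", "for", "is", "to", "not", "us", "that's")
freq <- sort(table(words[!words %in% drop]), decreasing = TRUE)
freq
```

The most frequent remaining words are the candidate collocates to explore further.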

This is an exploratory act. Finding the right combination of features that occur together requires lots of recursive noticing, trialling, testing, reading, interpreting, and deciding. Having noticed above that the terms people and course appear with the term right, we will want to see these text elements. We can use a grouped-or expression with colo to build a regular expression that will search for any text elements that contain right together with either of these two terms, in either order. colo is more powerful than initially shown here; I demonstrate further functionality below. Here is the regex produced.

colo("\\bright", "(people|course)")

## [1] "((\\bright.*(people|course))|((people|course).*\\bright))"
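To see what this regex accepts, it can be tried against a few strings with base R's grepl. The sentences come from the debate data; the use of ignore.case = TRUE is an assumption, inferred from the fact that capitalised forms such as "People" and "Right" are matched in this section's search output.

```r
# The regex printed by colo() above, written as an R string.
rx <- "((\\bright.*(people|course))|((people|course).*\\bright))"

texts <- c(
    "That's not the right course for America.",
    "People can look it up, you're right.",
    "Their choice."
)

# ignore.case = TRUE is an assumption about how the search is applied.
matches <- grepl(rx, texts, ignore.case = TRUE, perl = TRUE)
matches
```

The first string matches through the first alternative (right before course), the second through the reversed alternative (people before right), and the third does not match at all.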

This is extremely powerful when used inside of search_term as the text containing this regular expression will be returned along with the coverage proportion on the uncovered data.

search_term(untagged, colo("\\bright", "(people|course)"))

## [1 of 15]
## 
## Right now, the CBO says up to twenty million people will lose their insurance
## as Obamacare goes into effect next year.
## 
## 
## ===================================
## [2 of 15]
## 
## The federal government taking over health care for the entire nation and
## whisking aside the tenth Amendment, which gives states the rights for these
## kinds of things, is not the course for America to have a stronger, more vibrant
## economy.
## 
## 
## ===================================
## [3 of 15]
## 
## And what we're seeing right now is, in my view, a a trickle down government
## approach, which has government thinking it can do a better job than free people
## pursuing their drea Miss And it's not working.
## 
## 
## ===================================
## [4 of 15]
## 
## And the challenges America faces right now look, the reason I'm in this race is
## there are people that are really hurting today in this country.
## 
## 
## ===================================
## [5 of 15]
## 
## It's going to help people across the country that are unemployed right now.
## 
## 
## ===================================
## [6 of 15]
## 
## That's not the right course for America.
## 
## 
## ===================================
## [7 of 15]
## 
## The right course for America is to have a true all of the above policy.
## 
## 
## ===================================
## [8 of 15]
## 
## When you've got thousands of people right now in Iowa, right now in Colorado,
## who are working, creating wind power with good paying manufacturing jobs, and
## the Republican senator in that in Iowa is all for it, providing tax breaks to
## help this work and Governor Romney says I'm opposed.
## 
## 
## ===================================
## [9 of 15]
## 
## When it comes to community colleges, we are setting up programs, including with
## Nassau Community College, to retrain workers, including young people who may
## have dropped out of school but now are getting another chance, training them
## for the jobs that exist right now.
## 
## 
## ===================================
## [10 of 15]
## 
## That's not the right course for us.
## 
## 
## ===================================
## [11 of 15]
## 
## The right course for us is to make sure that we go after the the people who are
## leaders of these various anti American groups and these these jihadists, but
## also help the Muslim world.
## 
## 
## ===================================
## [12 of 15]
## 
## And so the right course for us, is working through our partners and with our
## own resources, to identify responsible parties within Syria, organize them,
## bring them together in a in a form of if not government, a form of of of
## council that can take the lead in Syria.
## 
## 
## ===================================
## [13 of 15]
## 
## And it's widely reported that drones are being used in drone strikes, and I
## support that and entirely, and feel the president was right to up the usage of
## that technology, and believe that we should continue to use it, to continue to
## go after the people that represent a threat to this nation and to our friends.
## 
## 
## ===================================
## [14 of 15]
## 
## People can look it up, you're right.
## 
## 
## ===================================
## [15 of 15]
## 
## Those are the kinds of choices that the American people face right now.
## 
## 
## -----------------------------------
## coverage = .00592  >>>  15 of 2,533

We notice right away that the phrase right course appears often. We can create a search with just this expression.

Note that the decision to include a regular expression in the model is up to the researcher. We must guard against over-fitting the model, making it not transferable to new, similar contexts.

search_term(untagged, "right course")

## [1 of 5]
## 
## That's not the right course for America.
## 
## 
## ===================================
## [2 of 5]
## 
## The right course for America is to have a true all of the above policy.
## 
## 
## ===================================
## [3 of 5]
## 
## That's not the right course for us.
## 
## 
## ===================================
## [4 of 5]
## 
## The right course for us is to make sure that we go after the the people who are
## leaders of these various anti American groups and these these jihadists, but
## also help the Muslim world.
## 
## 
## ===================================
## [5 of 5]
## 
## And so the right course for us, is working through our partners and with our
## own resources, to identify responsible parties within Syria, organize them,
## bring them together in a in a form of if not government, a form of of of
## council that can take the lead in Syria.
## 
## 
## -----------------------------------
## coverage = .00197  >>>  5 of 2,533

Based on the frequent_terms output above, the word jobs also seems important. Again, we use the search_term + frequent_terms combo, here through the search_term_collocations wrapper, to extract words collocating with jobs.

search_term_collocations(untagged, "jobs", n=15)

##    term          frequency
## 1  million       17       
## 2  create        15       
## 3  going         15       
## 4  back          12       
## 5  country       11       
## 6  people        10       
## 7  make           9       
## 8  sure           9       
## 9  five           8       
## 10 hundred        8       
## 11 overseas       8       
## 12 want           8       
## 13 years          8       
## 14 businesses     7       
## 15 companies      7       
## 16 creating       7       
## 17 energy         7       
## 18 good           7       
## 19 just           7       
## 20 manufacturing  7       
## 21 thousand       7

As stated above, colo is a powerful search tool: it can take multiple regular expressions and also allows for negation (i.e., find x but not if y). To include multiple negations, use a grouped-or regex as shown below.

## Where do `jobs` and `create` collocate?
search_term(untagged, colo("jobs", "create")) 

## [1 of 21]
## 
## If I'm president I will create help create twelve million new jobs in this
## country with rising incomes.
## 
## 
## ===================================
## [2 of 21]
## 
## I know what it takes to create good jobs again.
## 
## 
## ===================================
## [3 of 21]
## 
## And what I want to do, is build on the five million jobs that we've created
## over the last thirty months in the private sector alone.
## 
## 
## ===================================
## [4 of 21]
## 
## It's going to help those families, and it's going to create incentives to start
## growing jobs again in this country.
## 
## 
## ===================================
## [5 of 21]
## 
## We created twenty three million new jobs.
## 
## 
## ===================================
## [6 of 21]
## 
## two million new jobs created.
## 
## 
## ===================================
## [7 of 21]
## 
## We've created five million jobs, and gone from eight hundred jobs a month being
## lost, and we are making progress.
## 
## 
## ===================================
## [8 of 21]
## 
## He keeps saying, Look, I've created five million jobs.
## 
## 
## ===================================
## [9 of 21]
## 
## eight percent, between that period the end of that recession and the equivalent
## of time to today, Ronald Reagan's recovery created twice as many jobs as this
## president's recovery.
## 
## 
## ===================================
## [10 of 21]
## 
## This is the way we're going to create jobs in this country.
## 
## 
## ===================================
## [11 of 21]
## 
## We have to be competitive if we're going to create more jobs here.
## 
## 
## ===================================
## [12 of 21]
## 
## We need to create jobs here.
## 
## 
## ===================================
## [13 of 21]
## 
## And it's estimated that that will create eight hundred thousand new jobs.
## 
## 
## ===================================
## [14 of 21]
## 
## That's not the way we're going to create jobs here.
## 
## 
## ===================================
## [15 of 21]
## 
## The way we're going to create jobs here is not just to change our tax code, but
## also to double our exports.
## 
## 
## ===================================
## [16 of 21]
## 
## That's going to help to create jobs here.
## 
## 
## ===================================
## [17 of 21]
## 
## Government does not create jobs.
## 
## 
## ===================================
## [18 of 21]
## 
## Government does not create jobs.
## 
## 
## ===================================
## [19 of 21]
## 
## Barry, I think a lot of this campaign, maybe over the last four years, has been
## devoted to this nation that I think government creates jobs, that that somehow
## is the answer.
## 
## 
## ===================================
## [20 of 21]
## 
## And when it comes to our economy here at home, I know what it takes to create
## twelve million new jobs and rising take home pay.
## 
## 
## ===================================
## [21 of 21]
## 
## And Governor Romney wants to take us back to those policies, a foreign policy
## that's wrong and reckless, economic policies that won't create jobs, won't
## reduce our deficit, but will make sure that folks at the very top don't have to
## play by the same rules that you do.
## 
## 
## -----------------------------------
## coverage = .00829  >>>  21 of 2,533

## Where do `jobs`, `create`, and the word `not` collocate?
search_term(untagged, colo("jobs", "create", "(not|'nt)")) 

## [1 of 4]
## 
## That's not the way we're going to create jobs here.
## 
## 
## ===================================
## [2 of 4]
## 
## The way we're going to create jobs here is not just to change our tax code, but
## also to double our exports.
## 
## 
## ===================================
## [3 of 4]
## 
## Government does not create jobs.
## 
## 
## ===================================
## [4 of 4]
## 
## Government does not create jobs.
## 
## 
## -----------------------------------
## coverage = .00158  >>>  4 of 2,533

## Where do `jobs` and `create` collocate without a `not` word?
search_term(untagged, colo("jobs", "create", not = "(not|'nt)")) 

## [1 of 17]
## 
## If I'm president I will create help create twelve million new jobs in this
## country with rising incomes.
## 
## 
## ===================================
## [2 of 17]
## 
## I know what it takes to create good jobs again.
## 
## 
## ===================================
## [3 of 17]
## 
## And what I want to do, is build on the five million jobs that we've created
## over the last thirty months in the private sector alone.
## 
## 
## ===================================
## [4 of 17]
## 
## It's going to help those families, and it's going to create incentives to start
## growing jobs again in this country.
## 
## 
## ===================================
## [5 of 17]
## 
## We created twenty three million new jobs.
## 
## 
## ===================================
## [6 of 17]
## 
## two million new jobs created.
## 
## 
## ===================================
## [7 of 17]
## 
## We've created five million jobs, and gone from eight hundred jobs a month being
## lost, and we are making progress.
## 
## 
## ===================================
## [8 of 17]
## 
## He keeps saying, Look, I've created five million jobs.
## 
## 
## ===================================
## [9 of 17]
## 
## eight percent, between that period the end of that recession and the equivalent
## of time to today, Ronald Reagan's recovery created twice as many jobs as this
## president's recovery.
## 
## 
## ===================================
## [10 of 17]
## 
## This is the way we're going to create jobs in this country.
## 
## 
## ===================================
## [11 of 17]
## 
## We have to be competitive if we're going to create more jobs here.
## 
## 
## ===================================
## [12 of 17]
## 
## We need to create jobs here.
## 
## 
## ===================================
## [13 of 17]
## 
## And it's estimated that that will create eight hundred thousand new jobs.
## 
## 
## ===================================
## [14 of 17]
## 
## That's going to help to create jobs here.
## 
## 
## ===================================
## [15 of 17]
## 
## Barry, I think a lot of this campaign, maybe over the last four years, has been
## devoted to this nation that I think government creates jobs, that that somehow
## is the answer.
## 
## 
## ===================================
## [16 of 17]
## 
## And when it comes to our economy here at home, I know what it takes to create
## twelve million new jobs and rising take home pay.
## 
## 
## ===================================
## [17 of 17]
## 
## And Governor Romney wants to take us back to those policies, a foreign policy
## that's wrong and reckless, economic policies that won't create jobs, won't
## reduce our deficit, but will make sure that folks at the very top don't have to
## play by the same rules that you do.
## 
## 
## -----------------------------------
## coverage = .00671  >>>  17 of 2,533

## Where do `jobs`, `romney`, and `create` collocate?
search_term(untagged, colo("jobs", "create", "romney")) 

## [1 of 1]
## 
## And Governor Romney wants to take us back to those policies, a foreign policy
## that's wrong and reckless, economic policies that won't create jobs, won't
## reduce our deficit, but will make sure that folks at the very top don't have to
## play by the same rules that you do.
## 
## 
## -----------------------------------
## coverage = .00039  >>>  1 of 2,533
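The logic of "all of these terms, but none of those" can be sketched with plain grepl calls. This is not colo's implementation, only the same idea in base R. The helper has is hypothetical, the third sentence is shortened from the debate output, and the negation pattern is written n't here so that contractions such as won't are caught; that spelling is a choice made for this sketch.

```r
texts <- c(
    "We need to create jobs here.",
    "Government does not create jobs.",
    "Economic policies that won't create jobs."
)

# Hypothetical helper: TRUE when the regex matches, ignoring case.
has <- function(pattern, x) grepl(pattern, x, ignore.case = TRUE, perl = TRUE)

# Both terms present, and no negation word or n't contraction anywhere.
keep <- has("jobs", texts) & has("create", texts) &
    !has("(\\bnot\\b|n't)", texts)
texts[keep]
```

Only the first sentence survives, since the other two contain a negation.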

Here is one more example with colo for the words jobs and overseas. The user may want to quickly test and then transfer the regex created by colo to the regular expression list. By setting options(termco.copy2clip = TRUE) the user globally sets colo to use the clipr package to copy the regex to the clipboard for a smoother workflow.

search_term(untagged, colo("jobs", "overseas")) 

## [1 of 8]
## 
## And everything that I've tried to do, and everything that I'm now proposing for
## the next four years in terms of improving our education system or developing
## American energy or making sure that we're closing loopholes for companies that
## are shipping jobs overseas and focusing on small businesses and companies that
## are creating jobs here in the United States, or closing our deficit in a
## responsible, balanced way that allows us to invest in our future.
## 
## 
## ===================================
## [2 of 8]
## 
## You can ship jobs overseas and get tax breaks for it.
## 
## 
## ===================================
## [3 of 8]
## 
## The outsourcing of American jobs overseas has taken a toll on our economy.
## 
## 
## ===================================
## [4 of 8]
## 
## Making sure that we're bringing manufacturing back to our shores so that we're
## creating jobs here, as we've done with the auto industry, not rewarding
## companies that are shipping jobs overseas.
## 
## 
## ===================================
## [5 of 8]
## 
## I know Americans had seen jobs being shipped overseas; businesses and workers
## not getting a level playing field when it came to trade.
## 
## 
## ===================================
## [6 of 8]
## 
## Having a tax code that rewards companies that are shipping jobs overseas
## instead of companies that are investing here in the United States, that will
## not make us more competitive.
## 
## 
## ===================================
## [7 of 8]
## 
## And the one thing that I'm absolutely clear about is that after a decade in
## which we saw drift, jobs being shipped overseas, nobody championing American
## workers and American businesses, we've now begun to make some real progress.
## 
## 
## ===================================
## [8 of 8]
## 
## And I've put forward a plan to make sure that we're bringing manufacturing jobs
## back to our shores by rewarding companies and small businesses that are
## investing here, not overseas.
## 
## 
## -----------------------------------
## coverage = .00316  >>>  8 of 2,533

The researcher uses an iterative process to continue to build the regular expression list. The term_count function builds the matrix of counts to further test the model. The use of (a) coverage, (b) as_terms + plot_counts, and (c) as_terms + plot_freq will allow for continued testing of model functioning.

Improving Discrimination

It is often desirable to improve discrimination. While the bar plot highlighting the distribution of the number of tags is useful, it only indicates whether there is a problem, not where the problem lies. The tag_co_occurrence function produces a list of data.frames and matrices that aid in understanding how to improve discrimination. This list is useful, but the plot method provides an improved visual view of the co-occurrences of tags.

The network plot on the left shows the strength of relationships between tags, while the plot on the right shows the average number of other tags that co-occur with each regex tag. In this particular case the plot combo is not complex because of the limited number of regex tags. Note that the edge strength is relative to all other edges. The strength has to be considered in the context of the bar/dot plot on the right, which shows the average number of other tags that co-occur with each regex tag. As the number of tags increases the plot increases in complexity. The unconnected nodes and shorter bars represent the tags that provide the best discriminatory power, whereas the other tags have the potential to be redundant.

tag_co_occurrence(model) %>%
    plot(min.edge.cutoff = .01)
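A raw co-occurrence table can be built from a binary tag matrix with a single cross product. The sketch below uses a hypothetical count matrix and base R only; it illustrates the concept and makes no claim about what tag_co_occurrence computes internally.

```r
# Hypothetical tag-count matrix: rows are text elements, columns are tags.
counts <- matrix(
    c(1, 0, 1,
      0, 1, 0,
      1, 0, 1,
      0, 0, 1),
    ncol = 3, byrow = TRUE,
    dimnames = list(NULL, c("summons", "back_channels", "justification"))
)

# Binary indicator: was the tag applied to the element at all?
tagged <- (counts > 0) * 1

# Off-diagonal cells count the elements in which two tags appear together;
# the diagonal holds the number of elements carrying each tag.
co_occurrence <- crossprod(tagged)
co_occurrence
```

Here summons and justification co-occur twice, while back_channels never overlaps with another tag and so discriminates best.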

Another way to view the overlapping complexity and relationships between tags is to use an Upset plot. The plot_upset function wraps UpSetR::upset and is made to handle term_count objects directly. Upset plots are complex and require study of the method in order to interpret the results (http://caleydo.org/tools/upset). The time invested in learning this plot type can be very fruitful in utilizing a technique that scales to the types of data sets that termco outputs. This tool can be useful in order to understand overlap and thus improve discrimination.

plot_upset(model) 

Categorizing/Tagging

The classify function enables the researcher to apply n tags to each text element. Depending on the text and the regular expression list’s ability, multiple tags may be applied to a text. The n argument sets the maximum number of tags, though the function does not guarantee that this many (or any) tags will be assigned.

Here I show the head of the returned vector (if n > 1 a list may be returned) as well as a table and plot of the counts. Use n = Inf to return all tags.

classify(model) %>%
    head()

## [1] NA               "response_cries" NA               NA              
## [5] "response_cries" NA

classify(model) %>%
    unlist() %>%
    table()

## .
##  back_channels  justification response_cries        summons 
##              6            125             17            231

classify(model) %>%
    unlist() %>%
    plot_counts() + xlab("Tags")
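One plausible way to picture tag assignment is to take, for each element, the tags with the highest counts, up to n of them. The helper top_tags below is hypothetical and only sketches that idea in base R; how classify actually ranks tags and breaks ties is not shown in this document.

```r
# Hypothetical tag-count matrix: rows are text elements, columns are tags.
counts <- matrix(
    c(0, 0, 0,
      2, 0, 1,
      0, 1, 0),
    ncol = 3, byrow = TRUE,
    dimnames = list(NULL, c("response_cries", "summons", "justification"))
)

# Hypothetical helper: return up to n tags per row, highest count first.
# Rows with no matches get NA. Tie handling is left to order() and is an
# arbitrary choice for this sketch.
top_tags <- function(counts, n = 1) {
    lapply(seq_len(nrow(counts)), function(i) {
        row <- counts[i, ]
        row <- row[row > 0]
        if (length(row) == 0) return(NA_character_)
        ranked <- names(row)[order(row, decreasing = TRUE)]
        ranked[seq_len(min(n, length(ranked)))]
    })
}

single <- unlist(top_tags(counts, n = 1))
all_tags <- top_tags(counts, n = Inf)
single
all_tags[[2]]
```

With n = 1 the result collapses to a vector with NA for untagged elements; with n = Inf every matching tag is kept, so a list is needed.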

Evaluation: Accuracy

Pre Coded Data

The evaluate function is a more formal method of evaluation than validate_model. The evaluate function tests a model’s accuracy, precision, and recall using macro and micro averages of the confusion matrices for each tag, as outlined by Dan Jurafsky & Chris Manning. The function requires a known, human coded sample. In the example below I randomly generate a “known human coded tagged” vector. Obviously, this is for demonstration purposes. The function prints a formatted list of results. Note that if a larger, known tagged data set is available the user may want to strongly consider machine learning models (see: RTextTools).

This minimal example will provide insight into the way the evaluate scores behave:

known <- list(1:3, 3, NA, 4:5, 2:4, 5, integer(0))
tagged <- list(1:3, 3, 4, 5:4, c(2, 4:3), 5, integer(0))
evaluate(tagged, known)

## ----------------------------------------------- 
## Tag Level Measures
## ----------------------------------------------- 
##           tag precision recall F_score accuracy
##             1     1.000  1.000   1.000    1.000
##             2     1.000  1.000   1.000    1.000
##             3     1.000  1.000   1.000    1.000
##             4      .667  1.000    .800     .857
##             5     1.000  1.000   1.000    1.000
## No_Code_Given      .000   .000    .000     .857
## 
## -------------------- 
## Summary Measures
## -------------------- 
## N:                 7
## 
## Macro-Averaged  
##   Accuracy:     .952
##   F-score:      .800
##   Precision:    .778
##   Recall:       .833
## 
## Micro-Averaged  
##   Accuracy:     .952
##   F-score:      .909
##   Precision:    .909
##   Recall:       .909
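The tag level row for tag 4 can be checked by hand from a per-tag confusion matrix. The base R sketch below recomputes it from the same known and tagged lists; the helper has_tag is hypothetical, and the formulas are the standard precision, recall, F-score, and accuracy definitions rather than code taken from evaluate.

```r
known  <- list(1:3, 3, NA, 4:5, 2:4, 5, integer(0))
tagged <- list(1:3, 3, 4, 5:4, c(2, 4:3), 5, integer(0))

# Hypothetical helper: does each element carry the given tag?
has_tag <- function(x, tag) vapply(x, function(el) tag %in% el, logical(1))

pred  <- has_tag(tagged, 4)
truth <- has_tag(known, 4)

tp <- sum(pred & truth)    # tagged 4 and known 4
fp <- sum(pred & !truth)   # tagged 4 but not known 4
fn <- sum(!pred & truth)   # known 4 but not tagged 4
tn <- sum(!pred & !truth)  # neither

precision <- tp / (tp + fp)
recall    <- tp / (tp + fn)
f_score   <- 2 * precision * recall / (precision + recall)
accuracy  <- (tp + tn) / length(known)

tag4 <- round(c(precision = precision, recall = recall,
                F_score = f_score, accuracy = accuracy), 3)
tag4
```

The single false positive (the third element, tagged 4 but known NA) is what pulls precision down to .667 while recall stays at 1.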

Below we create fake “known” tags to test evaluate with real data (though the comparison is fabricated).

mod1 <- presidential_debates_2012 %>%
    with(term_count(dialogue, TRUE, discoure_markers)) %>%
    classify()

fake_known <- mod1
set.seed(1)
fake_known[sample(1:length(fake_known), 300)] <- "random noise"

evaluate(mod1, fake_known)

## ------------------------------------------------ 
## Tag Level Measures
## ------------------------------------------------ 
##            tag precision recall F_score accuracy
##  back_channels     1.000  1.000   1.000    1.000
##  justification      .902  1.000    .949     .996
##  No_Code_Given      .896  1.000    .945     .909
##   random noise      .000   .000    .000     .897
## response_cries      .812  1.000    .897     .999
##        summons      .910  1.000    .953     .993
## 
## -------------------- 
## Summary Measures
## -------------------- 
## N:             2,912
## 
## Macro-Averaged  
##   Accuracy:     .966
##   F-score:      .791
##   Precision:    .753
##   Recall:       .833
## 
## Micro-Averaged  
##   Accuracy:     .966
##   F-score:      .897
##   Precision:    .897
##   Recall:       .897

Post Coding Data

It is often useful to validate a model less formally via human evaluation, checking that text is being tagged as expected. This approach is more formative and less rigorous than evaluate, and is intended to assess model functioning in order to improve it. The validate_model function provides an interactive interface for a single evaluator to sample n tags and corresponding texts and assess the accuracy of the tag to the text. The assign_validation_task function generates external file(s) for n coders for redundancy of code assessments. This may be of use in Mechanical Turk type applications. The example below demonstrates validate_model’s print/summary and plot outputs.

validated <- model %>%
    validate_model()

After validate_model has been run, the print/summary and plot methods provide the accuracy of each tag and a confidence interval (note that the confidence band is highly affected by the number of samples per tag).

validated

## -------
## Overall:
## -------
##    accuracy n.tagged n.classified sampled  se lower upper
## 1:    59.6%      484          328      57 .06 46.9% 72.4%
## 
## 
## ---------------
## Individual Tags:
## ---------------
##               tag accuracy n.tagged n.classified sampled  se lower  upper
## 1:  back_channels    83.3%        7            6       6 .15 53.5% 100.0%
## 2: response_cries    72.7%       13           11      11 .13 46.4%  99.0%
## 3:  justification    55.0%      155          122      20 .11 33.2%  76.8%
## 4:        summons    50.0%      309          189      20 .11 28.1%  71.9%

plot(validated)
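The se, lower, and upper columns above are consistent with a normal approximation interval for a proportion, clipped to the range 0 to 1. The sketch below works under that assumption; the helper accuracy_ci is hypothetical, and the counts of correct tags (34 of 57, 5 of 6) are inferred from the printed percentages rather than taken from the package.

```r
# Hypothetical helper: normal-approximation interval for a proportion,
# clipped to [0, 1].
accuracy_ci <- function(correct, sampled, level = 0.95) {
    p  <- correct / sampled
    se <- sqrt(p * (1 - p) / sampled)
    z  <- qnorm(1 - (1 - level) / 2)
    c(accuracy = p, se = se,
      lower = max(0, p - z * se), upper = min(1, p + z * se))
}

# 34 of 57 and 5 of 6 are inferred from the printed 59.6% and 83.3%.
overall <- accuracy_ci(34, 57)
back_channels <- accuracy_ci(5, 6)

round(100 * overall[c("accuracy", "lower", "upper")], 1)
round(100 * back_channels[c("accuracy", "lower", "upper")], 1)
```

With only 6 sampled texts the back_channels interval spans almost half the scale, which illustrates how strongly the band depends on the number of samples per tag.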

These examples give guidance on how to use the tools in the termco package to build an expert rules, regular expression text classification model.