An R package that implements multiple measures of word keyness used in computational linguistics. It provides both statistical significance and effect size measures through a few simple functions.
This package has been updated to support new versions of the quanteda package (>v3). As a result, it is no longer compatible with earlier quanteda versions. The old functions, compatible with quanteda v2, can be found in the R folder, in the script "frequency_table_creator_old.R".
This package currently supports the following measures:
- Statistical significance: Log-likelihood ratio, Bayesian Information Criterion.
- Effect size: Effect Size of Log Likelihood, %DIFF, Relative Risk, Log Ratio, Odds Ratio.
For more details on using the effect size measures, consult the vignette "KeynessMeasures: Introduction to effect size measures".
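
As a rough illustration of what these measures compute, the sketch below applies their commonly cited textbook definitions to a single word. It is not taken from the package source: the function name `keyness_sketch`, its arguments, and the counts are hypothetical, and the package's own formulas (in particular its handling of zero frequencies) may differ.

```r
# Textbook-style definitions for one word; NOT the package's implementation.
# a, b   : hypothetical frequency of the word in the target / reference corpus
# n1, n2 : hypothetical total token counts of the target / reference corpus
keyness_sketch <- function(a, b, n1, n2) {
  n <- n1 + n2

  # Expected frequencies if the word were equally common in both corpora
  e1 <- n1 * (a + b) / n
  e2 <- n2 * (a + b) / n

  # A zero count contributes nothing to the log-likelihood sum
  ll_term <- function(obs, expected) {
    if (obs > 0) obs * log(obs / expected) else 0
  }
  log_likelihood <- 2 * (ll_term(a, e1) + ll_term(b, e2))

  # Relative frequencies in each corpus
  rel_target <- a / n1
  rel_reference <- b / n2

  list(
    log_likelihood = log_likelihood,
    bic            = log_likelihood - log(n),                 # 1 degree of freedom
    ell            = log_likelihood / (n * log(min(e1, e2))),
    perc_diff      = (rel_target - rel_reference) / rel_reference * 100,
    relative_risk  = rel_target / rel_reference,
    log_ratio      = log2(rel_target / rel_reference),
    odds_ratio     = (a / (n1 - a)) / (b / (n2 - b))
  )
}

# Hypothetical word: 20 occurrences in 1000 target tokens,
# 10 occurrences in 1000 reference tokens
result <- keyness_sketch(a = 20, b = 10, n1 = 1000, n2 = 1000)
print(result)
```

The sketch makes no adjustment for words that never occur in the reference corpus, so a reference frequency of 0 yields infinite effect sizes. The results shown later in this README report finite log_ratio values for such words, so the package evidently treats zeros differently for at least that measure.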
This package can be installed from GitHub using devtools:
devtools::install_github("amacanovic/KeynessMeasures")
Then, load the package:
library(KeynessMeasures)
For more information on included measures, type:
vignette("Effect_size_measures")
For a detailed tutorial vignette, type:
vignette("Keyness_Measures_Tutorial")
Using the keyness_measure_calculator() function, one can easily obtain several measures of word keyness in a target corpus compared to a reference corpus.
The demonstration below uses the Jane Austen data from the janeaustenr package (Silge, 2017). We will explore key words in her novel "Emma" compared to her 5 other novels.
jane_austen_data <- janeaustenr::austen_books()
First, obtain the word frequencies in "Emma" (target corpus) and the other novels (reference corpus) using the frequency_table_creator() function:
frequency_table <- frequency_table_creator(jane_austen_data,
                                           text_field = "text",
                                           grouping_variable = "book",
                                           grouping_variable_target = "Emma",
                                           lemmatize = TRUE,
                                           remove_punct = TRUE,
                                           remove_symbols = TRUE,
                                           remove_numbers = TRUE,
                                           remove_url = TRUE)
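
To make the idea of a target/reference frequency table concrete, here is a minimal base R sketch on toy data. It is only an illustration under stated assumptions: the toy sentences, the helper names `tokenize` and `count_in`, and the simple tokenization rule are hypothetical, and the column names are borrowed from the results table in this README. The package itself relies on quanteda and offers options such as lemmatization that this sketch does not attempt.

```r
# Hypothetical toy data with the same two columns used above ("text" and "book")
toy_data <- data.frame(
  book = c("Emma", "Emma", "Persuasion"),
  text = c("Emma smiled", "Emma walked home", "Anne walked home"),
  stringsAsFactors = FALSE
)

# Lowercase the text and split on anything that is not a letter or an apostrophe
tokenize <- function(texts) {
  tokens <- unlist(strsplit(tolower(texts), "[^a-z']+"))
  tokens[nzchar(tokens)]
}

# "Emma" is the target corpus; every other book forms the reference corpus
target_tokens <- tokenize(toy_data$text[toy_data$book == "Emma"])
reference_tokens <- tokenize(toy_data$text[toy_data$book != "Emma"])

# Count every word of the joint vocabulary in a given token vector
words <- sort(unique(c(target_tokens, reference_tokens)))
count_in <- function(tokens) {
  vapply(words, function(w) sum(tokens == w), integer(1), USE.NAMES = FALSE)
}

toy_frequency_table <- data.frame(
  word = words,
  freq_target_corpus = count_in(target_tokens),
  freq_reference_corpus = count_in(reference_tokens),
  stringsAsFactors = FALSE
)
print(toy_frequency_table)
```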
Then, calculate all the keyness measures using the keyness_measure_calculator() function, sorting the entries by decreasing log likelihood:
keyness_measures <- keyness_measure_calculator(
  frequency_table,
  log_likelihood = TRUE,
  ell = TRUE,
  bic = TRUE,
  perc_diff = TRUE,
  relative_risk = TRUE,
  log_ratio = TRUE,
  odds_ratio = TRUE,
  sort = "decreasing",
  sort_by = "log_likelihood")
The first 10 rows of the results, sorted by log likelihood:

| | word | freq_target_corpus | freq_reference_corpus | word_use | log_likelihood | ell | bic | perc_diff | relative_risk | log_ratio | odds_ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 7594 | emma | 786 | 1 | overuse | 2350.3189 | 0.0006281 | 2336.8255 | 2.751670e+05 | 2752.6701 | 11.426616 | 2766.17371 |
| 5442 | harriet | 415 | 4 | overuse | 1205.6112 | 0.0003670 | 1192.1178 | 3.623455e+04 | 363.3455 | 8.505198 | 364.28214 |
| 7601 | weston | 389 | 0 | overuse | 1170.5395 | 0.0003623 | 1157.0461 | 2.416870e+17 | Inf | 11.411857 | Inf |
| 7613 | knightley | 356 | 0 | overuse | 1071.2392 | 0.0003383 | 1057.7458 | 2.211840e+17 | Inf | 11.283964 | Inf |
| 7622 | elton | 320 | 0 | overuse | 962.9117 | 0.0003117 | 949.4183 | 1.988170e+17 | Inf | 11.130159 | Inf |
| 7595 | woodhouse | 278 | 0 | overuse | 836.5295 | 0.0002800 | 823.0361 | 1.727223e+17 | Inf | 10.927172 | Inf |
| 7794 | fairfax | 210 | 0 | overuse | 631.9108 | 0.0002269 | 618.4174 | 1.304737e+17 | Inf | 10.522476 | Inf |
| 7625 | churchill | 193 | 0 | overuse | 580.7561 | 0.0002133 | 567.2627 | 1.199115e+17 | Inf | 10.400688 | Inf |
| 1727 | frank | 200 | 9 | overuse | 532.1223 | 0.0001913 | 518.6289 | 7.682500e+03 | 77.8250 | 6.282162 | 77.92058 |
| 7605 | hartfield | 159 | 0 | overuse | 478.4467 | 0.0001852 | 464.9533 | 9.878722e+16 | Inf | 10.121113 | Inf |

The 10 rows below all have a reference-corpus frequency of 0, which is why relative_risk and odds_ratio are reported as Inf:

| | word | freq_target_corpus | freq_reference_corpus | word_use | log_likelihood | ell | bic | perc_diff | relative_risk | log_ratio | odds_ratio |
|---|---|---|---|---|---|---|---|---|---|---|---|
| 7601 | weston | 389 | 0 | overuse | 1170.5395 | 0.0003623 | 1157.0461 | 2.416870e+17 | Inf | 11.411857 | Inf |
| 7613 | knightley | 356 | 0 | overuse | 1071.2392 | 0.0003383 | 1057.7458 | 2.211840e+17 | Inf | 11.283964 | Inf |
| 7622 | elton | 320 | 0 | overuse | 962.9117 | 0.0003117 | 949.4183 | 1.988170e+17 | Inf | 11.130159 | Inf |
| 7595 | woodhouse | 278 | 0 | overuse | 836.5295 | 0.0002800 | 823.0361 | 1.727223e+17 | Inf | 10.927172 | Inf |
| 7794 | fairfax | 210 | 0 | overuse | 631.9108 | 0.0002269 | 618.4174 | 1.304737e+17 | Inf | 10.522476 | Inf |
| 7625 | churchill | 193 | 0 | overuse | 580.7561 | 0.0002133 | 567.2627 | 1.199115e+17 | Inf | 10.400688 | Inf |
| 7605 | hartfield | 159 | 0 | overuse | 478.4467 | 0.0001852 | 464.9533 | 9.878722e+16 | Inf | 10.121113 | Inf |
| 7635 | bate | 126 | 0 | overuse | 379.1465 | 0.0001570 | 365.6531 | 7.828421e+16 | Inf | 9.785510 | Inf |
| 7607 | highbury | 125 | 0 | overuse | 376.1374 | 0.0001562 | 362.6440 | 7.766291e+16 | Inf | 9.774015 | Inf |
| 7684 | harriet's | 91 | 0 | overuse | 273.8280 | 0.0001257 | 260.3346 | 5.653860e+16 | Inf | 9.316025 | Inf |
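
Because relative_risk and odds_ratio are Inf for every word that is absent from the reference corpus, they cannot rank such words against each other, whereas log_ratio stays finite. The base R sketch below shows one possible way to work with that, using four rows copied from the results above. The object names `results_excerpt`, `absent_from_reference`, and `ranked` are hypothetical; with real output, the same indexing would be applied to the keyness_measures data frame.

```r
# Four rows copied from the results shown in this README
results_excerpt <- data.frame(
  word = c("emma", "harriet", "weston", "frank"),
  freq_reference_corpus = c(1, 4, 0, 9),
  relative_risk = c(2752.6701, 363.3455, Inf, 77.8250),
  log_ratio = c(11.426616, 8.505198, 11.411857, 6.282162),
  stringsAsFactors = FALSE
)

# Words absent from the reference corpus show an infinite relative risk
absent_from_reference <-
  results_excerpt[is.infinite(results_excerpt$relative_risk), ]
print(absent_from_reference$word)

# log_ratio is finite in every row, so it yields a complete ranking
ranked <- results_excerpt[order(results_excerpt$log_ratio, decreasing = TRUE), ]
print(ranked[, c("word", "log_ratio")])
```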