An R package for word keyness analysis using statistical significance and effect size measures

amacanovic/KeynessMeasures

KeynessMeasures

An R package that implements multiple measures of word keyness used in computational linguistics. It provides both statistical significance and effect size measures through a few simple functions.

This package has been updated to support newer versions of the quanteda package (>v3). As a result, it is no longer compatible with earlier quanteda versions. The old functions, compatible with quanteda v2, can be found in the folder R, in the script "frequency_table_creator_old.R".
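
Because compatibility depends on the installed quanteda version, it can help to check that version before loading the package. The following is a minimal sketch in base R. The helper `meets_min_version()` is hypothetical (it is not part of KeynessMeasures), and the threshold "3.0.0" is an assumption about what ">v3" means in practice.

```r
# Hypothetical helper (not part of KeynessMeasures): returns TRUE if the
# package is installed and its version is at least `min_version`.
meets_min_version <- function(pkg, min_version) {
  requireNamespace(pkg, quietly = TRUE) &&
    utils::packageVersion(pkg) >= min_version
}

# Assumed threshold: "3.0.0" stands in for the ">v3" requirement.
if (!meets_min_version("quanteda", "3.0.0")) {
  message("quanteda 3.x not found; the current functions may not work. ",
          "See R/frequency_table_creator_old.R for the quanteda v2 versions.")
}
```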

This package currently supports the following measures:

  1. Statistical significance: Log-likelihood ratio, Bayesian Information Criterion.

  2. Effect size: Effect Size of Log Likelihood, %DIFF, The Relative Risk, The Log Ratio measure, The Odds Ratio.
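
To make the measures above concrete, here is a minimal sketch of their commonly cited textbook definitions, computed for a single word. It is an illustration under stated assumptions, not the package's own code: the function name `keyness_sketch()` and the example counts are made up, and the package may handle edge cases differently. In particular, this sketch returns `Inf` for all ratio-based measures when the reference frequency is zero, whereas the demonstration output further down shows finite %DIFF and Log Ratio values in that case, so the package evidently applies an adjustment that is omitted here.

```r
# a, b: frequency of the word in the target / reference corpus
# c, d: total number of tokens in the target / reference corpus
keyness_sketch <- function(a, b, c, d) {
  # Expected frequencies if the word were equally common in both corpora
  e1 <- c * (a + b) / (c + d)
  e2 <- d * (a + b) / (c + d)

  # x * log(x / e), with the term taken as 0 when x is 0
  ll_term <- function(x, e) if (x == 0) 0 else x * log(x / e)
  ll <- 2 * (ll_term(a, e1) + ll_term(b, e2))

  # Normalised (relative) frequencies
  nf_target <- a / c
  nf_reference <- b / d
  rr <- nf_target / nf_reference

  list(
    log_likelihood = ll,
    bic            = ll - log(c + d),                      # LL minus log of total corpus size
    ell            = ll / ((c + d) * log(min(e1, e2))),    # Effect Size of Log Likelihood
    perc_diff      = (nf_target - nf_reference) / nf_reference * 100,
    relative_risk  = rr,
    log_ratio      = log2(rr),
    odds_ratio     = (a / (c - a)) / (b / (d - b))
  )
}

# Made-up counts: 50 of 10,000 target tokens vs. 20 of 40,000 reference tokens
keyness_sketch(a = 50, b = 20, c = 10000, d = 40000)
```

Note that the Relative Risk, %DIFF, and Log Ratio are all transformations of the same ratio of normalised frequencies, so they rank words identically wherever no zero-frequency adjustment is involved.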

For more details on using the effect size measures, consult the vignette "KeynessMeasures: Introduction to effect size measures".

Installing the package

This package can be installed from GitHub using devtools:

devtools::install_github("amacanovic/KeynessMeasures")

Then, load the package:

library(KeynessMeasures)

For more information on included measures, type:

vignette("Effect_size_measures")

For a detailed tutorial vignette, type:

vignette("Keyness_Measures_Tutorial")

Demonstration

Using the keyness_measure_calculator() function, one can easily obtain several measures of word keyness in a target corpus compared to the reference corpus.

This demonstration uses the Jane Austen data from the janeaustenr package (Silge, 2017). We will explore key words in her novel "Emma" compared to the 5 other novels in the data set.

jane_austen_data <- janeaustenr::austen_books()

First, obtaining the word frequencies in "Emma" (target corpus) and the other novels (reference corpus) using the frequency_table_creator() function:

frequency_table <- frequency_table_creator(jane_austen_data,
                                           text_field = "text",
                                           grouping_variable = "book",
                                           grouping_variable_target = "Emma",
                                           lemmatize = TRUE,
                                           remove_punct = TRUE,
                                           remove_symbols = TRUE,
                                           remove_numbers = TRUE,
                                           remove_url = TRUE)

Then, calculating all the keyness measures using the keyness_measure_calculator() function, sorting the entries by highest values of log likelihood:

keyness_measures <- keyness_measure_calculator(
  frequency_table,
  log_likelihood = TRUE,
  ell = TRUE,
  bic = TRUE,
  perc_diff = TRUE,
  relative_risk = TRUE,
  log_ratio = TRUE,
  odds_ratio = TRUE,
  sort = "decreasing",
  sort_by = "log_likelihood")

Displaying the first 10 rows of results sorted by log likelihood values:

          word freq_target_corpus freq_reference_corpus word_use log_likelihood       ell       bic    perc_diff relative_risk log_ratio odds_ratio
7594      emma                786                     1  overuse      2350.3189 0.0006281 2336.8255 2.751670e+05     2752.6701 11.426616 2766.17371
5442   harriet                415                     4  overuse      1205.6112 0.0003670 1192.1178 3.623455e+04      363.3455  8.505198  364.28214
7601    weston                389                     0  overuse      1170.5395 0.0003623 1157.0461 2.416870e+17           Inf 11.411857        Inf
7613 knightley                356                     0  overuse      1071.2392 0.0003383 1057.7458 2.211840e+17           Inf 11.283964        Inf
7622     elton                320                     0  overuse       962.9117 0.0003117  949.4183 1.988170e+17           Inf 11.130159        Inf
7595 woodhouse                278                     0  overuse       836.5295 0.0002800  823.0361 1.727223e+17           Inf 10.927172        Inf
7794   fairfax                210                     0  overuse       631.9108 0.0002269  618.4174 1.304737e+17           Inf 10.522476        Inf
7625 churchill                193                     0  overuse       580.7561 0.0002133  567.2627 1.199115e+17           Inf 10.400688        Inf
1727     frank                200                     9  overuse       532.1223 0.0001913  518.6289 7.682500e+03       77.8250  6.282162   77.92058
7605 hartfield                159                     0  overuse       478.4467 0.0001852  464.9533 9.878722e+16           Inf 10.121113        Inf

Instead, sorting the words by the %DIFF measure:

          word freq_target_corpus freq_reference_corpus word_use log_likelihood       ell       bic    perc_diff relative_risk log_ratio odds_ratio
7601    weston                389                     0  overuse      1170.5395 0.0003623 1157.0461 2.416870e+17           Inf 11.411857        Inf
7613 knightley                356                     0  overuse      1071.2392 0.0003383 1057.7458 2.211840e+17           Inf 11.283964        Inf
7622     elton                320                     0  overuse       962.9117 0.0003117  949.4183 1.988170e+17           Inf 11.130159        Inf
7595 woodhouse                278                     0  overuse       836.5295 0.0002800  823.0361 1.727223e+17           Inf 10.927172        Inf
7794   fairfax                210                     0  overuse       631.9108 0.0002269  618.4174 1.304737e+17           Inf 10.522476        Inf
7625 churchill                193                     0  overuse       580.7561 0.0002133  567.2627 1.199115e+17           Inf 10.400688        Inf
7605 hartfield                159                     0  overuse       478.4467 0.0001852  464.9533 9.878722e+16           Inf 10.121113        Inf
7635      bate                126                     0  overuse       379.1465 0.0001570  365.6531 7.828421e+16           Inf  9.785510        Inf
7607  highbury                125                     0  overuse       376.1374 0.0001562  362.6440 7.766291e+16           Inf  9.774015        Inf
7684 harriet's                 91                     0  overuse       273.8280 0.0001257  260.3346 5.653860e+16           Inf  9.316025        Inf
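
As the results above show, words that never occur in the reference corpus get an infinite Relative Risk and Odds Ratio and a very large %DIFF, so they fill the top of any %DIFF ranking. One possible way to look past them, sketched here in base R, is to keep only words attested in the reference corpus before sorting. The data frame below is hand-copied from a few rows and columns of the output above, and the name `keyness_subset` is made up for this illustration. The filter is only an example of post-processing, not a feature or recommendation of the package.

```r
# A few rows copied from the results above, with a subset of the columns
keyness_subset <- data.frame(
  word                  = c("emma", "harriet", "weston", "frank"),
  freq_target_corpus    = c(786, 415, 389, 200),
  freq_reference_corpus = c(1, 4, 0, 9),
  perc_diff             = c(2.751670e+05, 3.623455e+04, 2.416870e+17, 7.682500e+03),
  relative_risk         = c(2752.6701, 363.3455, Inf, 77.8250),
  stringsAsFactors      = FALSE
)

# Keep only words that occur at least once in the reference corpus
attested <- keyness_subset[keyness_subset$freq_reference_corpus > 0, ]

# Sort the remaining words by %DIFF, largest first
attested_sorted <- attested[order(attested$perc_diff, decreasing = TRUE), ]
attested_sorted
```

The same two steps should apply to the full keyness_measures object, assuming it is a regular data frame with the column names shown in the output.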
