GitHub - amacanovic/KeynessMeasures: An R package for word keyness analysis using statistical significance and effect size measures

KeynessMeasures

An R package that implements multiple measures of word keyness used in computational linguistics. It provides both statistical significance and effect size measures through a few simple functions.

This package has been updated to support newer versions of the quanteda package (>v3). As a result, it is no longer compatible with earlier quanteda versions. The old functions, compatible with quanteda v2, can be found in the R folder, in the script "frequency_table_creator_old.R".

This package currently supports the following measures:

Statistical significance: Log-likelihood ratio, Bayesian Information Criterion.
Effect size: Effect Size of Log-Likelihood (ELL), %DIFF, Relative Risk, Log Ratio, Odds Ratio.

For more details on using the effect size measures, consult the vignette "KeynessMeasures: Introduction to effect size measures".

Installing the package

This package can be installed from GitHub using devtools. To install it, run:

devtools::install_github("amacanovic/KeynessMeasures")

Then, load the package:

library(KeynessMeasures)

For more information on included measures, type:

vignette("Effect_size_measures")

For a detailed tutorial vignette, type:

vignette("Keyness_Measures_Tutorial")

Demonstration

The keyness_measure_calculator() function returns several measures of word keyness for a target corpus relative to a reference corpus.

This demonstration uses the Jane Austen data from the janeaustenr package (Silge, 2017). We explore the key words in her novel "Emma" compared to her 5 other novels.

jane_austen_data <- janeaustenr::austen_books()

First, obtain the word frequencies in "Emma" (target corpus) and the other novels (reference corpus) using the frequency_table_creator() function:

frequency_table <- frequency_table_creator(jane_austen_data,
                                           text_field = "text",
                                           grouping_variable = "book",
                                           grouping_variable_target = "Emma",
                                           lemmatize = TRUE,
                                           remove_punct = TRUE,
                                           remove_symbols = TRUE,
                                           remove_numbers = TRUE,
                                           remove_url = TRUE)
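
To make the shape of the result concrete, here is a minimal base R sketch of a target-versus-reference frequency table built from a toy data frame. It is only an illustration: the toy data, the `count_words()` and `lookup()` helpers, and the regular-expression tokeniser are assumptions made for this example, not the package's implementation (which relies on quanteda and also supports lemmatisation and the remove_* options shown above). The column names follow the results table in this README.

```r
# Toy stand-in for janeaustenr::austen_books(): one row per line of text.
toy <- data.frame(
  text = c("Emma Woodhouse was handsome", "Emma was clever",
           "Elizabeth was lively", "Anne was gentle and Anne was kind"),
  book = c("Emma", "Emma", "Pride & Prejudice", "Persuasion"),
  stringsAsFactors = FALSE
)

# Hypothetical helper: lower-case, split on anything that is not a letter
# or apostrophe, and return a named integer vector of word counts.
count_words <- function(lines) {
  tokens <- unlist(strsplit(tolower(lines), "[^a-z']+"))
  counts <- table(tokens[tokens != ""])
  setNames(as.integer(counts), names(counts))
}

# Hypothetical helper: look up counts for a set of words, using 0 for
# words that do not occur in that corpus.
lookup <- function(counts, words) {
  out <- counts[words]
  out[is.na(out)] <- 0L
  unname(out)
}

is_target <- toy$book == "Emma"
target <- count_words(toy$text[is_target])
reference <- count_words(toy$text[!is_target])

words <- union(names(target), names(reference))
toy_frequency_table <- data.frame(
  word = words,
  freq_target_corpus = lookup(target, words),
  freq_reference_corpus = lookup(reference, words),
  stringsAsFactors = FALSE
)

print(toy_frequency_table)
```

Each row holds one word with its count in the target corpus and its count in all other books combined, which is the input the keyness measures are computed from.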

Then, calculate all the keyness measures using the keyness_measure_calculator() function, sorting the entries by decreasing log-likelihood:

keyness_measures <- keyness_measure_calculator(
  frequency_table,
  log_likelihood = TRUE,
  ell = TRUE,
  bic = TRUE,
  perc_diff = TRUE,
  relative_risk = TRUE,
  log_ratio = TRUE,
  odds_ratio = TRUE,
  sort = "decreasing",
  sort_by = "log_likelihood")
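
For readers who want to see what the two significance measures compute, the following is a minimal sketch of the log-likelihood ratio in its common two-cell corpus-linguistics form, and of the BIC approximation that subtracts the log of the combined corpus size from it. The function names `ll_sketch()` and `bic_sketch()`, the argument names, and the counts are hypothetical; the package's own implementation may differ in detail.

```r
# a, b:   frequency of the word in the target and reference corpus
# n1, n2: total number of tokens in the target and reference corpus
ll_sketch <- function(a, b, n1, n2) {
  # Expected frequencies if the word were used at the same rate in both corpora.
  e1 <- n1 * (a + b) / (n1 + n2)
  e2 <- n2 * (a + b) / (n1 + n2)
  # A zero observed count contributes nothing to the sum.
  term <- function(observed, expected) {
    if (observed > 0) observed * log(observed / expected) else 0
  }
  2 * (term(a, e1) + term(b, e2))
}

# BIC approximation: log-likelihood minus the log of the combined corpus size.
bic_sketch <- function(ll, n1, n2) ll - log(n1 + n2)

# Same relative frequency in both corpora (1% each).
ll_equal <- ll_sketch(10, 30, 1000, 3000)

# Word that occurs only in the target corpus.
ll_over <- ll_sketch(20, 0, 1000, 1000)
bic_over <- bic_sketch(ll_over, 1000, 1000)

print(c(ll_equal = ll_equal, ll_over = ll_over, bic_over = bic_over))
```

A word used at the same relative rate in both corpora scores 0, and the score grows as the observed counts move away from the expected ones. Because the BIC sketch only subtracts a constant for a given pair of corpora, it ranks words in the same order as the log-likelihood.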

Displaying the first 10 rows of results sorted by log likelihood values:

	word	freq_target_corpus	freq_reference_corpus	word_use	log_likelihood	ell	bic	perc_diff	relative_risk	log_ratio	odds_ratio
7594	emma	786	1	overuse	2350.3189	0.0006281	2336.8255	2.751670e+05	2752.6701	11.426616	2766.17371
5442	harriet	415	4	overuse	1205.6112	0.0003670	1192.1178	3.623455e+04	363.3455	8.505198	364.28214
7601	weston	389	0	overuse	1170.5395	0.0003623	1157.0461	2.416870e+17	Inf	11.411857	Inf
7613	knightley	356	0	overuse	1071.2392	0.0003383	1057.7458	2.211840e+17	Inf	11.283964	Inf
7622	elton	320	0	overuse	962.9117	0.0003117	949.4183	1.988170e+17	Inf	11.130159	Inf
7595	woodhouse	278	0	overuse	836.5295	0.0002800	823.0361	1.727223e+17	Inf	10.927172	Inf
7794	fairfax	210	0	overuse	631.9108	0.0002269	618.4174	1.304737e+17	Inf	10.522476	Inf
7625	churchill	193	0	overuse	580.7561	0.0002133	567.2627	1.199115e+17	Inf	10.400688	Inf
1727	frank	200	9	overuse	532.1223	0.0001913	518.6289	7.682500e+03	77.8250	6.282162	77.92058
7605	hartfield	159	0	overuse	478.4467	0.0001852	464.9533	9.878722e+16	Inf	10.121113	Inf
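
Several columns in this table are tied together by simple relations, which helps when reading it. For rows with a nonzero reference frequency, the values are consistent with %DIFF = (relative risk - 1) x 100 and Log Ratio = log2(relative risk), and the gap between log_likelihood and bic is the same in every row. The sketch below checks this on three rows copied from the table; the variable names are illustrative only, and the corpus sizes in the last step are hypothetical.

```r
# Three rows copied from the results table above (nonzero reference frequency).
rows <- data.frame(
  word = c("emma", "harriet", "frank"),
  log_likelihood = c(2350.3189, 1205.6112, 532.1223),
  bic = c(2336.8255, 1192.1178, 518.6289),
  perc_diff = c(2.751670e+05, 3.623455e+04, 7.682500e+03),
  relative_risk = c(2752.6701, 363.3455, 77.8250),
  log_ratio = c(11.426616, 8.505198, 6.282162),
  stringsAsFactors = FALSE
)

# Recompute two effect sizes from the relative risk column alone.
perc_diff_from_rr <- (rows$relative_risk - 1) * 100
log_ratio_from_rr <- log2(rows$relative_risk)

# Difference between the two significance measures, row by row.
ll_minus_bic <- rows$log_likelihood - rows$bic

print(cbind(rows["word"], perc_diff_from_rr, log_ratio_from_rr, ll_minus_bic))

# With a zero reference frequency, a plain ratio of relative frequencies
# divides by zero, which R reports as Inf (corpus sizes are hypothetical).
rr_zero_reference <- (389 / 1000) / (0 / 3000)
print(rr_zero_reference)
```

For words that never occur in the reference corpus, relative_risk and odds_ratio are Inf for this reason, while perc_diff and log_ratio stay finite. That pattern is consistent with a small constant being substituted for the zero in those two measures, although this README does not state which constants are used.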

Alternatively, sorting the words by the %DIFF measure brings words that never occur in the reference corpus to the top:

	word	freq_target_corpus	word_use	log_likelihood	ell	bic	perc_diff	relative_risk	log_ratio	odds_ratio
7601	weston	389	overuse	1170.5395	0.0003623	1157.0461	2.416870e+17	Inf	11.411857	Inf
7613	knightley	356	overuse	1071.2392	0.0003383	1057.7458	2.211840e+17	Inf	11.283964	Inf
7622	elton	320	overuse	962.9117	0.0003117	949.4183	1.988170e+17	Inf	11.130159	Inf
7595	woodhouse	278	overuse	836.5295	0.0002800	823.0361	1.727223e+17	Inf	10.927172	Inf
7794	fairfax	210	overuse	631.9108	0.0002269	618.4174	1.304737e+17	Inf	10.522476	Inf
7625	churchill	193	overuse	580.7561	0.0002133	567.2627	1.199115e+17	Inf	10.400688	Inf
7605	hartfield	159	overuse	478.4467	0.0001852	464.9533	9.878722e+16	Inf	10.121113	Inf
7635	bate	126	overuse	379.1465	0.0001570	365.6531	7.828421e+16	Inf	9.785510	Inf
7607	highbury	125	overuse	376.1374	0.0001562	362.6440	7.766291e+16	Inf	9.774015	Inf
7684	harriet's	91	overuse	273.8280	0.0001257	260.3346	5.653860e+16	Inf	9.316025	Inf
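
The README does not show the call that produced this second table. One way to get the same ordering without relying on any particular sort_by value is to re-order the results data frame with base R. The sketch below does this on a small stand-in data frame whose values are copied from the tables above; `results` and `by_perc_diff` are hypothetical names, and with the real output you would use keyness_measures in place of `results`.

```r
# Stand-in for the output of keyness_measure_calculator(), with a few
# rows and columns copied from the tables above.
results <- data.frame(
  word = c("emma", "harriet", "weston", "knightley"),
  log_likelihood = c(2350.3189, 1205.6112, 1170.5395, 1071.2392),
  perc_diff = c(2.751670e+05, 3.623455e+04, 2.416870e+17, 2.211840e+17),
  stringsAsFactors = FALSE
)

# Re-order the rows by decreasing %DIFF and show the top entries.
by_perc_diff <- results[order(results$perc_diff, decreasing = TRUE), ]
head(by_perc_diff, 10)
```

Sorted this way, "weston" and "knightley" move ahead of "emma" even though "emma" has the highest log-likelihood, which is the difference between the two tables in this section.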

