Stopwords by Information Content
This page explains the algorithm for identifying stopwords in a corpus by computing the information content of its words, as given in the paper "A universal information theoretic approach to the identification of stopwords" by Martin Gerlach, Hanyu Shi, and Luís A. Nunes Amaral (2019).
Following the notation in the paper, we suppose we have a corpus C of documents. We presuppose some way of tokenizing each document into "words". We let n(w,d) be the number of occurrences of word w in document d.
We can think of the counts as a kind of 2-dimensional matrix. And we have sums for each row and column:
n(d) = sum { n(w,d) : for all w } is the number of tokens in document d
n(w) = sum { n(w,d) : for all d } is the number of occurrences of w in the corpus
The authors let D be the number of documents in the corpus and N be the total number of tokens over all documents (i.e. N = sum { n(w,d) : for all w, d }).
From the counts we can compute the probability of a document d given a word w as
p(d | w) = n(w,d) / n(w).
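To make the definitions concrete, here is a minimal sketch in Python that builds the counts n(w,d), n(d), and n(w) and evaluates p(d | w) for a toy corpus. The example documents, the whitespace tokenizer, and all variable names are illustrative choices, not part of the paper.

```python
from collections import Counter

# Toy corpus: each document is already tokenized into "words".
corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "quantum entanglement defies classical intuition".split(),
]

# n(w,d): occurrences of word w in document d (one Counter per document)
n_wd = [Counter(doc) for doc in corpus]

# n(d): number of tokens in document d
n_d = [sum(counts.values()) for counts in n_wd]

# n(w): occurrences of w over the whole corpus
n_w = Counter()
for counts in n_wd:
    n_w.update(counts)

def p_d_given_w(w):
    """p(d | w) = n(w,d) / n(w), as a list indexed by document."""
    return [counts[w] / n_w[w] for counts in n_wd]

print(p_d_given_w("the"))  # [0.5, 0.5, 0.0] for this toy corpus
```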
The idea of the paper is to make a stop word list by finding the words in a corpus that do not carry much information. "Not much information" can describe both words that are spread indiscriminately over all the documents and rare words that are found in only a single document.
The classical way of measuring information content is with entropy, defined as
H(w | C) = - sum { p(d | w) log p(d | w) : for all d in C }
Entropy is a non-negative real number. (In this situation the maximum entropy is log(D), where D is the number of documents in the corpus.)
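As a sketch of the definition, the observed entropy H(w | C) can be computed directly from the counts. The corpus and names below are the same illustrative ones as above, not from the paper.

```python
import math
from collections import Counter

corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "quantum entanglement defies classical intuition".split(),
]
n_wd = [Counter(doc) for doc in corpus]
n_w = Counter()
for counts in n_wd:
    n_w.update(counts)

def H(w):
    """H(w | C) = -sum over d of p(d | w) * log p(d | w), with 0 log 0 taken as 0."""
    h = 0.0
    for counts in n_wd:
        p = counts[w] / n_w[w]
        if p > 0:
            h -= p * math.log(p)
    return h

D = len(corpus)
print(H("the"), math.log(D))  # the entropy of "the" versus the maximum log(D)
```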
The problem with just using entropy is that a word that appears randomly across the whole corpus would have a high entropy.
On the other hand, a word appearing in exactly one document would have an entropy of 0.
The authors address this by asking what the entropy of a word would be if all the tokens in the corpus were shuffled randomly between documents. They call this measure H~. Since there are many ways to shuffle the tokens randomly, they take the average over all shuffles and denote it by <H~(w | C)>.
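One way to make <H~(w | C)> concrete is to estimate it by Monte Carlo: reassign every token to a random document (keeping the document sizes fixed), recompute the entropy of the word, and average over many shuffles. The sketch below does that; the corpus, the number of shuffles, and the names are illustrative.

```python
import math
import random

corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "quantum entanglement defies classical intuition".split(),
]
doc_sizes = [len(doc) for doc in corpus]
all_tokens = [tok for doc in corpus for tok in doc]

def entropy_of_word(docs, w):
    """H(w) for a particular assignment of tokens to documents."""
    counts = [doc.count(w) for doc in docs]
    total = sum(counts)
    h = 0.0
    for c in counts:
        if c > 0:
            p = c / total
            h -= p * math.log(p)
    return h

def shuffled_entropy(w, n_shuffles=1000):
    """Monte Carlo estimate of <H~(w | C)>: average entropy of w after
    randomly redistributing all tokens among documents of the same sizes."""
    total = 0.0
    for _ in range(n_shuffles):
        tokens = all_tokens[:]
        random.shuffle(tokens)
        docs, start = [], 0
        for size in doc_sizes:
            docs.append(tokens[start:start + size])
            start += size
        total += entropy_of_word(docs, w)
    return total / n_shuffles

print(shuffled_entropy("the"))  # estimate of <H~("the" | C)> for the toy corpus
```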
What we want to look for are words that have a big difference between their entropy H and the average over the random models <H~>:
I(w | C) = <H~(w | C)> - H(w | C)
The random model is time-consuming to compute, so the authors provide an approximation:
<H~(w | C)> ≈ log( D (1 - exp( -n(w) / D )) )
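Here is a small sketch of the approximation as a function, following the expression above, with D the number of documents and n(w) the corpus count of the word. The function name and the example numbers are mine.

```python
import math

def expected_shuffled_entropy(n_w, D):
    """Approximation of <H~(w | C)> given the corpus count n(w) and D documents."""
    return math.log(D * (1.0 - math.exp(-n_w / D)))

# A word with 4 occurrences in a 3-document corpus:
print(expected_shuffled_entropy(4, 3))  # ~0.79, versus the maximum log(3) ~ 1.10
```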
Putting this together, the procedure is:

- Compute n(w,d) over all documents d.
- Calculate n(w) and H(w | C) for all words w appearing in the corpus.
- Compute I(w | C) for all words w appearing in the corpus, using the approximation above for <H~(w | C)>.
- Pick a threshold L > 0. For every word w, add w to the stop word list if I(w | C) is less than L.
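Here is a minimal end-to-end sketch of those steps. It assumes the documents are already tokenized and uses the analytic approximation for <H~(w | C)>; the function name, the toy corpus, and the threshold value are illustrative, not from the paper.

```python
import math
from collections import Counter

def stopwords_by_information(corpus, threshold):
    """Return the words w with I(w | C) below the given threshold L > 0."""
    D = len(corpus)
    n_wd = [Counter(doc) for doc in corpus]          # n(w,d)
    n_w = Counter()
    for counts in n_wd:
        n_w.update(counts)                           # n(w)

    stopwords = []
    for w, nw in n_w.items():
        # Observed entropy H(w | C)
        H = 0.0
        for counts in n_wd:
            p = counts[w] / nw
            if p > 0:
                H -= p * math.log(p)
        # Approximate expected entropy under random shuffling, <H~(w | C)>
        H_tilde = math.log(D * (1.0 - math.exp(-nw / D)))
        # Information content I(w | C) = <H~(w | C)> - H(w | C)
        if H_tilde - H < threshold:
            stopwords.append(w)
    return stopwords

corpus = [
    "the cat sat on the mat".split(),
    "the dog chased the cat".split(),
    "quantum entanglement defies classical intuition".split(),
]
# On a corpus this tiny nearly every word looks "random", so the output is not
# meaningful; the method is intended for large corpora with many documents.
print(stopwords_by_information(corpus, threshold=0.1))
```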