Updated docs
x-tabdeveloping committed Oct 22, 2024
1 parent d50794a commit cbf202e
Showing 6 changed files with 280 additions and 417 deletions.
71 changes: 28 additions & 43 deletions docs/GMM.md
These Gaussian components are assumed to be the topics.
<figcaption>Components of a Gaussian Mixture Model <br>(figure from scikit-learn documentation)</figcaption>
</figure>

## How does GMM work?

### Generative Modeling

GMM assumes that the embeddings are generated from a number of Gaussian components according to the stochastic process below; a small sampling sketch follows the formula.
Priors can optionally be imposed on the model parameters.
The model is fitted using either expectation-maximization or variational inference.

### 2. Topic Inference over Documents
??? info "Click to see formula"
1. Select global topic weights: $\Theta$
2. For each component, select mean $\mu_z$ and covariance matrix $\Sigma_z$.
3. For each document:
    - Draw topic label: $z \sim Categorical(\Theta)$
    - Draw document vector: $\rho \sim \mathcal{N}(\mu_z, \Sigma_z)$
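To make this concrete, here is a toy NumPy sketch of the same generative story (all parameter values below are made up for illustration, not learned by turftopic):

```python
import numpy as np

rng = np.random.default_rng(42)

n_topics, dim, n_docs = 3, 2, 5
theta = np.array([0.5, 0.3, 0.2])                     # global topic weights Θ
mus = rng.normal(size=(n_topics, dim))                # component means μ_z
sigmas = np.stack([np.eye(dim) * 0.1] * n_topics)     # component covariances Σ_z

documents = []
for _ in range(n_docs):
    z = rng.choice(n_topics, p=theta)                 # z ~ Categorical(Θ)
    rho = rng.multivariate_normal(mus[z], sigmas[z])  # ρ ~ N(μ_z, Σ_z)
    documents.append((z, rho))
```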


### Calculate Topic Probabilities

After the model is fitted, soft topic labels are inferred for each document.
A document-topic matrix ($T$) is built from the likelihoods of each component given the document encodings, as sketched below.

??? info "Click to see formula"
- For document $i$ and topic $z$ the matrix entry will be: $T_{iz} = p(\rho_i|\mu_z, \Sigma_z)$
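As a rough illustration, $T$ can be computed from already-fitted component parameters like this (turftopic does this internally; the helper below only serves to make the formula concrete):

```python
import numpy as np
from scipy.stats import multivariate_normal

def doc_topic_matrix(embeddings, means, covariances):
    """T[i, z] = p(rho_i | mu_z, Sigma_z) for every document-topic pair."""
    n_docs, n_topics = embeddings.shape[0], means.shape[0]
    T = np.zeros((n_docs, n_topics))
    for z in range(n_topics):
        T[:, z] = multivariate_normal(means[z], covariances[z]).pdf(embeddings)
    return T
```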

### Soft c-TF-IDF

Term importances for the discovered Gaussian components are estimated post hoc using a technique called __Soft c-TF-IDF__,
an extension of __c-TF-IDF__ that can be used with continuous labels; a short NumPy sketch follows the formula below.

??? info "Click to see formula"

Let $X$ be the document-term matrix where each element ($X_{ij}$) corresponds to the number of times word $j$ occurs in document $i$.
Soft class-based tf-idf scores for terms in a topic are then calculated in the following manner:

- Estimate weight of term $j$ for topic $z$: <br>
$tf_{zj} = \frac{t_{zj}}{w_z}$, where
$t_{zj} = \sum_i T_{iz} \cdot X_{ij}$ and
$w_{z}= \sum_i(|T_{iz}| \cdot \sum_j X_{ij})$ <br>
- Estimate inverse document/topic frequency for term $j$:
$idf_j = \log(\frac{N}{\sum_z |t_{zj}|})$, where
$N$ is the total number of documents.
- Calculate importance of term $j$ for topic $z$:
$\text{Soft-c-TF-IDF}_{zj} = tf_{zj} \cdot idf_j$
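The scheme above can be written in a few lines of NumPy; this is a simplified sketch, not turftopic's internal implementation:

```python
import numpy as np

def soft_ctf_idf(T, X):
    """T: (n_docs, n_topics) document-topic matrix, X: (n_docs, n_terms) term counts."""
    t = T.T @ X                                       # t[z, j] = sum_i T[i, z] * X[i, j]
    w = np.abs(T).T @ X.sum(axis=1)                   # w[z] = sum_i |T[i, z]| * sum_j X[i, j]
    tf = t / w[:, None]
    idf = np.log(X.shape[0] / np.abs(t).sum(axis=0))  # idf[j] = log(N / sum_z |t[z, j]|)
    return tf * idf[None, :]
```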

### Dynamic Modeling

GMM is also capable of dynamic topic modeling. This happens by fitting one underlying mixture model over the entire corpus, as we expect that there is only one semantic model generating the documents.
To gain temporal representations for topics, the corpus is divided into equal or arbitrarily chosen time slices, and term importances are then estimated using Soft-c-TF-IDF for each time slice separately.
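A rough sketch of this per-slice estimation, reusing the `soft_ctf_idf` helper from the sketch above (the `slice_ids` array assigning each document to a time slice is hypothetical):

```python
import numpy as np

def temporal_term_importances(T, X, slice_ids, soft_ctf_idf):
    """Compute one Soft-c-TF-IDF topic-term matrix per time slice."""
    return {
        s: soft_ctf_idf(T[slice_ids == s], X[slice_ids == s])
        for s in np.unique(slice_ids)
    }
```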
GMM can also be combined with dimensionality reduction, for example:

```python
from turftopic import GMM
from sklearn.decomposition import IncrementalPCA

model = GMM(20, dimensionality_reduction=IncrementalPCA(20))
```

## Considerations

### Strengths

- Efficiency, Stability: GMM relies on a rock-solid implementation in scikit-learn, so you can rest assured that the model will be fast and reliable.
- Coverage of Ingroup Variance: The model is very effective at describing the extracted topics in all their detail.
This means that the topic descriptions will typically cover most of the documents generated from the topic fairly well.
- Uncertainty: GMM is capable of expressing and modeling uncertainty around topic labels for documents.
- Dynamic Modeling: You can model changes in topics over time using GMM.

### Weaknesses

- Curse of Dimensionality: The dimensionality of embeddings can vary wildly from model to model. High-dimensional embeddings might decrease the efficiency and performance of GMM, as it is sensitive to the curse of dimensionality. Dimensionality reduction can help mitigate these issues.
- Assumption of Gaussianity: The model assumes that topics are Gaussian components, which might very well not be the case.
Fortunately, this rarely affects the perceived real-world performance of the model and typically does not present an issue in practical settings.
- Moderate Scalability: While the model is scalable to a certain extent, it is not nearly as scalable as some of the other options. If you experience issues with computational efficiency or convergence, try another model.
- Moderate Robustness to Noise: GMM is about as sensitive to noise and stop words as BERTopic, and can sometimes find noise components. Our experience indicates that GMM is far less volatile, and the quality of its results is more reliable than with clustering models using c-TF-IDF.


## API Reference

::: turftopic.models.gmm.GMM
144 changes: 65 additions & 79 deletions docs/KeyNMF.md
```python
model.fit(corpus)
model.print_topics()
```

## How does KeyNMF work?

### Keyword Extraction

KeyNMF discovers topics based on the importance of keywords for a given document.
This is done by embedding the words in a document and then extracting the cosine similarities of documents to words using a transformer model.
Only the `top_n` keywords with positive similarity are kept (a sketch of this step follows the formula below).

??? info "Click to see formula"
- For each document $d$:
1. Let $x_d$ be the document's embedding produced with the encoder model.
2. For each word $w$ in the document $d$:
1. Let $v_w$ be the word's embedding produced with the encoder model.
2. Calculate cosine similarity between word and document

$$
\text{sim}(d, w) = \frac{x_d \cdot v_w}{||x_d|| \cdot ||v_w||}
$$

3. Let $K_d$ be the set of $N$ keywords with the highest cosine similarity to document $d$.

$$
K_d = \text{argmax}_{K^*} \sum_{w \in K^*}\text{sim}(d,w)\text{, where }
|K_d| = N\text{, and } \\
w \in d
$$

- Arrange positive keyword similarities into a keyword matrix $M$ where the rows represent documents, and columns represent unique keywords.

$$
M_{dw} =
\begin{cases}
\text{sim}(d,w), & \text{if } w \in K_d \text{ and } \text{sim}(d,w) > 0 \\
0, & \text{otherwise}.
\end{cases}
$$
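A simplified sketch of this extraction step (the encoder choice and the whitespace tokenization below are illustrative assumptions, not necessarily what turftopic uses internally):

```python
import numpy as np
from sentence_transformers import SentenceTransformer

def keyword_matrix(corpus, encoder, top_n=10):
    vocab = sorted({w for doc in corpus for w in doc.lower().split()})
    word_idx = {w: i for i, w in enumerate(vocab)}
    doc_emb = encoder.encode(corpus)
    word_emb = encoder.encode(vocab)
    # Cosine similarity of every document to every vocabulary word
    doc_emb = doc_emb / np.linalg.norm(doc_emb, axis=1, keepdims=True)
    word_emb = word_emb / np.linalg.norm(word_emb, axis=1, keepdims=True)
    sim = doc_emb @ word_emb.T
    M = np.zeros_like(sim)
    for d, doc in enumerate(corpus):
        in_doc = [word_idx[w] for w in set(doc.lower().split())]
        # Keep the document's top_n words with positive similarity
        for j in sorted(in_doc, key=lambda k: sim[d, k], reverse=True)[:top_n]:
            if sim[d, j] > 0:
                M[d, j] = sim[d, j]
    return M, vocab

# Example: M, vocab = keyword_matrix(corpus, SentenceTransformer("all-MiniLM-L6-v2"))
```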

You can do this step manually if you want to precompute the keyword matrix.
Keywords are represented as dictionaries mapping words to keyword importances.
```python
keyword_matrix = model.extract_keywords(corpus)
model.fit(None, keywords=keyword_matrix)
```

### Topic Discovery

Topics in this matrix are then discovered using Non-negative Matrix Factorization.
Essentially, the model tries to discover underlying dimensions/factors along which most of the variance in term importance
can be explained; a minimal scikit-learn sketch follows the formula below.

??? info "Click to see formula"

- Decompose $M$ with non-negative matrix factorization: $M \approx WH$, where $W$ is the document-topic matrix, and $H$ is the topic-term matrix. Non-negative Matrix Factorization is done with the coordinate-descent algorithm, minimizing square loss:


$$
L(W,H) = ||M - WH||^2
$$
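This is essentially what scikit-learn's `NMF` computes with its coordinate-descent solver; a minimal sketch on a stand-in keyword matrix:

```python
import numpy as np
from sklearn.decomposition import NMF

# Stand-in keyword matrix; in practice this is the matrix M built in the previous step
M = np.abs(np.random.default_rng(0).normal(size=(100, 500)))

nmf = NMF(n_components=10, init="nndsvd")
W = nmf.fit_transform(M)  # document-topic matrix
H = nmf.components_       # topic-term matrix
```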

You can fit KeyNMF on the raw corpus, with precomputed embeddings or with precomputed keywords.
```python
# Fitting just on the corpus
model.fit(corpus)

# Fitting with a precomputed keyword matrix
keyword_matrix = model.extract_keywords(corpus)
model.fit(None, keywords=keyword_matrix)
```

### Asymmetric and Instruction-tuned Embedding Models

Some embedding models can be used together with prompting, or encode queries and passages differently.
This is important for KeyNMF, as it is explicitly based on keyword retrieval, and its performance can be substantially enhanced by using asymmetric or prompted embeddings.
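For instance, a prompted encoder can be set up roughly like this (the model name and prompt strings are illustrative assumptions; `prompts` and `default_prompt_name` are standard `SentenceTransformer` constructor arguments):

```python
from sentence_transformers import SentenceTransformer
from turftopic import KeyNMF

encoder = SentenceTransformer(
    "intfloat/multilingual-e5-large-instruct",
    prompts={
        # Hypothetical prompt strings; adjust them to your model's recommended format
        "query": "Instruct: Retrieve relevant keywords from the given document. Query: ",
        "passage": "Passage: ",
    },
    default_prompt_name="query",
)
model = KeyNMF(10, encoder=encoder)
```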
Setting the default prompt to `query` is especially important when you are precomputing embeddings, as `query` should always be your default prompt to embed documents with.


### Dynamic Topic Modeling

KeyNMF is also capable of modeling topics over time.
This happens by first fitting a KeyNMF model on the entire corpus, then
fitting individual topic-term matrices using coordinate descent based on the document-topic and document-term matrices in the given time slices.

??? info "Click to see formula"

1. Compute keyword matrix $M$ for the whole corpus.
2. Decompose $M$ with non-negative matrix factorization: $M \approx WH$.
3. For each time slice $t$:
1. Let $W_t$ be the document-topic proportions for documents in time slice $t$, and $M_t$ be the keyword matrix for words in time slice $t$.
2. Obtain the topic-term matrix for the time slice, by minimizing square loss using coordinate descent and fixing $W_t$:

$$
H_t = \text{argmin}_{H^{*}} ||M_t - W_t H^{*}||^2
$$

Here's an example of using KeyNMF in a dynamic modeling setting:

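A minimal sketch of such a workflow; the `fit_transform_dynamic` call and its arguments are assumptions about turftopic's dynamic API, while `print_topics_over_time()` is the call that produces the table below:

```python
from datetime import datetime

from turftopic import KeyNMF

corpus: list[str] = [...]           # your documents (placeholder)
timestamps: list[datetime] = [...]  # one timestamp per document (placeholder)

model = KeyNMF(5, top_n=5)
# fit_transform_dynamic and its arguments are assumed, not verified against the API
document_topic_matrix = model.fit_transform_dynamic(corpus, timestamps=timestamps, bins=10)
model.print_topics_over_time()
```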
| - | - | - | - | - | - |
| 2012 12 06 - 2013 11 10 | genocide, yugoslavia, karadzic, facts, cnn | cnn, russia, chechnya, prince, merkel | france, cnn, francois, hollande, bike | tennis, tournament, wimbledon, grass, courts | beckham, soccer, retired, david, learn |
| 2013 11 10 - 2014 10 14 | keith, stones, richards, musician, author | georgia, russia, conflict, 2008, cnn | civil, rights, hear, why, should | cnn, kidneys, traffickers, organ, nepal | ronaldo, cristiano, goalscorer, soccer, player |
| 2014 10 14 - 2015 09 18 | ethiopia, brew, coffee, birthplace, anderson | climate, sutter, countries, snapchat, injustice | women, guatemala, murder, country, worst | cnn, climate, oklahoma, women, topics | sweden, parental, dads, advantage, leave |
| 2015 09 18 - 2016 08 22 | snow, ice, winter, storm, pets | climate, crisis, drought, outbreaks, syrian | women, vulnerabilities, frontlines, countries, marcelas | cnn, warming, climate, sutter, theresa | sutter, band, paris, fans, crowd |
| 2016 08 22 - 2017 07 26 | derby, epsom, sporting, race, spectacle | overdoses, heroin, deaths, macron, emmanuel | fear, died, indigenous, people, arthur | siblings, amnesia, palombo, racial, mh370 | bobbi, measles, raped, camp, rape |
| 2017 07 26 - 2018 06 30 | her, percussionist, drums, she, deported | novichok, hurricane, hospital, deaths, breathing | women, day, celebrate, taliban, international | abuse, harassment, cnn, women, pilgrimage | maradona, argentina, history, jadon, rape |
| 2018 06 30 - 2019 06 03 | athletes, teammates, celtics, white, racism | pope, archbishop, francis, vigano, resignation | racism, athletes, teammates, celtics, white | golf, iceland, volcanoes, atlantic, ocean | rape, sudanese, racist, women, soldiers |
| 2019 06 03 - 2020 05 07 | esports, climate, ice, racers, culver | esports, coronavirus, pandemic, football, teams | racers, women, compete, zone, bery | serena, stadium, sasha, final, naomi | kobe, bryant, greatest, basketball, influence |
| 2020 05 07 - 2021 04 10 | olympics, beijing, xinjiang, ioc, boycott | covid, vaccine, coronavirus, pandemic, vaccination | olympic, japan, medalist, canceled, tokyo | djokovic, novak, tennis, federer, masterclass | ronaldo, cristiano, messi, juventus, barcelona |
| 2021 04 10 - 2022 03 16 | olympics, tokyo, athletes, beijing, medal | covid, pandemic, vaccine, vaccinated, coronavirus | olympic, athletes, ioc, medal, athlete | djokovic, novak, tennis, wimbledon, federer | ronaldo, cristiano, messi, manchester, scored |

You can also plot the topics over time:

```python
model.plot_topics_over_time(top_k=5)
```

<figure>
<img src="../images/dynamic_keynmf.png" width="50%" style="margin-left: auto;margin-right: auto;">
<figcaption>Topics over time on a figure</figcaption>
</figure>

### Online Topic Modeling

KeyNMF can also be fitted in an online manner.
This is done by fitting NMF with batches of data instead of the whole dataset at once.
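A minimal sketch of plain online fitting, assuming KeyNMF exposes a `partial_fit` method and using Python 3.12's `itertools.batched` as the batching helper:

```python
from itertools import batched  # Python 3.12+; otherwise define a small batching helper

from turftopic import KeyNMF

corpus: list[str] = [...]  # your documents (placeholder)

model = KeyNMF(10, top_n=5)
for text_batch in batched(corpus, 200):
    model.partial_fit(list(text_batch))
```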
```python
# Online *dynamic* fitting: feed batches of (document, timestamp) pairs.
# `corpus`, `timestamps`, `bins`, and the `batched` helper are assumed to be defined above.
for batch in batched(zip(corpus, timestamps)):
    text_batch, ts_batch = zip(*batch)  # split each batch back into texts and timestamps
    model.partial_fit_dynamic(text_batch, timestamps=ts_batch, bins=bins)
```

### Hierarchical Topic Modeling

When you suspect that subtopics might be present in the topics you find with the model, KeyNMF can be used to discover topics further down the hierarchy.

This is done by utilising a special case of **weighted NMF**, where documents are weighted by how high they score on the parent topic.
In other words (a NumPy sketch of these updates follows the formula below):


??? info "Click to see formula"
1. Decompose keyword matrix $M \approx WH$
2. To find subtopics in topic $j$, define document weights $w$ as the $j$th column of $W$.
3. Estimate subcomponents with **wNMF** $M \approx \mathring{W} \mathring{H}$ with document weight $w$
1. Initialise $\mathring{H}$ and $\mathring{W}$ randomly.
2. Perform multiplicative updates until convergence. <br>
$\mathring{W}^T = \mathring{W}^T \odot \frac{\mathring{H} \cdot (M^T \odot w)}{\mathring{H} \cdot \mathring{H}^T \cdot (\mathring{W}^T \odot w)}$ <br>
$\mathring{H}^T = \mathring{H}^T \odot \frac{ (M^T \odot w)\cdot \mathring{W}}{\mathring{H}^T \cdot (\mathring{W}^T \odot w) \cdot \mathring{W}}$
4. To sufficiently differentiate the subcomponents from each other, a pseudo-c-TF-IDF weighting scheme is applied to $\mathring{H}$:
    1. $\mathring{H}_{ij} = \mathring{H}_{ij} \cdot \ln(1 + \frac{A}{1+\sum_k \mathring{H}_{kj}})$, where $A$ is the average of all elements in $\mathring{H}$
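A compact NumPy sketch of weighted NMF with multiplicative updates, written in the more common diagonal-weighting form rather than the exact notation above (illustrative only, not turftopic's implementation):

```python
import numpy as np

def weighted_nmf(M, w, k, n_iter=200, eps=1e-9, seed=0):
    """Minimise sum_i w_i * ||M_i - (W H)_i||^2 with multiplicative updates."""
    rng = np.random.default_rng(seed)
    n_docs, n_terms = M.shape
    W = rng.random((n_docs, k))
    H = rng.random((k, n_terms))
    Mw = M * w[:, None]  # each document row scaled by its weight
    for _ in range(n_iter):
        W *= (Mw @ H.T) / ((W * w[:, None]) @ H @ H.T + eps)
        H *= (W.T @ Mw) / (W.T @ (w[:, None] * (W @ H)) + eps)
    # Pseudo-c-TF-IDF reweighting of the subtopic-term matrix (step 4 above)
    A = H.mean()
    H = H * np.log(1 + A / (1 + H.sum(axis=0)))
    return W, H
```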

To create a hierarchical model, you can use the `hierarchy` property of the model.

For a detailed tutorial on hierarchical modeling click [here](hierarchical.md).

## Considerations

### Strengths

- Stability, Robustness and Quality: KeyNMF extracts very clean topics even when a lot of noise is present in the corpus, and the model's performance remains relatively stable across domains.
- Scalability: The model can be fitted in an online fashion, and we recommend that you choose KeyNMF when the number of documents is large (over 100 000).
- Fail-safe and Adjustable: Since the modelling process consists of multiple easily separable steps, it is easy to repeat one if something goes wrong. This also makes it an ideal choice for production use.
- Multiple Topics per Document: The model can capture multiple topics in a single document.

### Weaknesses

- Lack of Nuance: Since only the top K keywords are considered and used for topic extraction, some of the nuance, especially in long texts, might get lost. We therefore recommend that you scale K with the average length of the texts you're working with. For tweets it might be worth scaling it down to 5, while for longer documents a larger number (say, 50) might be advisable.
- Fixed Number of Topics: Practitioners have to choose the number of topics a priori.

## API Reference

::: turftopic.models.keynmf.KeyNMF
