From cbf202e095ca6e6046bfa9c95eca323ebd13c05a Mon Sep 17 00:00:00 2001 From: =?UTF-8?q?M=C3=A1rton=20Kardos?= Date: Tue, 22 Oct 2024 11:23:03 +0200 Subject: [PATCH] Updated docs --- docs/GMM.md | 71 ++++------ docs/KeyNMF.md | 144 +++++++++---------- docs/basics.md | 12 ++ docs/clustering.md | 304 ++++++++++++++++++++++++----------------- docs/ctm.md | 14 -- docs/model_overview.md | 152 --------------------- 6 files changed, 280 insertions(+), 417 deletions(-) diff --git a/docs/GMM.md b/docs/GMM.md index 89b941e..6fcc653 100644 --- a/docs/GMM.md +++ b/docs/GMM.md @@ -9,47 +9,51 @@ These Gaussian components are assumed to be the topics.
Components of a Gaussian Mixture Model
(figure from scikit-learn documentation)
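+Before digging into how the model works, here is a minimal usage sketch. It assumes GMM follows the same `fit()`/`transform()`/`print_topics()` interface as the other Turftopic models shown in these docs, and `corpus` is a placeholder for your own list of documents:
+
+```python
+from turftopic import GMM
+
+corpus = ["..."]  # your documents go here
+
+model = GMM(10)
+document_topic_matrix = model.fit_transform(corpus)
+model.print_topics()
+
+# Soft topic probabilities for new, unseen documents
+new_probabilities = model.transform(["Some new document."])
+```
+
+If that assumption holds, the returned matrices correspond to the document-topic matrix $T$ described in the sections below.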
-## The Model
+## How does GMM work?
-### 1. Generative Modeling
-
-GMM assumes that the embeddings are generated according to the following stochastic process:
-
-1. Select global topic weights: $\Theta$
-2. For each component select mean $\mu_z$ and covariance matrix $\Sigma_z$ .
-3. For each document:
-    - Draw topic label: $z \sim Categorical(\Theta)$
-    - Draw document vector: $\rho \sim \mathcal{N}(\mu_z, \Sigma_z)$
+### Generative Modeling
+GMM assumes that the embeddings are generated from a number of Gaussian components according to the stochastic process below. Priors are optionally imposed on the model parameters. The model is fitted either using expectation maximization or variational inference.
-### 2. Topic Inference over Documents
+??? info "Click to see formula"
+    1. Select global topic weights: $\Theta$
+    2. For each component, select a mean $\mu_z$ and a covariance matrix $\Sigma_z$.
+    3. For each document:
+        - Draw topic label: $z \sim Categorical(\Theta)$
+        - Draw document vector: $\rho \sim \mathcal{N}(\mu_z, \Sigma_z)$
+
+
+### Calculate Topic Probabilities
 After the model is fitted, soft topic labels are inferred for each document.
 A document-topic-matrix ($T$) is built from the likelihoods of each component given the document encodings.
-Or in other words for document $i$ and topic $z$ the matrix entry will be: $T_{iz} = p(\rho_i|\mu_z, \Sigma_z)$
+??? info "Click to see formula"
+    - For document $i$ and topic $z$ the matrix entry will be: $T_{iz} = p(\rho_i|\mu_z, \Sigma_z)$
 ### 3. Soft c-TF-IDF
 Term importances for the discovered Gaussian components are estimated post-hoc using a technique called __Soft c-TF-IDF__, an extension of __c-TF-IDF__, that can be used with continuous labels.
-Let $X$ be the document term matrix where each element ($X_{ij}$) corresponds with the number of times word $j$ occurs in a document $i$.
-Soft Class-based tf-idf scores for terms in a topic are then calculated in the following manner:
+??? info "Click to see formula"
+
+    Let $X$ be the document term matrix where each element ($X_{ij}$) corresponds with the number of times word $j$ occurs in a document $i$.
+    Soft Class-based tf-idf scores for terms in a topic are then calculated in the following manner:
-- Estimate weight of term $j$ for topic $z$:<br>
-$tf_{zj} = \frac{t_{zj}}{w_z}$, where -$t_{zj} = \sum_i T_{iz} \cdot X_{ij}$ and -$w_{z}= \sum_i(|T_{iz}| \cdot \sum_j X_{ij})$
-- Estimate inverse document/topic frequency for term $j$: -$idf_j = log(\frac{N}{\sum_z |t_{zj}|})$, where -$N$ is the total number of documents. -- Calculate importance of term $j$ for topic $z$: -$Soft-c-TF-IDF{zj} = tf_{zj} \cdot idf_j$ + - Estimate weight of term $j$ for topic $z$:
+ $tf_{zj} = \frac{t_{zj}}{w_z}$, where + $t_{zj} = \sum_i T_{iz} \cdot X_{ij}$ and + $w_{z}= \sum_i(|T_{iz}| \cdot \sum_j X_{ij})$
+ - Estimate inverse document/topic frequency for term $j$: + $idf_j = log(\frac{N}{\sum_z |t_{zj}|})$, where + $N$ is the total number of documents. + - Calculate importance of term $j$ for topic $z$: + $Soft-c-TF-IDF{zj} = tf_{zj} \cdot idf_j$ -### _(Optional)_ 4. Dynamic Modeling +### Dynamic Modeling GMM is also capable of dynamic topic modeling. This happens by fitting one underlying mixture model over the entire corpus, as we expect that there is only one semantic model generating the documents. To gain temporal representations for topics, the corpus is divided into equal, or arbitrarily chosen time slices, and then term importances are estimated using Soft-c-TF-IDF for each of the time slices separately. @@ -90,25 +94,6 @@ from sklearn.decomposition import IncrementalPCA model = GMM(20, dimensionality_reduction=IncrementalPCA(20)) ``` -## Considerations - -### Strengths - - - Efficiency, Stability: GMM relies on a rock solid implementation in scikit-learn, you can rest assured that the model will be fast and reliable. - - Coverage of Ingroup Variance: The model is very efficient at describing the extracted topics in all their detail. - This means that the topic descriptions will typically cover most of the documents generated from the topic fairly well. - - Uncertainty: GMM is capable of expressing and modeling uncertainty around topic labels for documents. - - Dynamic Modeling: You can model changes in topics over time using GMM. - -### Weaknesses - - - Curse of Dimensionality: The dimensionality of embeddings can vary wildly from model to model. High-dimensional embeddings might decrease the efficiency and performance of GMM, as it is sensitive to the curse of dimensionality. Dimensionality reduction can help mitigate these issues. - - Assumption of Gaussianity: The model assumes that topics are Gaussian components, it might very well be that this is not the case. - Fortunately enough this rarely effects real-world perceived performance of models, and typically does not present an issue in practical settings. - - Moderate Scalability: While the model is scalable to a certain extent, it is not nearly as scalable as some of the other options. If you experience issues with computational efficiency or convergence, try another model. - - Moderate Robustness to Noise: GMM is similarly sensitive to noise and stop words as BERTopic, and can sometimes find noise components. Our experience indicates that GMM is way less volatile, and the quality of the results is more reliable than with clustering models using C-TF-IDF. - - ## API Reference ::: turftopic.models.gmm.GMM diff --git a/docs/KeyNMF.md b/docs/KeyNMF.md index cf5f428..01535f7 100644 --- a/docs/KeyNMF.md +++ b/docs/KeyNMF.md @@ -19,42 +19,42 @@ model.fit(corpus) model.print_topics() ``` -## Keyword Extraction +## How does KeyNMF work? -The first step of the process is gaining enhanced representations of documents by using contextual embeddings. -Both the documents and the vocabulary get encoded with the same sentence encoder. -Keywords are assigned to each document based on the cosine similarity of the document embedding to the embedded words in the document. -Only the top K words with positive cosine similarity to the document are kept. -These keywords are then arranged into a document-term importance matrix where each column represents a keyword that was encountered in at least one document, -and each row is a document. The entries in the matrix are the cosine similarities of the given keyword to the document in semantic space. 
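+KeyNMF proceeds in two steps, keyword extraction and topic discovery, both described in detail below. As a mental model, the whole procedure can be sketched from scratch in a few lines of Python. This is a deliberately simplified illustration rather than Turftopic's actual implementation; the toy corpus, the encoder name and the `top_n` value are only examples:
+
+```python
+import numpy as np
+from sentence_transformers import SentenceTransformer
+from sklearn.decomposition import NMF
+from sklearn.feature_extraction.text import CountVectorizer
+from sklearn.preprocessing import normalize
+
+corpus = [
+    "The cat sat on the warm windowsill all afternoon.",
+    "Stock prices fell sharply after the earnings report.",
+    "The dog chased a ball across the muddy park.",
+    "Investors are worried about inflation and interest rates.",
+]
+top_n = 5  # number of keywords kept per document
+
+# Which words occur in which documents
+vectorizer = CountVectorizer()
+doc_term = vectorizer.fit_transform(corpus)
+vocab = vectorizer.get_feature_names_out()
+
+# Embed documents and vocabulary with the same encoder, then L2-normalize
+encoder = SentenceTransformer("all-MiniLM-L6-v2")
+doc_emb = normalize(encoder.encode(corpus))
+word_emb = normalize(encoder.encode(list(vocab)))
+
+# Cosine similarity of every word to every document
+sim = doc_emb @ word_emb.T
+
+# Keyword matrix M: keep the top_n positive similarities per document,
+# restricted to words that actually occur in that document
+M = np.zeros_like(sim)
+for i in range(len(corpus)):
+    in_doc = doc_term[i].toarray().ravel() > 0
+    candidates = np.where(in_doc & (sim[i] > 0))[0]
+    top = candidates[np.argsort(sim[i, candidates])[-top_n:]]
+    M[i, top] = sim[i, top]
+
+# Topic discovery: non-negative matrix factorization of the keyword matrix
+nmf = NMF(n_components=2, init="nndsvda", max_iter=500)
+W = nmf.fit_transform(M)  # document-topic matrix
+H = nmf.components_       # topic-term matrix
+for topic in H:
+    print([vocab[j] for j in topic.argsort()[-5:][::-1]])
+```
+
+The sections below describe each of these steps precisely.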
+### Keyword Extraction -- For each document $d$: - 1. Let $x_d$ be the document's embedding produced with the encoder model. - 2. For each word $w$ in the document $d$: - 1. Let $v_w$ be the word's embedding produced with the encoder model. - 2. Calculate cosine similarity between word and document +KeyNMF discovers topics based on the importances of keywords for a given document. +This is done by embedding words in a document, and then extracting the cosine similarities of documents to words using a transformer-model. +Only the `top_n` keywords with positive similarity are kept. - $$ - \text{sim}(d, w) = \frac{x_d \cdot v_w}{||x_d|| \cdot ||v_w||} - $$ +??? info "Click to see formula" + - For each document $d$: + 1. Let $x_d$ be the document's embedding produced with the encoder model. + 2. For each word $w$ in the document $d$: + 1. Let $v_w$ be the word's embedding produced with the encoder model. + 2. Calculate cosine similarity between word and document + + $$ + \text{sim}(d, w) = \frac{x_d \cdot v_w}{||x_d|| \cdot ||v_w||} + $$ - 3. Let $K_d$ be the set of $N$ keywords with the highest cosine similarity to document $d$. + 3. Let $K_d$ be the set of $N$ keywords with the highest cosine similarity to document $d$. - $$ - K_d = \text{argmax}_{K^*} \sum_{w \in K^*}\text{sim}(d,w)\text{, where } - |K_d| = N\text{, and } \\ - w \in d - $$ + $$ + K_d = \text{argmax}_{K^*} \sum_{w \in K^*}\text{sim}(d,w)\text{, where } + |K_d| = N\text{, and } \\ + w \in d + $$ -- Arrange positive keyword similarities into a keyword matrix $M$ where the rows represent documents, and columns represent unique keywords. + - Arrange positive keyword similarities into a keyword matrix $M$ where the rows represent documents, and columns represent unique keywords. - $$ - M_{dw} = - \begin{cases} - \text{sim}(d,w), & \text{if } w \in K_d \text{ and } \text{sim}(d,w) > 0 \\ - 0, & \text{otherwise}. - \end{cases} - $$ + $$ + M_{dw} = + \begin{cases} + \text{sim}(d,w), & \text{if } w \in K_d \text{ and } \text{sim}(d,w) > 0 \\ + 0, & \text{otherwise}. + \end{cases} + $$ You can do this step manually if you want to precompute the keyword matrix. Keywords are represented as dictionaries mapping words to keyword importances. @@ -78,19 +78,22 @@ keyword_matrix = model.extract_keywords(corpus) model.fit(None, keywords=keyword_matrix) ``` -## Topic Discovery +### Topic Discovery Topics in this matrix are then discovered using Non-negative Matrix Factorization. Essentially the model tries to discover underlying dimensions/factors along which most of the variance in term importance can be explained. -- Decompose $M$ with non-negative matrix factorization: $M \approx WH$, where $W$ is the document-topic matrix, and $H$ is the topic-term matrix. Non-negative Matrix Factorization is done with the coordinate-descent algorithm, minimizing square loss: +??? info "Click to see formula" + + - Decompose $M$ with non-negative matrix factorization: $M \approx WH$, where $W$ is the document-topic matrix, and $H$ is the topic-term matrix. Non-negative Matrix Factorization is done with the coordinate-descent algorithm, minimizing square loss: + + $$ + L(W,H) = ||M - WH||^2 + $$ - $$ - L(W,H) = ||M - WH||^2 - $$ + You can fit KeyNMF on the raw corpus, with precomputed embeddings or with precomputed keywords. -You can fit KeyNMF on the raw corpus, with precomputed embeddings or with precomputed keywords. 
```python # Fitting just on the corpus model.fit(corpus) @@ -109,7 +112,7 @@ keyword_matrix = model.extract_keywords(corpus) model.fit(None, keywords=keyword_matrix) ``` -## Asymmetric and Instruction-tuned Embedding Models +### Asymmetric and Instruction-tuned Embedding Models Some embedding models can be used together with prompting, or encode queries and passages differently. This is important for KeyNMF, as it is explicitly based on keyword retrieval, and its performance can be substantially enhanced by using asymmetric or prompted embeddings. @@ -154,21 +157,23 @@ model = KeyNMF(10, encoder=encoder) Setting the default prompt to `query` is especially important, when you are precomputing embeddings, as `query` should always be your default prompt to embed documents with. -## Dynamic Topic Modeling +### Dynamic Topic Modeling KeyNMF is also capable of modeling topics over time. This happens by fitting a KeyNMF model first on the entire corpus, then fitting individual topic-term matrices using coordinate descent based on the document-topic and document-term matrices in the given time slices. -1. Compute keyword matrix $M$ for the whole corpus. -2. Decompose $M$ with non-negative matrix factorization: $M \approx WH$. -3. For each time slice $t$: - 1. Let $W_t$ be the document-topic proportions for documents in time slice $t$, and $M_t$ be the keyword matrix for words in time slice $t$. - 2. Obtain the topic-term matrix for the time slice, by minimizing square loss using coordinate descent and fixing $W_t$: +??? info "Click to see formula" - $$ - H_t = \text{argmin}_{H^{*}} ||M_t - W_t H^{*}||^2 - $$ + 1. Compute keyword matrix $M$ for the whole corpus. + 2. Decompose $M$ with non-negative matrix factorization: $M \approx WH$. + 3. For each time slice $t$: + 1. Let $W_t$ be the document-topic proportions for documents in time slice $t$, and $M_t$ be the keyword matrix for words in time slice $t$. + 2. 
Obtain the topic-term matrix for the time slice, by minimizing square loss using coordinate descent and fixing $W_t$: + + $$ + H_t = \text{argmin}_{H^{*}} ||M_t - W_t H^{*}||^2 + $$ Here's an example of using KeyNMF in a dynamic modeling setting: @@ -200,12 +205,7 @@ model.print_topics_over_time() | - | - | - | - | - | - | | 2012 12 06 - 2013 11 10 | genocide, yugoslavia, karadzic, facts, cnn | cnn, russia, chechnya, prince, merkel | france, cnn, francois, hollande, bike | tennis, tournament, wimbledon, grass, courts | beckham, soccer, retired, david, learn | | 2013 11 10 - 2014 10 14 | keith, stones, richards, musician, author | georgia, russia, conflict, 2008, cnn | civil, rights, hear, why, should | cnn, kidneys, traffickers, organ, nepal | ronaldo, cristiano, goalscorer, soccer, player | -| 2014 10 14 - 2015 09 18 | ethiopia, brew, coffee, birthplace, anderson | climate, sutter, countries, snapchat, injustice | women, guatemala, murder, country, worst | cnn, climate, oklahoma, women, topics | sweden, parental, dads, advantage, leave | -| 2015 09 18 - 2016 08 22 | snow, ice, winter, storm, pets | climate, crisis, drought, outbreaks, syrian | women, vulnerabilities, frontlines, countries, marcelas | cnn, warming, climate, sutter, theresa | sutter, band, paris, fans, crowd | -| 2016 08 22 - 2017 07 26 | derby, epsom, sporting, race, spectacle | overdoses, heroin, deaths, macron, emmanuel | fear, died, indigenous, people, arthur | siblings, amnesia, palombo, racial, mh370 | bobbi, measles, raped, camp, rape | -| 2017 07 26 - 2018 06 30 | her, percussionist, drums, she, deported | novichok, hurricane, hospital, deaths, breathing | women, day, celebrate, taliban, international | abuse, harassment, cnn, women, pilgrimage | maradona, argentina, history, jadon, rape | -| 2018 06 30 - 2019 06 03 | athletes, teammates, celtics, white, racism | pope, archbishop, francis, vigano, resignation | racism, athletes, teammates, celtics, white | golf, iceland, volcanoes, atlantic, ocean | rape, sudanese, racist, women, soldiers | -| 2019 06 03 - 2020 05 07 | esports, climate, ice, racers, culver | esports, coronavirus, pandemic, football, teams | racers, women, compete, zone, bery | serena, stadium, sasha, final, naomi | kobe, bryant, greatest, basketball, influence | +| | | ... | | | | | 2020 05 07 - 2021 04 10 | olympics, beijing, xinjiang, ioc, boycott | covid, vaccine, coronavirus, pandemic, vaccination | olympic, japan, medalist, canceled, tokyo | djokovic, novak, tennis, federer, masterclass | ronaldo, cristiano, messi, juventus, barcelona | | 2021 04 10 - 2022 03 16 | olympics, tokyo, athletes, beijing, medal | covid, pandemic, vaccine, vaccinated, coronavirus | olympic, athletes, ioc, medal, athlete | djokovic, novak, tennis, wimbledon, federer | ronaldo, cristiano, messi, manchester, scored | @@ -225,11 +225,11 @@ model.plot_topics_over_time(top_k=5) ```
- +
Topics over time on a Figure
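+If you are curious what the per-time-slice refit in the formula above amounts to, here is a rough sketch using scikit-learn directly. This is only an illustration of the idea, not Turftopic's implementation; `M_t` and `W_t` are hypothetical inputs (the keyword matrix of one time slice and the corresponding rows of the global document-topic matrix):
+
+```python
+from sklearn.decomposition import non_negative_factorization
+
+
+def refit_time_slice(M_t, W_t):
+    """Solve M_t ~ W_t @ H_t for H_t while keeping W_t fixed."""
+    # non_negative_factorization keeps H fixed when update_H=False,
+    # so we work on the transposed problem: M_t.T ~ H_t.T @ W_t.T
+    Ht_transposed, _, _ = non_negative_factorization(
+        M_t.T,
+        H=W_t.T,
+        n_components=W_t.shape[1],
+        init="custom",
+        update_H=False,
+    )
+    return Ht_transposed.T  # topic-term matrix for this time slice
+```
+
+In practice you do not need to do this yourself; the dynamic fitting methods shown above take care of it.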
-## Online Topic Modeling +### Online Topic Modeling KeyNMF can also be fitted in an online manner. This is done by fitting NMF with batches of data instead of the whole dataset at once. @@ -354,22 +354,22 @@ for batch in batched(zip(corpus, timestamps)): model.partial_fit_dynamic(text_batch, timestamps=ts_batch, bins=bins) ``` -## Hierarchical Topic Modeling +### Hierarchical Topic Modeling When you suspect that subtopics might be present in the topics you find with the model, KeyNMF can be used to discover topics further down the hierarchy. This is done by utilising a special case of **weighted NMF**, where documents are weighted by how high they score on the parent topic. -In other words: - -1. Decompose keyword matrix $M \approx WH$ -2. To find subtopics in topic $j$, define document weights $w$ as the $j$th column of $W$. -3. Estimate subcomponents with **wNMF** $M \approx \mathring{W} \mathring{H}$ with document weight $w$ - 1. Initialise $\mathring{H}$ and $\mathring{W}$ randomly. - 2. Perform multiplicative updates until convergence.
- $\mathring{W}^T = \mathring{W}^T \odot \frac{\mathring{H} \cdot (M^T \odot w)}{\mathring{H} \cdot \mathring{H}^T \cdot (\mathring{W}^T \odot w)}$
- $\mathring{H}^T = \mathring{H}^T \odot \frac{ (M^T \odot w)\cdot \mathring{W}}{\mathring{H}^T \cdot (\mathring{W}^T \odot w) \cdot \mathring{W}}$ -4. To sufficiently differentiate the subcomponents from each other a pseudo-c-tf-idf weighting scheme is applied to $\mathring{H}$: - 1. $\mathring{H} = \mathring{H}_{ij} \odot ln(1 + \frac{A}{1+\sum_k \mathring{H}_{kj}})$, where $A$ is the average of all elements in $\mathring{H}$ + +??? info "Click to see formula" + 1. Decompose keyword matrix $M \approx WH$ + 2. To find subtopics in topic $j$, define document weights $w$ as the $j$th column of $W$. + 3. Estimate subcomponents with **wNMF** $M \approx \mathring{W} \mathring{H}$ with document weight $w$ + 1. Initialise $\mathring{H}$ and $\mathring{W}$ randomly. + 2. Perform multiplicative updates until convergence.
+ $\mathring{W}^T = \mathring{W}^T \odot \frac{\mathring{H} \cdot (M^T \odot w)}{\mathring{H} \cdot \mathring{H}^T \cdot (\mathring{W}^T \odot w)}$
+ $\mathring{H}^T = \mathring{H}^T \odot \frac{ (M^T \odot w)\cdot \mathring{W}}{\mathring{H}^T \cdot (\mathring{W}^T \odot w) \cdot \mathring{W}}$ + 4. To sufficiently differentiate the subcomponents from each other a pseudo-c-tf-idf weighting scheme is applied to $\mathring{H}$: + 1. $\mathring{H} = \mathring{H}_{ij} \odot ln(1 + \frac{A}{1+\sum_k \mathring{H}_{kj}})$, where $A$ is the average of all elements in $\mathring{H}$ To create a hierarchical model, you can use the `hierarchy` property of the model. @@ -395,20 +395,6 @@ print(model.hierarchy) For a detailed tutorial on hierarchical modeling click [here](hierarchical.md). -## Considerations - -### Strengths - - - Stability, Robustness and Quality: KeyNMF extracts very clean topics even when a lot of noise is present in the corpus, and the model's performance remains relatively stable across domains. - - Scalability: The model can be fitted in an online fashion, and we recommend that you choose KeyNMF when the number of documents is large (over 100 000). - - Fail Safe and Adjustable: Since the modelling process consists of multiple easily separable steps it is easy to repeat one if something goes wrong. This also makes it an ideal choice for production usage. - - Can capture multiple topics in a document. - -### Weaknesses - - - Lack of Nuance: Since only the top K keywords are considered and used for topic extraction some of the nuances, especially in long texts might get lost. We therefore recommend that you scale K with the average length of the texts you're working with. For tweets it might be worth it to scale it down to 5, while with longer documents, a larger number (let's say 50) might be advisable. - - Practitioners have to choose the number of topics a priori. - ## API Reference ::: turftopic.models.keynmf.KeyNMF diff --git a/docs/basics.md b/docs/basics.md index 8c52228..66d7f0f 100644 --- a/docs/basics.md +++ b/docs/basics.md @@ -236,6 +236,18 @@ latex_table: str = model.export_topics(format="latex") md_table: str = model.export_representative_documents(0, corpus, document_topic_matrix, format="markdown") ``` +### Naming topics + +You can manually name topics in Turftopic models after having interpreted them. +If you find a more fitting name for a topic, feel free to rename it in your model. + +```python +from turftopic import SemanticSignalSeparation + +model = SemanticSignalSeparation(10).fit(corpus) +model.rename_topics({0: "New name for topic 0", 5: "New name for topic 5"}) +``` + ### Visualization Turftopic does not come with built-in visualization utilities, [topicwizard](https://github.com/x-tabdeveloping/topicwizard), a package for interactive topic model interpretation is fully compatible with Turftopic models. diff --git a/docs/clustering.md b/docs/clustering.md index 3b00a0a..bab6de9 100644 --- a/docs/clustering.md +++ b/docs/clustering.md @@ -10,94 +10,133 @@ while sticking to a minimal amount of extra dependencies. While the models themselves can be equivalent to BERTopic and Top2Vec implementations, Turftopic might not offer some of the implementation-specific features, that the other libraries boast. -## The Model +## How do clustering models work? -### 1. Dimensionality Reduction +### Dimensionality Reduction -It is common practice in clustering topic modeling literature to reduce the dimensionality of the embeddings before clustering them. -This is to avoid the curse of dimensionality, an issue, which many clustering models are affected by. 
+```python
+from sklearn.manifold import TSNE
+from turftopic import ClusteringTopicModel
-Dimensionality reduction by default is done with scikit-learn's TSNE implementation in Turftopic,
+model = ClusteringTopicModel(dimensionality_reduction=TSNE())
+```
+
+It is common practice to reduce the dimensionality of the embeddings before clustering them.
+This is done to avoid the curse of dimensionality, an issue that affects many clustering models.
+Dimensionality reduction by default is done with scikit-learn's **TSNE** implementation in Turftopic,
 but users are free to specify the model that will be used for dimensionality reduction.
-Our knowledge about the impacts of choice of dimensionality reduction is limited, and has not yet been explored in the literature.
-Top2Vec and BERTopic both use UMAP, which has a number of desirable properties over alternatives (arranging data points into cluster-like structures, better preservation of global structure than TSNE, speed).
+??? note "What reduction model should I choose?"
+    Our knowledge about the impacts of choice of dimensionality reduction is limited, and has not yet been explored in the literature.
+    Top2Vec and BERTopic both use UMAP, which has a number of desirable properties over alternatives (arranging data points into cluster-like structures, better preservation of global structure than TSNE, speed).
-### 2. Clustering
+### Clustering
+
+```python
+from sklearn.cluster import OPTICS
+from turftopic import ClusteringTopicModel
+
+model = ClusteringTopicModel(clustering=OPTICS())
+```
-After reducing the dimensionality of the embeddings, they are clustered with a clustering model.
-As HDBSCAN has only been part of scikit-learn since version 1.3.0, Turftopic uses OPTICS as its default.
+After reducing the dimensionality of the embeddings, they are clustered with a clustering model.
+As HDBSCAN has only been part of scikit-learn since version 1.3.0, Turftopic uses **OPTICS** as its default.
-Some clustering models are capable of discovering the number of clusters in the data.
-This is a useful and yet-to-be challenged property of clustering topic models.
-Practice suggests, however, that in large corpora, this frequently results in a very large number of topics, which is impractical for interpretation.
-Models' hyperparameters can be adjusted to account for this behaviour, but the impact of choice of hyperparameters on topic quality is more or less unknown.
+??? note "What clustering model should I choose?"
+    Some clustering models are capable of discovering the number of clusters in the data (HDBSCAN, DBSCAN, OPTICS, etc.).
+    Practice suggests, however, that in large corpora, this frequently results in a very large number of topics, which is impractical for interpretation.
+    Models' hyperparameters can be adjusted to account for this behaviour, but the impact of choice of hyperparameters on topic quality is more or less unknown.
+    You can also use models with a predefined number of clusters (e.g. KMeans); these, however, typically produce lower topic quality.
-### 3a. Term Importance: Proximity to Cluster Centroids
+### Term Importance
 Clustering topic models rely on post-hoc term importance estimation.
-Currently there are two methods used for this.
+Multiple methods can be used for this in Turftopic.
+
+!!! failure inline end "Weaknesses"
+    - Topics can be too specific, resulting in low within-topic coverage
+    - Assumes spherical clusters, which can give incorrect results
+
+!!! <br>
success inline end "Strengths" + - Clean topics + - Highly specific topics + +#### Proximity to Cluster Centroids + The solution introduced in Top2Vec (Angelov, 2020) is that of estimating terms' importances for a given topic from their embeddings' cosine similarity to the centroid of the embeddings in a cluster. -
- -
Terms Close to the Topic Vector
(figure from Top2Vec documentation)
-
+```python +from turftopic import ClusteringTopicModel + +model = ClusteringTopicModel(feature_importance="centroid") +``` -This has three implications: -1. Topic descriptions are very specific. As the closest terms to the topic vector are selected, they tend to also be very close to each other. - The issue with this is that many of the documents in a topic might not get proper coverage. -2. It is assumed that the clusters are convex and spherical. This might not at all be the case, and especially when clusters are concave, - the closest terms to the centroid might end up describing a different, or nonexistent topic. - In other words: The mean might not be a representative datapoint of the population. -3. Noise rarely gets into topic descriptions. Since functions words or contaminating terms are not very likely to be closest to the topic vector, - decriptions are typically clean. +!!! failure inline end "Weaknesses" + - Topics can be contaminated with stop words + - Lower topic quality -
- -
Centroids of Non-Convex Clusters
-
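+Conceptually, centroid-based term importance only takes a few lines of NumPy. The sketch below uses hypothetical, precomputed inputs (embeddings of the documents belonging to one cluster and embeddings of the vocabulary) and is not Turftopic's exact implementation:
+
+```python
+import numpy as np
+
+
+def centroid_term_importance(cluster_doc_embeddings, vocab_embeddings, vocab, top_k=10):
+    """Rank terms by cosine similarity of their embeddings to the cluster centroid."""
+    centroid = cluster_doc_embeddings.mean(axis=0)
+    centroid = centroid / np.linalg.norm(centroid)
+    vocab_norm = vocab_embeddings / np.linalg.norm(vocab_embeddings, axis=1, keepdims=True)
+    similarity = vocab_norm @ centroid
+    top = np.argsort(similarity)[::-1][:top_k]
+    return [(vocab[i], float(similarity[i])) for i in top]
+```
+
+Turftopic selects this estimator with `feature_importance="centroid"`, as shown above.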
+!!! success inline end "Strengths" + - Theoretically correct + - More within-topic coverage -### 3b. Term Importance: c-TF-IDF +#### c-TF-IDF -The solution to this issue, suggested by Grootendorst (2022) to this issue was c-TF-IDF. -c-TF-IDF is a weighting scheme based on the number of occurrences of terms in each cluster. +c-TF-IDF (Grootendorst, 2022) is a weighting scheme based on the number of occurrences of terms in each cluster. Terms which frequently occur in other clusters are inversely weighted so that words, which are specific to a topic gain larger importance. -Let $X$ be the document term matrix where each element ($X_{ij}$) corresponds with the number of times word $j$ occurs in a document $i$. +```python +from turftopic import ClusteringTopicModel + +model = ClusteringTopicModel(feature_importance="soft-c-tf-idf") +# or +model = ClusteringTopicModel(feature_importance="c-tf-idf") +``` + -By default, Turftopic uses a modified version of c-TF-IDF, called Soft-c-TF-IDF, which is calculated in the following manner: +By default, Turftopic uses a modified version of c-TF-IDF, called Soft-c-TF-IDF. -- Estimate weight of term $j$ for topic $z$:
-$tf_{zj} = \frac{t_{zj}}{w_z}$, where -$t_{zj} = \sum_{i \in z} X_{ij}$ is the number of occurrences of a word in a topic and -$w_{z}= \sum_{j} t_{zj}$ is all words in the topic
-- Estimate inverse document/topic frequency for term $j$: -$idf_j = log(\frac{N}{\sum_z |t_{zj}|})$, where -$N$ is the total number of documents. -- Calculate importance of term $j$ for topic $z$: -$Soft-c-TF-IDF{zj} = tf_{zj} \cdot idf_j$ +??? info "Click to see formula" + - Let $X$ be the document term matrix where each element ($X_{ij}$) corresponds with the number of times word $j$ occurs in a document $i$. + - Estimate weight of term $j$ for topic $z$:
+ $tf_{zj} = \frac{t_{zj}}{w_z}$, where + $t_{zj} = \sum_{i \in z} X_{ij}$ is the number of occurrences of a word in a topic and + $w_{z}= \sum_{j} t_{zj}$ is all words in the topic
+ - Estimate inverse document/topic frequency for term $j$: + $idf_j = log(\frac{N}{\sum_z |t_{zj}|})$, where + $N$ is the total number of documents. + - Calculate importance of term $j$ for topic $z$: + $Soft-c-TF-IDF{zj} = tf_{zj} \cdot idf_j$ You can also use the original c-TF-IDF formula, if you intend to replicate the behaviour of BERTopic exactly. The two formulas tend to give similar results, though the implications of choosing one over the other has not been thoroughly evaluated. -$tf_{zj} = \frac{t_{zj}}{w_z}$, where -$t_{zj} = \sum_{i \in z} X_{ij}$ is the number of occurrences of a word in a topic and -$w_{z}= \sum_{j} t_{zj}$ is all words in the topic
-- Estimate inverse document/topic frequency for term $j$: -$idf_j = log(1 + \frac{A}{\sum_z |t_{zj}|})$, where -$A = \frac{\sum_z \sum_j t_{zj}}{Z}$ is the average number of words per topic, and $Z$ is the number of topics. -- Calculate importance of term $j$ for topic $z$: -$c-TF-IDF{zj} = tf_{zj} \cdot idf_j$ +??? info "Click to see formula" + - Let $X$ be the document term matrix where each element ($X_{ij}$) corresponds with the number of times word $j$ occurs in a document $i$. + - $tf_{zj} = \frac{t_{zj}}{w_z}$, where + $t_{zj} = \sum_{i \in z} X_{ij}$ is the number of occurrences of a word in a topic and + $w_{z}= \sum_{j} t_{zj}$ is all words in the topic
+ - Estimate inverse document/topic frequency for term $j$: + $idf_j = log(1 + \frac{A}{\sum_z |t_{zj}|})$, where + $A = \frac{\sum_z \sum_j t_{zj}}{Z}$ is the average number of words per topic, and $Z$ is the number of topics. + - Calculate importance of term $j$ for topic $z$: + $c-TF-IDF{zj} = tf_{zj} \cdot idf_j$ + +#### Recalculating Term Importance -This solution is generally to be preferred to centroid-based term importance (and the default in Turftopic), as it is more likely to give correct results. -On the other hand, c-TF-IDF can be sensitive to words with atypical statistical properties (stop words), and can result in low diversity between topics, when clusters are joined post-hoc. +You can also choose to recalculate term importances with a different method after fitting the model: -### 4. Hierarchical Topic Merging +```python +from turftopic import ClusteringTopicModel + +model = ClusteringTopicModel().fit(corpus) +model.estimate_components(feature_importance="centroid") +model.estimate_components(feature_importance="soft-c-tf-idf") +``` + +### Hierarchical Topic Merging A weakness of clustering approaches based on density-based clustering methods, is that all too frequently they find a very large number of topics. To limit the number of topics in a topic model you can use hierarchical topic merging. @@ -122,97 +161,104 @@ You can do this in Turftopic as well: model = ClusteringTopicModel(n_reduce_to=10, reduction_method="agglomerative") ``` -## BERTopic and Top2Vec - -Turftopic's implementation differs in multiple places to BERTopic and Top2Vec. -You can, however, construct models in Turftopic that imitate the behaviour of these other packages. - -The main differences to these packages are: - - Dimensionality reduction in BERTopic and Top2Vec is done with UMAP. - - Clustering is in BERTopic and Top2Vec is done with HDBSCAN. - - Turftopic does not include many of the visualization and model-specific utilities that BERTopic does. +You can also merge topics after having run the models using the `reduce_topics()` method. -To get closest to the functionality of the two other packages you can manually set the clustering and dimensionality reduction model when creating the models: - -You will need UMAP and scikit-learn>=1.3.0: - -```bash -pip install umap-learn scikit-learn>=1.3.0 +```python +model = ClusteringTopicModel().fit(corpus) +model.reduce_topics(n_reduce_to=20, reduction_method="smallest") ``` -This is how you build a BERTopic-like model in Turftopic: +To reset topics to the original clustering, use the `reset_topics()` method: ```python -from turftopic import ClusteringTopicModel -from sklearn.cluster import HDBSCAN -import umap - -# I also included the default parameters of BERTopic so that the behaviour is as -# close as possible -berttopic = ClusteringTopicModel( - dimensionality_reduction=umap.UMAP( - n_neighbors=10, - n_components=5, - min_dist=0.0, - metric="cosine", - ), - clustering=HDBSCAN( - min_cluster_size=15, - metric="euclidean", - cluster_selection_method="eom", - ), - feature_importance="c-tf-idf", - reduction_method="agglomerative" -) +model.reset_topics() ``` -This is how you build a Top2Vec model in Turftopic: +### Manual Topic Merging + +You can also manually merge topics using the `join_topics()` method. 
```python -top2vec = ClusteringTopicModel( - dimensionality_reduction=umap.UMAP( - n_neighbors=15, - n_components=5, - metric="cosine" - ), - clustering=HDBSCAN( - min_cluster_size=15, - metric="euclidean", - cluster_selection_method="eom", - ), - feature_importance="centroid", - reduction_method="smallest" -) +model = ClusteringTopicModel() +model.fit(texts, embeddings=embeddings) +# This joins topics 0, 1, 2 to be cluster 0 +model.join_topics([0, 1, 2]) ``` -Theoretically the model descriptions above should result in the same behaviour as the other two packages, but there might be minor changes in implementation. -We do not intend to keep up with changes in Top2Vec's and BERTopic's internal implementation details indefinitely. +### How do I use BERTopic and Top2Vec in Turftopic? -### _(Optional)_ 5. Dynamic Modeling +You can create BERTopic and Top2Vec models in Turftopic by modifying all model parameters and hyperparameters to match the defaults in those other packages. -Clustering models are also capable of dynamic topic modeling. This happens by fitting a clustering model over the entire corpus, as we expect that there is only one semantic model generating the documents. -To gain temporal representations for topics, the corpus is divided into equal, or arbitrarily chosen time slices, and then term importances are estimated using Soft-c-TF-IDF, c-TF-IDF, or distances from cluster centroid for each of the time slices separately. When distance from cluster centroids is used to estimate topic importances in dynamic modeling, cluster centroids are computed based on documents and terms present within a given time slice. +You will need UMAP and scikit-learn>=1.3.0 to be able to use HDBSCAN and UMAP: +```bash +pip install umap-learn scikit-learn>=1.3.0 +``` -## Considerations +#### BERTopic + +You will need to set the clustering model to HDBSCAN and dimensionality reduction to UMAP. +BERTopic also uses the original c-tf-idf formula and agglomerative topic joining. + +??? info "Show code" + + ```python + from turftopic import ClusteringTopicModel + from sklearn.cluster import HDBSCAN + import umap + + berttopic = ClusteringTopicModel( + dimensionality_reduction=umap.UMAP( + n_neighbors=10, + n_components=5, + min_dist=0.0, + metric="cosine", + ), + clustering=HDBSCAN( + min_cluster_size=15, + metric="euclidean", + cluster_selection_method="eom", + ), + feature_importance="c-tf-idf", + reduction_method="agglomerative" + ) + ``` + +#### Top2Vec + +You will need to set the clustering model to HDBSCAN and dimensionality reduction to UMAP. +Top2Vec uses `centroid` feature importance and `smallest` topic merging method. + +??? info "Show code" + ```python + top2vec = ClusteringTopicModel( + dimensionality_reduction=umap.UMAP( + n_neighbors=15, + n_components=5, + metric="cosine" + ), + clustering=HDBSCAN( + min_cluster_size=15, + metric="euclidean", + cluster_selection_method="eom", + ), + feature_importance="centroid", + reduction_method="smallest" + ) + ``` -### Strengths +Theoretically the model descriptions above should result in the same behaviour as the other two packages, but there might be minor changes in implementation. +We do not intend to keep up with changes in Top2Vec's and BERTopic's internal implementation details indefinitely. - - Automatic Discovery of Number of Topics: Clustering models can find the number of topics by themselves. This is a useful quality of these models as practicioners can rarely make an informed decision about the number of topics a-priori. 
- - No Assumptions of Normality: With clustering models you can avoid making assumptions about cluster shapes. This is in contrast with GMMs, which assume topics to be Gaussian components. - - Outlier Detection: OPTICS, HDBSCAN or DBSCAN contain outlier detection. This way, outliers do not influence topic representations. - - Not Affected by Embedding Size: Since the models include dimensionality reduction, they are not as influenced by the curse of dimensionality as other methods. +### Dynamic Modeling -### Weaknesses +Clustering models are also capable of dynamic topic modeling. This happens by fitting a clustering model over the entire corpus, as we expect that there is only one semantic model generating the documents. - - Scalability: Clustering models typically cannot be fitted in an online fashion, and manifold learning is usually inefficient in large corpora. When the number of texts is huge, the number of topics also gets inflated, which is impractical for interpretation. - - Lack of Nuance: The models are unable to capture multiple topics in a document or capture uncertainty around topic labels. This makes the models impractical for longer texts as well. - - Sensitivity to Hyperparameters: While do not have to set the number of topics directly, the hyperparameters you choose has a huge impact on the number of topics you will end up getting. You can counteract this to a certain extent with hierarchical merging. (see figure) - - Transductivity: Some clustering methods are transductive, meaning you can't predict topical content for new documents, as they would change the cluster structure. +```python +from turftopic import ClusteringTopicModel -
- -
Effect of UMAP's and HDBSCAN's Hyperparameters on the Number of Topics in 20 Newsgroups
-
+model = ClusteringTopicModel().fit_dynamic(corpus, timestamps=ts, bins=10) +model.print_topics_over_time() +``` ## API Reference diff --git a/docs/ctm.md b/docs/ctm.md index 77bbbe7..9b5d3bb 100644 --- a/docs/ctm.md +++ b/docs/ctm.md @@ -56,20 +56,6 @@ This has a number of implications, most notably: Turftopic, similarly to Clustering models might not contain some model specific utilites, that CTM boasts. -## Considerations - -### Strengths - - - Topic Proportions: Autoencoding models can capture multiple topics in a document and can therefore capture nuances that other models might not be able to. - - Online Learning: You can fit these models in an online way as they use minibatch learning. (WARNING: This is not yet implemented in Turftopic) - -### Weaknesses - - - Low Quality and Sensitivity to Noise: The quality of topics tends to be lower than with other models. Noise might get into topic description, there might be overlap between topics, and there might be topics that are hard to interpret. Other models typically outperform autoencoding models. - - Curse of Dimensionality: The number of parameters in these models is typically very high. This makes inference difficult and might result in poor convergence, and very slow inference. - - Black Box: Since the mapping to parameter space is learned by a neural network, the model is very black box in nature and it's hard to know why and what it learns. - - ## API Reference ::: turftopic.models.ctm.AutoEncodingTopicModel diff --git a/docs/model_overview.md b/docs/model_overview.md index ba08c6a..9406de5 100644 --- a/docs/model_overview.md +++ b/docs/model_overview.md @@ -3,8 +3,6 @@ In any use case it is important that practicioners understand the implications of their choices. This page is dedicated to giving an overview of the models in the package, so you can find the right one for your particular application. -## Theory - ### What is a topic? Models in Turftopic provide answers to this question that can at large be assigned into two categories: @@ -61,156 +59,6 @@ Term importances in different models are calculated differently. 1. Some models (KeyNMF, Autoencoding) __infer__ term importances, as they are model parameters. 2. Other models (GMM, Clustering, $S^3$) use __post-hoc__ measures for determining term importance. - -## Performance - -Here's a table with the models' performance on a number of quantitative metrics with the `all-MiniLM-L6-v2` embedding model. -Results were obtained with the `topic-benchmark` package. - -
- - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - - -
20 Newsgroups RawBBC NewsArXiv ML Papers
CNPMIDiversityWECinWECexCNPMIDiversityWECinWECexCNPMIDiversityWECinWECex
KeyNMF0.1480.8980.5310.2730.0730.9070.9250.302-0.0100.7560.8210.172
-0.2110.9900.4490.278-0.2920.9230.9230.237-0.3200.9430.9070.205
Top2Vec (Clustering)-0.1370.9490.4860.373-0.2960.7800.9310.264-0.2660.4660.8440.166
BERTopic (Clustering)0.0410.5320.4160.196-0.0100.5000.6090.256-0.0100.3540.5710.189
CombinedTM (Autoencoding)-0.0380.8830.4010.180-0.0280.9050.8590.161-0.0580.8080.7440.132
ZeroShotTM (Autoencoding)-0.0160.8900.4460.183-0.0180.8220.8280.174-0.0620.7670.7540.130
- -_Model Comparison on 3 Corpora: Best bold, second best underlined_ -
- -### 1. When in doubt **use KeyNMF**. - -When you can't make an informed decision about which model is optimal for your use case, or you just want to get your hands dirty with topic modeling, -KeyNMF is by far the best option. -It is very stable, gives high quality topics, and is incredibly robust to noise. -It is also the closest to classical topic models and thus conforms to your intuition about topic modeling. - -Another advantage is that KeyNMF is the most scalable and fail-safe option, meaning that you can use it on enormous corpora. - -### 2. Short Texts - **use Clustering or GMM** - -On tweets and short texts in general, making the assumption that a document only contains one topic is very reasonable. -Clustering models and GMM are very good in this context and should be preferred over other options. - -### 3. Want to understand variation? **use $S^3$** - -$S^3$ is by far the best model to explain variations in semantics. -If you are looking for a model that can help you establish a theory of semantics in a corpus, $S^3$ is an excellent choice. - -### 4. Avoid using Autoencoding Models. - -In my anecdotal experience and all experiments I've done with topic models, Autoencoding Models were consistently outclassed by all else, -and their behaviour is also incredbly opaque. -Convergence issues or overlapping topics are a common occurrence. And as such, unless you have reasons to do so I would recommend that your first choice is another model on the list. - ## API Reference :::turftopic.base.ContextualModel