Implement Dynamic modeling for $S^3$ #67

x-tabdeveloping · 2024-10-23T13:54:03Z

Since ICA is fully linear, getting time-slice-specific topics only requires fitting a linear regression within each time slice, from document embeddings to document-topic scores.
Here's some code that already works pretty well:

from datetime import datetime
from typing import Optional, Union

import numpy as np
import pandas as pd
from sentence_transformers import SentenceTransformer
from sklearn.linear_model import LinearRegression

from turftopic import SemanticSignalSeparation
from turftopic.dynamic import DynamicTopicModel


class DynamicS3(DynamicTopicModel, SemanticSignalSeparation):
    def fit_transform_dynamic(
        self,
        raw_documents,
        timestamps: list[datetime],
        embeddings: Optional[np.ndarray] = None,
        bins: Union[int, list[datetime]] = 10,
    ) -> np.ndarray:
        document_topic_matrix = self.fit_transform(
            raw_documents, embeddings=embeddings
        )
        time_labels, self.time_bin_edges = self.bin_timestamps(
            timestamps, bins
        )
        n_comp, n_vocab = self.components_.shape
        n_bins = len(self.time_bin_edges) - 1
        self.temporal_components_ = np.full(
            (n_bins, n_comp, n_vocab),
            np.nan,
            dtype=self.components_.dtype,
        )
        self.temporal_importance_ = np.zeros((n_bins, n_comp))
        # Center the embeddings if the decomposition whitens its input.
        whitened_embeddings = np.copy(self.embeddings)
        if getattr(self.decomposition, "whiten", False):
            whitened_embeddings -= self.decomposition.mean_
        for i_timebin in np.unique(time_labels):
            in_bin = time_labels == i_timebin
            t_doc_topic = document_topic_matrix[in_bin]
            t_embeddings = whitened_embeddings[in_bin]
            # Topic importance in a bin is the mean document-topic score.
            self.temporal_importance_[i_timebin, :] = t_doc_topic.mean(axis=0)
            # Regress topic scores on embeddings within the bin, then project
            # the vocabulary through the fitted coefficients to get term weights.
            linreg = LinearRegression().fit(t_embeddings, t_doc_topic)
            self.temporal_components_[i_timebin, :, :] = np.dot(
                self.vocab_embeddings, linreg.coef_.T
            ).T
        return document_topic_matrix

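# `trf` is assumed to be a SentenceTransformer encoder; `corpus` and `ts` are
# the documents and their timestamps (all defined elsewhere).
model = DynamicS3(10, encoder=trf)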
model.fit_dynamic(corpus, timestamps=ts, bins=10)

model.print_topics_over_time()
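
In matrix terms, the loop in the snippet above can be read as follows. This is only a sketch of what the snippet does, not a statement about the final implementation. For time bin $t$, let $X_t$ be the (centered) document embeddings in the bin, $S_t$ the matching rows of the document-topic matrix, and $V$ the vocabulary embeddings:

$$
(\hat{B}_t, \hat{b}_t) = \arg\min_{B,\, b} \left\lVert X_t B + \mathbf{1} b^\top - S_t \right\rVert_F^2,
\qquad
C_t = (V \hat{B}_t)^\top
$$

Here $\hat{B}_t$ corresponds to linreg.coef_.T, the intercept $\hat{b}_t$ is fitted by LinearRegression but not used afterwards, and $C_t$ is what ends up in temporal_components_ for bin $t$.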

What do we need to do?

Implement this functionality in the original class (SemanticSignalSeparation)
Print both the negative and the positive words on each axis when calling print_topics_over_time()
Plot both the negative and the positive words when calling plot_topics_over_time()
Since topic importances in $S^3$ are not proportions (they can be negative and do not sum to one), adding a moving average to the plot could be really cool and would give more info on the actual dynamics.
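
A minimal sketch of the last three items, assuming an importance array of shape (n_bins, n_topics) like temporal_importance_ in the snippet above, and one row of term weights per topic and time bin. The helper names moving_average and top_terms_both_ends are hypothetical and not part of turftopic.

```python
import numpy as np


def moving_average(importance: np.ndarray, window: int = 3) -> np.ndarray:
    """Smooth topic importances over time bins with a centered window.

    `importance` has shape (n_bins, n_topics). Near the edges the window
    shrinks to the bins that exist, so no padding values are invented.
    """
    n_bins = importance.shape[0]
    half = window // 2
    smoothed = np.empty(importance.shape, dtype=float)
    for i in range(n_bins):
        lo, hi = max(0, i - half), min(n_bins, i + half + 1)
        smoothed[i] = importance[lo:hi].mean(axis=0)
    return smoothed


def top_terms_both_ends(weights: np.ndarray, vocab: list, k: int = 5):
    """Return the k highest-weighted and k lowest-weighted terms of one axis."""
    order = np.argsort(weights)  # ascending: most negative first
    positive = [vocab[i] for i in order[::-1][:k]]
    negative = [vocab[i] for i in order[:k]]
    return positive, negative
```

The smoothed array can be drawn on top of the raw importances in the plot, and the two term lists can be printed side by side for each time bin.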

x-tabdeveloping self-assigned this Oct 23, 2024
