Merge pull request #65 from x-tabdeveloping/sentence-trf

Sentence Transformers update
x-tabdeveloping · Oct 14, 2024 · f3f35f5 · f3f35f5
2 parents aaa90a7 + 75a33ab
commit f3f35f5
Showing 10 changed files with 215 additions and 175 deletions.
diff --git a/.github/workflows/documentation.yml b/.github/workflows/documentation.yml
@@ -23,12 +23,12 @@ jobs:
       - name: Dependencies
         run: |
           python -m pip install --upgrade pip
-          pip install "turftopic[pyro-ppl,docs]"
+          pip install "turftopic[pyro-ppl]" "griffe" "mkdocstrings[python]" "mkdocs" "mkdocs-material"
 
       - name: Build and Deploy
         if: github.event_name == 'push'
         run: mkdocs gh-deploy --force
 
       - name: Build
         if: github.event_name == 'pull_request'
-        run: mkdocs build
+        run: mkdocs build
diff --git a/README.md b/README.md
@@ -20,42 +20,29 @@
 
 > This package is still work in progress and scientific papers on some of the novel methods are currently undergoing peer-review. If you use this package and you encounter any problem, let us know by opening relevant issues.
 
-### New in version 0.5.0
+### New in version 0.6.0
 
-#### Hierarchical KeyNMF
+#### Prompting Embedding Models
 
-You can now subdivide topics in KeyNMF at will.
+KeyNMF and clustering topic models can now efficiently utilise asymmetric and instruction-finetuned embedding models.
+This, in combination with the right embedding model, can enhance performance significantly.
 
 ```python
 from turftopic import KeyNMF
-
-model = KeyNMF(2, top_n=15, random_state=42).fit(corpus)
-model.hierarchy.divide_children(n_subtopics=3)
-print(model.hierarchy)
-```
-
-```
-Root
-├── windows, dos, os, disk, card, drivers, file, pc, files, microsoft
-│   ├── 0.0: dos, file, disk, files, program, windows, disks, shareware, norton, memory
-│   ├── 0.1: os, unix, windows, microsoft, apps, nt, ibm, ms, os2, platform
-│   └── 0.2: card, drivers, monitor, driver, vga, ram, motherboard, cards, graphics, ati
-└── 1: atheism, atheist, atheists, religion, christians, religious, belief, christian, god, beliefs
-.    ├── 1.0: atheism, alt, newsgroup, reading, faq, islam, questions, read, newsgroups, readers
-.    ├── 1.1: atheists, atheist, belief, theists, beliefs, religious, religion, agnostic, gods, religions
-.    └── 1.2: morality, bible, christian, christians, moral, christianity, biblical, immoral, god, religion
+from sentence_transformers import SentenceTransformer
+
+encoder = SentenceTransformer(
+    "intfloat/multilingual-e5-large-instruct",
+    prompts={
+        "query": "Instruct: Retrieve relevant keywords from the given document. Query: ",
+        "passage": "Passage: "
+    },
+    # Make sure to set default prompt to query!
+    default_prompt_name="query",
+)
+model = KeyNMF(10, encoder=encoder)
 ```
 
-#### FASTopic *(Experimental)*
-
-You can now use [FASTopic](https://github.com/BobXWu/FASTopic) inside Turftopic.
-
-```python
-from turftopic import FASTopic
-
-model = FASTopic(10).fit(corpus)
-model.print_topics()
-```
 
 ## Basics [(Documentation)](https://x-tabdeveloping.github.io/turftopic/)
 [![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/x-tabdeveloping/turftopic/blob/main/examples/basic_example_20newsgroups.ipynb)

diff --git a/docs/KeyNMF.md b/docs/KeyNMF.md
@@ -109,6 +109,51 @@ keyword_matrix = model.extract_keywords(corpus)
 model.fit(None, keywords=keyword_matrix)
 ```
 
+## Asymmetric and Instruction-tuned Embedding Models
+
+Some embedding models can be used together with prompting, or encode queries and passages differently.
+This is important for KeyNMF, as it is explicitly based on keyword retrieval, and its performance can be substantially enhanced by using asymmetric or prompted embeddings.
+Microsoft's E5 models are, for instance, all prompted by default, and it would be detrimental to performance not to do so yourself.
+
+In these cases, you're better off NOT passing a string to Turftopic models, but explicitly loading the model using `sentence-transformers`.
+
+Here's an example of using instruct models for keyword retrieval with KeyNMF.
+In this case, documents will serve as the queries and words as the passages:
+
+```python
+from turftopic import KeyNMF
+from sentence_transformers import SentenceTransformer
+
+encoder = SentenceTransformer(
+    "intfloat/multilingual-e5-large-instruct",
+    prompts={
+        "query": "Instruct: Retrieve relevant keywords from the given document. Query: ",
+        "passage": "Passage: "
+    },
+    # Make sure to set default prompt to query!
+    default_prompt_name="query",
+)
+model = KeyNMF(10, encoder=encoder)
+```
+
+And a regular, asymmetric example:
+
+```python
+encoder = SentenceTransformer(
+    "intfloat/e5-large-v2",
+    prompts={
+        "query": "query: ",
+        "passage": "passage: "
+    },
+    # Make sure to set default prompt to query!
+    default_prompt_name="query",
+)
+model = KeyNMF(10, encoder=encoder)
+```
+
+Setting the default prompt to `query` is especially important when you are precomputing embeddings, as `query` should always be the default prompt for embedding documents.
+
+
 ## Dynamic Topic Modeling
 
 KeyNMF is also capable of modeling topics over time.

diff --git a/docs/basics.md b/docs/basics.md
@@ -27,7 +27,7 @@ Here's a model that uses E5 large as the embedding model, and only learns words
 from turftopic import SemanticSignalSeparation
 from sklearn.feature_extraction.text import CountVectorizer
 
-model = SemanticSignalSeparation(10, encoder="intfloat/e5-large-v2", vectorizer=CountVectorizer(min_df=20))
+model = SemanticSignalSeparation(10, encoder="all-MiniLM-L6-v2", vectorizer=CountVectorizer(min_df=20))
 ```
 
 You can also use external models for encoding, here's an example with [OpenAI's embedding models](encoders.md#external_embeddings):
@@ -60,6 +60,67 @@ corpus: list[str] = ["this is a a document", "this is yet another document", ...
 model.fit(corpus)
 ```
 
+## Prompting Embedding Models
+
+Some embedding models can be used together with prompting, or encode queries and passages differently.
+This can significantly influence performance, especially in the case of models that are based on retrieval ([KeyNMF](KeyNMF.md)) or clustering ([ClusteringTopicModel](clustering.md)).
+Microsoft's E5 models are, for instance, all prompted by default, and it would be detrimental to performance not to do so yourself.
+
+In these cases, you're better off NOT passing a string to Turftopic models, but explicitly loading the model using `sentence-transformers`.
+
+Here's an example for clustering models:
+```python
+from turftopic import ClusteringTopicModel
+from sentence_transformers import SentenceTransformer
+
+encoder = SentenceTransformer(
+    "intfloat/multilingual-e5-large-instruct",
+    prompts={
+        "query": "Instruct: Cluster documents according to the topic they are about. Query: ",
+        "passage": "Passage: "
+    },
+    # Make sure to set default prompt to query!
+    default_prompt_name="query",
+)
+model = ClusteringTopicModel(encoder=encoder)
+```
+
+You can also use instruct models for keyword retrieval with KeyNMF.
+In this case, documents will serve as the queries and words as the passages:
+
+```python
+from turftopic import KeyNMF
+from sentence_transformers import SentenceTransformer
+
+encoder = SentenceTransformer(
+    "intfloat/multilingual-e5-large-instruct",
+    prompts={
+        "query": "Instruct: Retrieve relevant keywords from the given document. Query: ",
+        "passage": "Passage: "
+    },
+    # Make sure to set default prompt to query!
+    default_prompt_name="query",
+)
+model = KeyNMF(10, encoder=encoder)
+```
+
+When using KeyNMF with E5, make sure to specify the prompts even if you're not using instruct models:
+
+```python
+encoder = SentenceTransformer(
+    "intfloat/e5-large-v2",
+    prompts={
+        "query": "query: ",
+        "passage": "passage: "
+    },
+    # Make sure to set default prompt to query!
+    default_prompt_name="query",
+)
+model = KeyNMF(10, encoder=encoder)
+```
+
+Setting the default prompt to `query` is especially important when you are precomputing embeddings, as `query` should always be the default prompt for embedding documents.
+
 ## Precomputing Embeddings
 
 In order to cut down on costs/computational load when fitting multiple models in a row, you might want to encode the documents before fitting a model.
@@ -78,7 +139,7 @@ import numpy as np
 from sentence_transformers import SentenceTransformer
 from turftopic import GMM, ClusteringTopicModel
 
-encoder = SentenceTransformer("intfloat/e5-large-v2")
+encoder = SentenceTransformer("intfloat/e5-large-v2", prompts={"query": "query: ", "passage": "passage: "}, default_prompt_name="query")
 
 corpus: list[str] = ["this is a a document", "this is yet another document", ...]
 embeddings = np.asarray(encoder.encode(corpus))

diff --git a/docs/encoders.md b/docs/encoders.md
@@ -21,6 +21,65 @@ model = GMM(10, encoder="paraphrase-multilingual-MiniLM-L12-v2")
 Different encoders have different performance and model sizes.
 To make an informed choice about which embedding model you should be using check out the [Massive Text Embedding Benchmark](https://huggingface.co/blog/mteb).
 
+## Asymmetric and Instruction-tuned Embedding Models
+
+Some embedding models can be used together with prompting, or encode queries and passages differently.
+Microsoft's E5 models are, for instance, all prompted by default, and it would be detrimental to performance not to do so yourself.
+
+In these cases, you're better off NOT passing a string to Turftopic models, but explicitly loading the model using `sentence-transformers`.
+
+Here's an example of using instruct models for keyword retrieval with KeyNMF.
+In this case, documents will serve as the queries and words as the passages:
+
+```python
+from turftopic import KeyNMF
+from sentence_transformers import SentenceTransformer
+
+encoder = SentenceTransformer(
+    "intfloat/multilingual-e5-large-instruct",
+    prompts={
+        "query": "Instruct: Retrieve relevant keywords from the given document. Query: ",
+        "passage": "Passage: "
+    },
+    # Make sure to set default prompt to query!
+    default_prompt_name="query",
+)
+model = KeyNMF(10, encoder=encoder)
+```
+
+And a regular, asymmetric example:
+
+```python
+encoder = SentenceTransformer(
+    "intfloat/e5-large-v2",
+    prompts={
+        "query": "query: ",
+        "passage": "passage: "
+    },
+    # Make sure to set default prompt to query!
+    default_prompt_name="query",
+)
+model = KeyNMF(10, encoder=encoder)
+```
+
+## Performance tips
+
+From `sentence-transformers` version `3.2.0` you can significantly speed up some models by using
+the `onnx` backend instead of regular torch.
+
+```
+pip install "sentence-transformers[onnx,onnx-gpu]"
+```
+
+```python
+from turftopic import SemanticSignalSeparation
+from sentence_transformers import SentenceTransformer
+
+encoder = SentenceTransformer("all-MiniLM-L6-v2", backend="onnx")
+
+model = SemanticSignalSeparation(10, encoder=encoder)
+```
+
 ## External Embeddings
 
 If you do not have the computational resources to run embedding models on your own infrastructure, you can also use high quality 3rd party embeddings.
@@ -33,11 +92,3 @@ Turftopic currently supports OpenAI, Voyage and Cohere embeddings.
 :::turftopic.encoders.OpenAIEmbeddings
 
 :::turftopic.encoders.VoyageEmbeddings
-
-## E5 Embeddings
-
-Most E5 models expect the input to be prefixed with something like `"query: "` (see the [multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small) model card).  
-In instructional E5 models, it is also possible to add an instruction, following the format `f"Instruct: {task_description} \nQuery: {document}"` (see the [multilingual-e5-large-instruct](https://huggingface.co/intfloat/multilingual-e5-large-instruct) model card).  
-In Turftopic, E5 embeddings including the prefixing is handled by the `E5Encoder`.
-
-:::turftopic.encoders.E5Encoder
diff --git a/pyproject.toml b/pyproject.toml
@@ -6,17 +6,17 @@ line-length=79
 
 [tool.poetry]
 name = "turftopic"
-version = "0.5.4"
+version = "0.6.0"
 description = "Topic modeling with contextual representations from sentence transformers."
 authors = ["Márton Kardos <[email protected]>"]
 license = "MIT"
 readme = "README.md"
 
 [tool.poetry.dependencies]
 python = "^3.9"
-numpy = "^1.23.0"
+numpy = ">=1.23.0"
 scikit-learn = "^1.2.0"
-sentence-transformers = "^2.2.0"
+sentence-transformers = ">=2.2.0"
 torch = "^2.1.0"
 scipy = "^1.10.0"
 rich = "^13.6.0"
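Review note on the constraint change above: in Poetry, a caret requirement such as `^2.2.0` is shorthand for `>=2.2.0,<3.0.0`, so the old pins would have excluded the `sentence-transformers` 3.2.0 release that the new ONNX tip in `docs/encoders.md` refers to, as well as NumPy 2.x. The sketch below only illustrates that difference. The helper names `parse`, `caret_allows` and `at_least` are hypothetical, and the caret helper deliberately covers only versions with a non-zero major number, where the rule is "below the next major version".

```python
def parse(version: str) -> tuple[int, ...]:
    """Turn a plain 'X.Y.Z' string into a tuple of ints for comparison."""
    return tuple(int(part) for part in version.split("."))


def caret_allows(base: str, candidate: str) -> bool:
    """Caret rule for a non-zero major version: >=base and <next major."""
    low = parse(base)
    upper = (low[0] + 1, 0, 0)
    return low <= parse(candidate) < upper


def at_least(base: str, candidate: str) -> bool:
    """Plain lower bound, as in '>=base'."""
    return parse(candidate) >= parse(base)


# Old constraint on sentence-transformers vs. the new one.
old_allows_st_3_2 = caret_allows("2.2.0", "3.2.0")
new_allows_st_3_2 = at_least("2.2.0", "3.2.0")

# Old constraint on numpy vs. the new one.
old_allows_numpy_2 = caret_allows("1.23.0", "2.0.0")
new_allows_numpy_2 = at_least("1.23.0", "2.0.0")

print(old_allows_st_3_2, new_allows_st_3_2)
print(old_allows_numpy_2, new_allows_numpy_2)
```

The trade-off is that an open-ended `>=` bound also admits future major releases that may introduce breaking changes.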

diff --git a/turftopic/encoders/__init__.py b/turftopic/encoders/__init__.py
@@ -2,12 +2,10 @@
 from turftopic.encoders.cohere import CohereEmbeddings
 from turftopic.encoders.openai import OpenAIEmbeddings
 from turftopic.encoders.voyage import VoyageEmbeddings
-from turftopic.encoders.e5 import E5Encoder
 
 __all__ = [
     "CohereEmbeddings",
     "OpenAIEmbeddings",
     "VoyageEmbeddings",
     "ExternalEncoder",
-    "E5Encoder",
 ]
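
Review note on the `E5Encoder` removal: the prefixing it used to handle is now delegated to the `prompts` and `default_prompt_name` arguments of `SentenceTransformer`. As far as I understand the library, a prompt is simply a string prepended to each input text before encoding. The sketch below mimics that behaviour in plain Python so it can be run without downloading a model. `PROMPTS`, `DEFAULT_PROMPT_NAME` and `apply_prompt` are hypothetical names for illustration, not part of Turftopic or `sentence-transformers`, and the idea that vocabulary terms get the `passage` prompt is an assumption based on the docs text in this diff ("documents will serve as the queries and words as the passages").

```python
# Illustrative only: mimics how a named prompt is prepended to inputs.
PROMPTS = {
    "query": "query: ",
    "passage": "passage: ",
}
DEFAULT_PROMPT_NAME = "query"


def apply_prompt(texts, prompt_name=None):
    """Prepend the named prompt, or the default one, to every text."""
    name = DEFAULT_PROMPT_NAME if prompt_name is None else prompt_name
    prefix = PROMPTS[name]
    return [prefix + text for text in texts]


documents = ["The GPU driver crashed again."]
vocabulary = ["driver", "gpu"]

# Documents fall back to the default prompt, here "query".
prefixed_documents = apply_prompt(documents)
# Vocabulary terms are treated as passages (assumption, see above).
prefixed_vocabulary = apply_prompt(vocabulary, prompt_name="passage")

print(prefixed_documents)
print(prefixed_vocabulary)
```

This also shows why the default prompt matters for precomputed embeddings: a call that names no prompt silently uses the default, so documents end up with the `query` prefix only if that is what `default_prompt_name` is set to.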