Merge pull request #65 from x-tabdeveloping/sentence-trf
Sentence Transformers update
x-tabdeveloping authored Oct 14, 2024
2 parents aaa90a7 + 75a33ab commit f3f35f5
Showing 10 changed files with 215 additions and 175 deletions.
4 changes: 2 additions & 2 deletions .github/workflows/documentation.yml
@@ -23,12 +23,12 @@ jobs:
- name: Dependencies
run: |
python -m pip install --upgrade pip
pip install "turftopic[pyro-ppl,docs]"
pip install "turftopic[pyro-ppl]" "griffe" "mkdocstrings[python]" "mkdocs" "mkdocs-material"
- name: Build and Deploy
if: github.event_name == 'push'
run: mkdocs gh-deploy --force

- name: Build
if: github.event_name == 'pull_request'
run: mkdocs build
run: mkdocs build
45 changes: 16 additions & 29 deletions README.md
@@ -20,42 +20,29 @@

> This package is still a work in progress, and scientific papers on some of the novel methods are currently undergoing peer review. If you use this package and encounter any problems, let us know by opening an issue.
### New in version 0.5.0
### New in version 0.6.0

#### Hierarchical KeyNMF
#### Prompting Embedding Models

You can now subdivide topics in KeyNMF at will.
KeyNMF and clustering topic models can now efficiently utilise asymmetric and instruction-finetuned embedding models.
This, in combination with the right embedding model, can enhance performance significantly.

```python
from turftopic import KeyNMF

model = KeyNMF(2, top_n=15, random_state=42).fit(corpus)
model.hierarchy.divide_children(n_subtopics=3)
print(model.hierarchy)
```

```
Root
├── windows, dos, os, disk, card, drivers, file, pc, files, microsoft
│ ├── 0.0: dos, file, disk, files, program, windows, disks, shareware, norton, memory
│ ├── 0.1: os, unix, windows, microsoft, apps, nt, ibm, ms, os2, platform
│ └── 0.2: card, drivers, monitor, driver, vga, ram, motherboard, cards, graphics, ati
└── 1: atheism, atheist, atheists, religion, christians, religious, belief, christian, god, beliefs
. ├── 1.0: atheism, alt, newsgroup, reading, faq, islam, questions, read, newsgroups, readers
. ├── 1.1: atheists, atheist, belief, theists, beliefs, religious, religion, agnostic, gods, religions
. └── 1.2: morality, bible, christian, christians, moral, christianity, biblical, immoral, god, religion
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer(
    "intfloat/multilingual-e5-large-instruct",
    prompts={
        "query": "Instruct: Retrieve relevant keywords from the given document. Query: ",
        "passage": "Passage: ",
    },
    # Make sure to set default prompt to query!
    default_prompt_name="query",
)
model = KeyNMF(10, encoder=encoder)
```

#### FASTopic *(Experimental)*

You can now use [FASTopic](https://github.com/BobXWu/FASTopic) inside Turftopic.

```python
from turftopic import FASTopic

model = FASTopic(10).fit(corpus)
model.print_topics()
```

## Basics [(Documentation)](https://x-tabdeveloping.github.io/turftopic/)
[![Open in Colab](https://colab.research.google.com/assets/colab-badge.svg)](https://colab.research.google.com/github/x-tabdeveloping/turftopic/blob/main/examples/basic_example_20newsgroups.ipynb)
45 changes: 45 additions & 0 deletions docs/KeyNMF.md
@@ -109,6 +109,51 @@ keyword_matrix = model.extract_keywords(corpus)
model.fit(None, keywords=keyword_matrix)
```

## Asymmetric and Instruction-tuned Embedding Models

Some embedding models can be used together with prompting, or encode queries and passages differently.
This is important for KeyNMF, as it is explicitly based on keyword retrieval, and its performance can be substantially enhanced by using asymmetric or prompted embeddings.
Microsoft's E5 models, for instance, all expect prompted input by default, and leaving the prompts out is detrimental to performance.

In these cases, you're better off NOT passing a string to Turftopic models, but explicitly loading the model using `sentence-transformers`.

Here's an example of using instruct models for keyword retrieval with KeyNMF.
In this case, documents will serve as the queries and words as the passages:

```python
from turftopic import KeyNMF
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer(
    "intfloat/multilingual-e5-large-instruct",
    prompts={
        "query": "Instruct: Retrieve relevant keywords from the given document. Query: ",
        "passage": "Passage: ",
    },
    # Make sure to set default prompt to query!
    default_prompt_name="query",
)
model = KeyNMF(10, encoder=encoder)
```
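
To make the query/passage asymmetry concrete, here is a minimal sketch of what the prompts amount to at the text level: the selected prompt string is prepended to each input before it is embedded. The helper `apply_prompt` and the toy inputs are hypothetical and only illustrate the idea; this is not Turftopic's or `sentence-transformers`' actual implementation.

```python
# Hypothetical illustration: mimics, at the string level, how a named
# prompt is prepended to each text before it is embedded.
PROMPTS = {
    "query": "Instruct: Retrieve relevant keywords from the given document. Query: ",
    "passage": "Passage: ",
}


def apply_prompt(texts: list[str], prompt_name: str) -> list[str]:
    """Prepend the prompt registered under `prompt_name` to every text."""
    prefix = PROMPTS[prompt_name]
    return [prefix + text for text in texts]


# Documents act as queries, candidate keywords act as passages.
documents = ["Windows refuses to load the graphics driver."]
vocabulary = ["driver", "windows"]

prompted_documents = apply_prompt(documents, "query")
prompted_words = apply_prompt(vocabulary, "passage")

print(prompted_documents[0])
print(prompted_words)
```

Because documents and words receive different prefixes, the model can treat them as the two sides of a retrieval task rather than as interchangeable texts.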

And a regular, asymmetric example:

```python
from turftopic import KeyNMF
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer(
    "intfloat/e5-large-v2",
    prompts={
        "query": "query: ",
        "passage": "passage: ",
    },
    # Make sure to set default prompt to query!
    default_prompt_name="query",
)
model = KeyNMF(10, encoder=encoder)
```

Setting the default prompt to `query` is especially important when you are precomputing embeddings, as documents should always be embedded with the `query` prompt.
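
When precomputing embeddings, one way to avoid relying on the default is to name the prompt explicitly. The sketch below assumes a `sentence-transformers` release recent enough that `encode` accepts a `prompt_name` argument; `embed_documents` is a hypothetical helper, and `encoder` is expected to be a `SentenceTransformer` configured with a `query` prompt as in the examples above.

```python
def embed_documents(encoder, corpus):
    """Embed documents with the `query` prompt, whatever the encoder's default is.

    Assumption: `encoder` exposes `encode(texts, prompt_name=...)`,
    as a SentenceTransformer configured with prompts does.
    """
    return encoder.encode(corpus, prompt_name="query")
```

The returned value can then be used wherever precomputed document embeddings are expected.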


## Dynamic Topic Modeling

KeyNMF is also capable of modeling topics over time.
65 changes: 63 additions & 2 deletions docs/basics.md
@@ -27,7 +27,7 @@ Here's a model that uses E5 large as the embedding model, and only learns words
from turftopic import SemanticSignalSeparation
from sklearn.feature_extraction.text import CountVectorizer

model = SemanticSignalSeparation(10, encoder="intfloat/e5-large-v2", vectorizer=CountVectorizer(min_df=20))
model = SemanticSignalSeparation(10, encoder="all-MiniLM-L6-v2", vectorizer=CountVectorizer(min_df=20))
```

You can also use external models for encoding, here's an example with [OpenAI's embedding models](encoders.md#external_embeddings):
@@ -60,6 +60,67 @@ corpus: list[str] = ["this is a a document", "this is yet another document", ...
model.fit(corpus)
```

## Prompting Embedding Models

Some embedding models can be used together with prompting, or encode queries and passages differently.
This can significantly influence performance, especially in the case of models that are based on retrieval ([KeyNMF](KeyNMF.md)) or clustering ([ClusteringTopicModel](clustering.md)).
Microsoft's E5 models, for instance, all expect prompted input by default, and leaving the prompts out is detrimental to performance.

In these cases, you're better off NOT passing a string to Turftopic models, but explicitly loading the model using `sentence-transformers`.

Here's an example for clustering models:
```python
from turftopic import ClusteringTopicModel
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer(
    "intfloat/multilingual-e5-large-instruct",
    prompts={
        "query": "Instruct: Cluster documents according to the topic they are about. Query: ",
        "passage": "Passage: ",
    },
    # Make sure to set default prompt to query!
    default_prompt_name="query",
)
model = ClusteringTopicModel(encoder=encoder)
```

You can also use instruct models for keyword retrieval with KeyNMF.
In this case, documents will serve as the queries and words as the passages:

```python
from turftopic import KeyNMF
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer(
    "intfloat/multilingual-e5-large-instruct",
    prompts={
        "query": "Instruct: Retrieve relevant keywords from the given document. Query: ",
        "passage": "Passage: ",
    },
    # Make sure to set default prompt to query!
    default_prompt_name="query",
)
model = KeyNMF(10, encoder=encoder)
```

When using KeyNMF with E5, make sure to specify the prompts even if you're not using instruct models:

```python
from turftopic import KeyNMF
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer(
    "intfloat/e5-large-v2",
    prompts={
        "query": "query: ",
        "passage": "passage: ",
    },
    # Make sure to set default prompt to query!
    default_prompt_name="query",
)
model = KeyNMF(10, encoder=encoder)
```

Setting the default prompt to `query` is especially important when you are precomputing embeddings, as documents should always be embedded with the `query` prompt.
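
A quick sanity check before precomputing embeddings can catch a misconfigured encoder. The sketch below is a hypothetical helper (`check_prompt_config` is not part of Turftopic); it only inspects a prompts dictionary and a default prompt name of the kind passed to `SentenceTransformer` in the examples above.

```python
def check_prompt_config(prompts: dict, default_prompt_name) -> list:
    """Return a list of problems; an empty list means the setup looks fine."""
    problems = []
    # Both sides of the asymmetric setup need a prompt.
    for name in ("query", "passage"):
        if name not in prompts:
            problems.append(f"missing prompt: {name!r}")
    # Documents are embedded with the default prompt, which should be `query`.
    if default_prompt_name != "query":
        problems.append("default prompt should be 'query' when embedding documents")
    return problems


print(check_prompt_config({"query": "query: ", "passage": "passage: "}, "query"))
print(check_prompt_config({"query": "query: "}, None))
```

On a loaded model, the same two values should be readable from the encoder's `prompts` and `default_prompt_name` attributes, assuming a `sentence-transformers` version with prompt support.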

## Precomputing Embeddings

In order to cut down on costs/computational load when fitting multiple models in a row, you might want to encode the documents before fitting a model.
@@ -78,7 +139,7 @@ import numpy as np
from sentence_transformers import SentenceTransformer
from turftopic import GMM, ClusteringTopicModel

encoder = SentenceTransformer("intfloat/e5-large-v2")
encoder = SentenceTransformer("intfloat/e5-large-v2", prompts={"query": "query: ", "passage": "passage: "}, default_prompt_name="query")

corpus: list[str] = ["this is a a document", "this is yet another document", ...]
embeddings = np.asarray(encoder.encode(corpus))
67 changes: 59 additions & 8 deletions docs/encoders.md
@@ -21,6 +21,65 @@ model = GMM(10, encoder="paraphrase-multilingual-MiniLM-L12-v2")
Different encoders have different performance and model sizes.
To make an informed choice about which embedding model to use, check out the [Massive Text Embedding Benchmark](https://huggingface.co/blog/mteb).

## Asymmetric and Instruction-tuned Embedding Models

Some embedding models can be used together with prompting, or encode queries and passages differently.
Microsoft's E5 models, for instance, all expect prompted input by default, and leaving the prompts out is detrimental to performance.

In these cases, you're better off NOT passing a string to Turftopic models, but explicitly loading the model using `sentence-transformers`.

Here's an example of using instruct models for keyword retrieval with KeyNMF.
In this case, documents will serve as the queries and words as the passages:

```python
from turftopic import KeyNMF
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer(
    "intfloat/multilingual-e5-large-instruct",
    prompts={
        "query": "Instruct: Retrieve relevant keywords from the given document. Query: ",
        "passage": "Passage: ",
    },
    # Make sure to set default prompt to query!
    default_prompt_name="query",
)
model = KeyNMF(10, encoder=encoder)
```

And a regular, asymmetric example:

```python
from turftopic import KeyNMF
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer(
    "intfloat/e5-large-v2",
    prompts={
        "query": "query: ",
        "passage": "passage: ",
    },
    # Make sure to set default prompt to query!
    default_prompt_name="query",
)
model = KeyNMF(10, encoder=encoder)
```

## Performance tips

From `sentence-transformers` version `3.2.0` onwards, you can significantly speed up some models by using
the `onnx` backend instead of regular torch.

```
pip install "sentence-transformers[onnx,onnx-gpu]"
```

```python
from turftopic import SemanticSignalSeparation
from sentence_transformers import SentenceTransformer

encoder = SentenceTransformer("all-MiniLM-L6-v2", backend="onnx")

model = SemanticSignalSeparation(10, encoder=encoder)
```
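
Since the `backend` argument only exists in newer releases, code that has to run against several `sentence-transformers` versions could choose the backend conditionally. The following is a rough sketch under that assumption; `pick_backend` is a hypothetical helper, and its version parsing is deliberately simplistic (only the leading digits of the first three components are compared).

```python
import re


def pick_backend(version: str) -> str:
    """Return "onnx" for sentence-transformers >= 3.2.0, otherwise "torch"."""
    numbers = []
    for piece in version.split(".")[:3]:
        # Keep only the leading digits, so "0rc1" is read as 0.
        match = re.match(r"\d+", piece)
        numbers.append(int(match.group()) if match else 0)
    # Pad short versions such as "4.0" to three components.
    numbers += [0] * (3 - len(numbers))
    return "onnx" if tuple(numbers) >= (3, 2, 0) else "torch"


print(pick_backend("3.2.0"))
print(pick_backend("2.7.0"))
```

The version string could come from `importlib.metadata.version("sentence-transformers")`. Older releases do not accept a `backend` argument at all, so it should only be passed to `SentenceTransformer` when the helper returns `"onnx"`.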

## External Embeddings

If you do not have the computational resources to run embedding models on your own infrastructure, you can also use high-quality third-party embeddings.
@@ -33,11 +92,3 @@ Turftopic currently supports OpenAI, Voyage and Cohere embeddings.
:::turftopic.encoders.OpenAIEmbeddings

:::turftopic.encoders.VoyageEmbeddings

## E5 Embeddings

Most E5 models expect the input to be prefixed with something like `"query: "` (see the [multilingual-e5-small](https://huggingface.co/intfloat/multilingual-e5-small) model card).
In instructional E5 models, it is also possible to add an instruction, following the format `f"Instruct: {task_description} \nQuery: {document}"` (see the [multilingual-e5-large-instruct](https://huggingface.co/intfloat/multilingual-e5-large-instruct) model card).
In Turftopic, E5 embeddings including the prefixing is handled by the `E5Encoder`.

:::turftopic.encoders.E5Encoder
6 changes: 3 additions & 3 deletions pyproject.toml
@@ -6,17 +6,17 @@ line-length=79

[tool.poetry]
name = "turftopic"
version = "0.5.4"
version = "0.6.0"
description = "Topic modeling with contextual representations from sentence transformers."
authors = ["Márton Kardos <[email protected]>"]
license = "MIT"
readme = "README.md"

[tool.poetry.dependencies]
python = "^3.9"
numpy = "^1.23.0"
numpy = ">=1.23.0"
scikit-learn = "^1.2.0"
sentence-transformers = "^2.2.0"
sentence-transformers = ">=2.2.0"
torch = "^2.1.0"
scipy = "^1.10.0"
rich = "^13.6.0"
2 changes: 0 additions & 2 deletions turftopic/encoders/__init__.py
@@ -2,12 +2,10 @@
from turftopic.encoders.cohere import CohereEmbeddings
from turftopic.encoders.openai import OpenAIEmbeddings
from turftopic.encoders.voyage import VoyageEmbeddings
from turftopic.encoders.e5 import E5Encoder

__all__ = [
"CohereEmbeddings",
"OpenAIEmbeddings",
"VoyageEmbeddings",
"ExternalEncoder",
"E5Encoder",
]