feat: Qdrant support #730

Merged
merged 13 commits into from
Dec 11, 2024

Conversation

Anush008
Contributor

@Anush008 Anush008 commented Nov 28, 2024

Description

This PR adds support for Qdrant (https://qdrant.tech) as an external database for vector search.

Qdrant can be run with:

docker run -p 6333:6333 qdrant/qdrant

A dashboard will be accessible at http://localhost:6333/dashboard.
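
Before pointing paper-qa at the container, it can help to confirm that the REST port is answering. Here is a minimal sketch using only the Python standard library. It assumes the default port mapping from the command above and that Qdrant's `GET /collections` endpoint answers without an API key; the helper name `qdrant_is_reachable` is made up for illustration and is not part of this PR.

```python
import urllib.error
import urllib.request


def qdrant_is_reachable(base_url: str = "http://localhost:6333", timeout: float = 2.0) -> bool:
    """Return True if a GET to <base_url>/collections answers with HTTP 200.

    Hypothetical helper (not part of paper-qa or qdrant-client). Assumes an
    unauthenticated local instance; a server that requires an API key would
    answer with an HTTP error status and be reported as unreachable here.
    """
    try:
        with urllib.request.urlopen(f"{base_url}/collections", timeout=timeout) as resp:
            return resp.status == 200
    except (urllib.error.URLError, OSError):
        # Connection refused, timeout, DNS failure, or an HTTP error status.
        return False
```

Calling `qdrant_is_reachable()` with no arguments checks the default `http://localhost:6333` address used by the `docker run` command.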

Testing

I've Q&A tested the QdrantVectorStore implementation externally.

Signed-off-by: Anush008 <[email protected]>
Signed-off-by: Anush008 <[email protected]>
@dosubot dosubot bot added size:L This PR changes 100-499 lines, ignoring generated files. enhancement New feature or request labels Nov 28, 2024
paperqa/llms.py Outdated Show resolved Hide resolved
paperqa/llms.py Outdated Show resolved Hide resolved
pyproject.toml Show resolved Hide resolved
.github/workflows/tests.yml Outdated Show resolved Hide resolved
tests/test_paperqa.py Show resolved Hide resolved
pyproject.toml Outdated Show resolved Hide resolved
Signed-off-by: Anush008 <[email protected]>
@Anush008 Anush008 force-pushed the qdrant branch 2 times, most recently from dadc7d4 to 9836e09 Compare December 5, 2024 07:59
@Anush008
Contributor Author

Anush008 commented Dec 5, 2024

Hey @jamesbraza. Could you please approve the CI?

Signed-off-by: Anush008 <[email protected]>
Signed-off-by: Anush008 <[email protected]>
@Anush008
Contributor Author

Anush008 commented Dec 6, 2024

Weird that the mailman pre-commit doesn't complain locally. I've tried to do a patch.

@Anush008
Contributor Author

Anush008 commented Dec 6, 2024

Alright. That's through.
I guess the OpenAI errors are unrelated.

Collaborator

@jamesbraza jamesbraza left a comment

A few more comments, looking good so far.

paperqa/llms.py Outdated Show resolved Hide resolved
paperqa/llms.py Outdated Show resolved Hide resolved
Signed-off-by: Anush008 <[email protected]>
@Anush008
Contributor Author

Anush008 commented Dec 7, 2024

I believe these are the same OpenAI failures.

Collaborator

@jamesbraza jamesbraza left a comment

Hi @Anush008, it looks great. Can you merge or rebase atop main for #752, and confirm if QdrantVectorStore needs any changes?

@Anush008
Contributor Author

confirm if QdrantVectorStore needs any changes?

Should be fine.

Collaborator

@jamesbraza jamesbraza left a comment

Nice work @Anush008, thanks for this.

@dosubot dosubot bot added the lgtm This PR has been approved by a maintainer label Dec 11, 2024
@jamesbraza jamesbraza merged commit 0f5c494 into Future-House:main Dec 11, 2024
3 of 5 checks passed
@Anush008 Anush008 deleted the qdrant branch December 11, 2024 01:22
@ThomasRochefortB

ThomasRochefortB commented Dec 20, 2024

@Anush008 I managed to successfully create a docs object and push it to Qdrant using the following:

from paperqa import QdrantVectorStore, Docs
from qdrant_client import QdrantClient
import nest_asyncio

nest_asyncio.apply()

client = QdrantClient(url="localhost", port=6333)
vectorstore = QdrantVectorStore(client=client,
                                collection_name="test-collection")

docs = Docs(texts_index=vectorstore)
docs.add("testpaper.pdf")
docs.texts_index.add_texts_and_embeddings(docs.texts)

My question now is:

  • Is there a clever way to rebuild the Docs() object from the QdrantVectorStore directly? There seems to be everything we need persisted in the Qdrant collection.
  • It could be useful to document this in the README.md.

@Anush008
Contributor Author

Is there a clever way to rebuild the Docs() object from the QdrantVectorStore directly? There seems to be everything we need persisted in the Qdrant collection.

Not as of yet, I think. We could add something like from_existing(...) for this purpose.

@ThomasRochefortB

For now I am using this, which seems to work:

from paperqa import QdrantVectorStore, Docs, Text, Doc
from qdrant_client import QdrantClient
import nest_asyncio
import asyncio

nest_asyncio.apply()

async def recreate_docs_from_qdrant(client: QdrantClient, collection_name: str) -> Docs:
    # Initialize empty Docs with the existing vector store
    vectorstore = QdrantVectorStore(
        client=client,
        collection_name=collection_name
    )
    
    docs = Docs(texts_index=vectorstore)
    
    # Get all points from the collection
    points = client.scroll(
        collection_name=collection_name,
        with_payload=True,
        with_vectors=True,
        limit=100  # adjust based on your needs
    )[0]
    
    # Reconstruct the texts and docs
    for point in points:
        payload = point.payload
        doc = payload['doc']
        
        if doc['dockey'] not in docs.docs:
            docs.docs[doc['dockey']] = Doc(
                docname=doc['docname'],
                citation=doc['citation'],
                dockey=doc['dockey']
            )
            docs.docnames.add(doc['docname'])
        
        # Reconstruct Text object
        text = Text(
            text=payload['text'],
            name=payload['name'],
            doc=docs.docs[doc['dockey']],
            embedding=point.vector
        )
        docs.texts.append(text)
    
    return docs

# Usage:
client = QdrantClient(url="localhost", port=6333)
docs = asyncio.run(recreate_docs_from_qdrant(client, "test-collection"))

However, I think it's clunky to reload the entire vector store into RAM. I wonder if we could just use the Qdrant store as the Docs() object itself.
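
One caveat with the snippet above: a single `client.scroll(...)` call with `limit=100` returns only the first page of points. `QdrantClient.scroll` returns a `(points, next_page_offset)` tuple, so larger collections can be read page by page. Here is a minimal sketch of that loop; the helper name `iter_all_points` is hypothetical, and it only assumes a client object whose `scroll()` behaves like qdrant-client's.

```python
from collections.abc import Iterator
from typing import Any


def iter_all_points(client: Any, collection_name: str, batch_size: int = 100) -> Iterator[Any]:
    """Yield every point in a collection, fetching one page at a time.

    Hypothetical helper. `client` is expected to behave like
    qdrant_client.QdrantClient: scroll() returns (points, next_page_offset),
    and the offset is None once the last page has been read.
    """
    offset = None
    while True:
        points, offset = client.scroll(
            collection_name=collection_name,
            limit=batch_size,
            offset=offset,
            with_payload=True,
            with_vectors=True,
        )
        yield from points
        if offset is None:
            break
```

In the function above, `for point in iter_all_points(client, collection_name):` could stand in for the loop over `points`. Because it is a generator, only one batch of raw points is held at a time, although the reconstructed `Docs` object still ends up in memory.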

@ThomasRochefortB

ThomasRochefortB commented Dec 21, 2024

@Anush008 I have created a Docs.load_docs_from_qdrant function to reconstruct the Docs() object from Qdrant in #776. Would love your opinion!

3 participants