Feat/qdrant docs reconstruct #776

Open
wants to merge 4 commits into
base: main

ThomasRochefortB

This PR adds a Docs.load_docs_from_qdrant() function in order to rebuild a Docs() object from a QdrantVectorStore.

Assuming you have built a QdrantVectorStore using something like:

from paperqa import QdrantVectorStore, Docs
from qdrant_client import QdrantClient
import nest_asyncio
import os

nest_asyncio.apply()

client = QdrantClient(url="localhost", port=6333)
vectorstore = QdrantVectorStore(client=client,
                                collection_name="test-collection")

# Loop through the ./downloaded_papers/ directory and run docs.add() on each file:
for file in os.listdir("downloaded_papers"):
    docs = Docs(texts_index=vectorstore)

    # Build the path portably instead of concatenating strings
    docs.add(os.path.join("downloaded_papers", file))

    docs.texts_index.add_texts_and_embeddings(docs.texts)

Then you can rebuild the Docs object using:

import asyncio
from qdrant_client import AsyncQdrantClient
from paperqa import Docs

async def test_load_docs_from_qdrant():
    client = AsyncQdrantClient(url="http://localhost:6333")
    docs = await Docs.load_docs_from_qdrant(
        client=client,
        collection_name="test-collection",
        vector_name=None,
        batch_size=100,
        max_concurrent_requests=5
    )
    print(docs)
    assert len(docs.texts) > 0
    for text in docs.texts:
        assert text.text is not None
        assert text.doc is not None
    return docs

if __name__ == "__main__":
    # Run the async test
    docs = asyncio.run(test_load_docs_from_qdrant())
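
The `batch_size` and `max_concurrent_requests` arguments suggest batched reads with a cap on in-flight requests. Below is a minimal sketch of that general pattern, not the PR's actual implementation: it uses an in-memory list as a stand-in for the collection, assumes pages can be addressed by offset, and `fetch_batch` and `load_all` are hypothetical names.

```python
import asyncio

# Stand-in for a remote collection; a real store would be queried over the network.
FAKE_COLLECTION = [{"id": i, "text": f"chunk {i}"} for i in range(250)]


async def fetch_batch(offset: int, batch_size: int) -> list[dict]:
    """Hypothetical helper: return one page of records starting at offset."""
    await asyncio.sleep(0)  # yield control, as a network call would
    return FAKE_COLLECTION[offset : offset + batch_size]


async def load_all(batch_size: int = 100, max_concurrent_requests: int = 5) -> list[dict]:
    """Fetch every record in pages, with at most max_concurrent_requests in flight."""
    semaphore = asyncio.Semaphore(max_concurrent_requests)

    async def bounded_fetch(offset: int) -> list[dict]:
        # The semaphore blocks here once the concurrency cap is reached.
        async with semaphore:
            return await fetch_batch(offset, batch_size)

    offsets = range(0, len(FAKE_COLLECTION), batch_size)
    # gather preserves argument order, so pages come back in offset order.
    pages = await asyncio.gather(*(bounded_fetch(o) for o in offsets))
    return [record for page in pages for record in page]


records = asyncio.run(load_all(batch_size=100, max_concurrent_requests=5))
```

A smaller `batch_size` means more requests of less data each, while `max_concurrent_requests` bounds how many of those run at once regardless of how many pages exist.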

@dosubot (bot) added the size:L (This PR changes 100-499 lines, ignoring generated files) and enhancement (New feature or request) labels on Dec 21, 2024
paperqa/docs.py Outdated
@@ -848,3 +855,120 @@ async def aquery( # noqa: PLR0912
        session.context = context_str

        return session

    async def load_docs_from_qdrant(
Collaborator

So we want to keep Docs as a lightweight and general object. Let's keep DB provider stuff out of Docs.

Can you perhaps:

  • Make this a classmethod or staticmethod on the Qdrant entity
  • Make it a free function in that module
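
The two options above could look roughly like the following sketch. It uses stand-in classes rather than paperqa's real `Docs` and `QdrantVectorStore`, and the names `SimpleDocs`, `FakeQdrantStore`, `load_docs`, and `load_docs_from_store` are hypothetical, chosen only to show where the provider-specific code would live.

```python
from dataclasses import dataclass, field


@dataclass
class SimpleDocs:
    """Stand-in for a provider-agnostic document container (not paperqa's Docs)."""

    texts: list[str] = field(default_factory=list)


@dataclass
class FakeQdrantStore:
    """Stand-in for a Qdrant-backed vector store (not paperqa's QdrantVectorStore)."""

    collection_name: str
    payloads: list[dict] = field(default_factory=list)

    @classmethod
    def load_docs(cls, collection_name: str, payloads: list[dict]) -> SimpleDocs:
        """Option 1: provider-specific loading lives on the store class."""
        store = cls(collection_name=collection_name, payloads=payloads)
        return SimpleDocs(texts=[p["text"] for p in store.payloads])


def load_docs_from_store(store: FakeQdrantStore) -> SimpleDocs:
    """Option 2: a free function defined in the same module as the store."""
    return SimpleDocs(texts=[p["text"] for p in store.payloads])


payloads = [{"text": "first chunk"}, {"text": "second chunk"}]
docs_a = FakeQdrantStore.load_docs("test-collection", payloads)
docs_b = load_docs_from_store(FakeQdrantStore("test-collection", payloads))
```

Either way, `SimpleDocs` never imports or references the store, which is the separation the review asks for.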

@ThomasRochefortB
Author

Hey @jamesbraza , thank you so much for your time on the review!

  • I have committed all of your suggestions and asks. Let me know what you think.
  • I have not touched the quirky if asyncio.iscoroutinefunction(... code as I will open a quick PR for this first.
  • I am also expecting some linting issues.
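
For context, a coroutine-function check of the kind mentioned above is typically used to call a method that may be either sync or async. The sketch below shows that general pattern only and is not the code in paperqa; `SyncStore`, `AsyncStore`, and `add_to_store` are hypothetical names, and it uses `inspect.iscoroutinefunction`, the `inspect`-module counterpart of the `asyncio` check.

```python
import asyncio
import inspect


class SyncStore:
    """Hypothetical store whose add method is a plain function."""

    def __init__(self) -> None:
        self.items: list[str] = []

    def add_texts(self, texts: list[str]) -> None:
        self.items.extend(texts)


class AsyncStore:
    """Hypothetical store whose add method is a coroutine function."""

    def __init__(self) -> None:
        self.items: list[str] = []

    async def add_texts(self, texts: list[str]) -> None:
        await asyncio.sleep(0)  # stand-in for network I/O
        self.items.extend(texts)


async def add_to_store(store, texts: list[str]) -> None:
    """Await the method if it is async, otherwise call it directly."""
    if inspect.iscoroutinefunction(store.add_texts):
        await store.add_texts(texts)
    else:
        store.add_texts(texts)


sync_store, async_store = SyncStore(), AsyncStore()
asyncio.run(add_to_store(sync_store, ["a", "b"]))
asyncio.run(add_to_store(async_store, ["a", "b"]))
```

Making every store method async removes the need for this branch, since callers can then always `await` the call.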

@ThomasRochefortB
Author

#778 includes the async changes @jamesbraza
