- Handles the retrieval of local documents in a 'data/' subdirectory
- Embeds all loaded documents in a local ChromaDB
- Allows the user to query the embedded documents
- Custom retry functions for environment variable loading and file processing
- Modular, self-contained PDFProcessor class for reuse
- Logging and extensive documentation throughout the script
- PDFProcessor Class: Handles PDF document processing, similarity search, and question answering.
- Environment Variables: Requires `OPENAI_API_KEY` for authentication with OpenAI services.
- Document Processing: Loads, splits, and prepares PDF documents for querying.
- Similarity Search: Uses Chroma for similarity searches in the document content based on user queries.
- Question Answering: Integrates a QA chain from LangChain to answer queries using processed documents.
- Initialize `PDFProcessor` to manage PDFs and set up environment variables.
- Load PDF documents from a specified directory for processing.
- Conduct a similarity search across processed documents using Chroma.
- Use a QA chain to answer questions based on the similarity search results.
- Error Handling: Implements retrying mechanisms for environment variable loading and file processing.
- PDF Loading: Utilizes `PyPDFLoader` for reading PDF files.
- Text Splitting: Splits documents into chunks for efficient processing.
- Embeddings and LLM: Uses OpenAI embeddings and language models for generating document embeddings and answering questions.
- User Interaction: Allows users to input queries for searching and answering.
- Load and process PDF documents from a directory.
- Create a Chroma object for document similarity search.
- Load a QA chain.
- Accept user queries for similarity searches and question answering.
- Display results based on the query and processed documents.
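
The retry behaviour mentioned above for environment variable loading could look roughly like the sketch below. It is not the script's actual code: the names `retry` and `load_api_key`, the parameters `max_attempts` and `delay_seconds`, and the choice of `KeyError` as the retried exception are all assumptions made for illustration. Only the `OPENAI_API_KEY` variable name comes from the list above.

```python
import functools
import os
import time


def retry(max_attempts=3, delay_seconds=0.0, exceptions=(Exception,)):
    """Hypothetical decorator: call the function again if it raises."""
    def decorator(func):
        @functools.wraps(func)
        def wrapper(*args, **kwargs):
            for attempt in range(1, max_attempts + 1):
                try:
                    return func(*args, **kwargs)
                except exceptions:
                    if attempt == max_attempts:
                        raise  # out of attempts: surface the last error
                    time.sleep(delay_seconds)  # wait before the next try
        return wrapper
    return decorator


@retry(max_attempts=3, delay_seconds=0.0, exceptions=(KeyError,))
def load_api_key(env=os.environ):
    """Hypothetical helper: return OPENAI_API_KEY or raise KeyError."""
    return env["OPENAI_API_KEY"]
```

Calling `load_api_key()` returns the key when the variable is set and raises `KeyError` after the final attempt when it is not. The same decorator could wrap a file-loading function by passing a different `exceptions` tuple, for example `(OSError,)`.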