-
Imports:
urllib.request
: Used to download PDF files from a URL.fitz
: Part of the PyMuPDF package, used for working with PDF documents.re
: Regular expressions library, used for text preprocessing.numpy
: A popular numerical library, used for working with arrays.tensorflow_hub
: TensorFlow Hub is a library for reusable machine learning models. In this case, it is used to load the Universal Sentence Encoder (USE) model.openai
: The OpenAI API library, used to interact with the GPT-3 API.gradio
: A library for creating web-based user interfaces for Python applications.
-
PDF Processing Functions: The app has several functions for downloading and processing PDF documents:
download_pdf(url, output_path)
: Downloads a PDF file from a given URL and saves it to the specified output path.preprocess(text)
: Cleans up the text by replacing newline characters and reducing multiple spaces.pdf_to_text(path, start_page=1, end_page=None)
: Converts a PDF document to a list of strings, where each string represents the text on a page.text_to_chunks(texts, word_length=150, start_page=1)
: Splits the text from each page into chunks of a specified word length.
-
SemanticSearch Class: This class is responsible for creating a semantic search engine using the Universal Sentence Encoder (USE) model. It has several methods:
__init__()
: Initializes the class and loads the USE model.fit(data, batch=1000, n_neighbors=5)
: Fits the NearestNeighbors model on the provided data (text chunks).__call__(text, return_data=True)
: Finds the nearest neighbors for a given text query and returns the related text chunks.get_text_embedding(texts, batch=1000)
: Computes the embeddings for the given texts using the USE model.
-
Load and Initialize Recommender: The code initializes a
SemanticSearch
object calledrecommender
and provides a functionload_recommender(path, start_page=1)
to load a PDF and fit the recommender on the PDF content. -
GPT-3 Text Generation: The
generate_text(prompt, engine="text-davinci-003")
andgenerate_answer(question)
functions are responsible for interacting with the GPT-3 API and generating answers based on the provided question and PDF content. -
Web Interface: Gradio library creates a web interface with URL, PDF file, and question input fields, and an answer output. The function
question_answer(url, file, question, api_key)
manages PDF download, text processing, GPT-3 answer generation, and result display.