Skip to content

Latest commit

 

History

History
26 lines (22 loc) · 2.67 KB

README.md

File metadata and controls

26 lines (22 loc) · 2.67 KB

Breakdown:

  1. Imports:

    • urllib.request: Used to download PDF files from a URL.
    • fitz: Part of the PyMuPDF package, used for working with PDF documents.
    • re: Regular expressions library, used for text preprocessing.
    • numpy: A popular numerical library, used for working with arrays.
    • tensorflow_hub: TensorFlow Hub is a library for reusable machine learning models. In this case, it is used to load the Universal Sentence Encoder (USE) model.
    • openai: The OpenAI API library, used to interact with the GPT-3 API.
    • gradio: A library for creating web-based user interfaces for Python applications.
  2. PDF Processing Functions: The app has several functions for downloading and processing PDF documents:

    • download_pdf(url, output_path): Downloads a PDF file from a given URL and saves it to the specified output path.
    • preprocess(text): Cleans up the text by replacing newline characters and reducing multiple spaces.
    • pdf_to_text(path, start_page=1, end_page=None): Converts a PDF document to a list of strings, where each string represents the text on a page.
    • text_to_chunks(texts, word_length=150, start_page=1): Splits the text from each page into chunks of a specified word length.
  3. SemanticSearch Class: This class is responsible for creating a semantic search engine using the Universal Sentence Encoder (USE) model. It has several methods:

    • __init__(): Initializes the class and loads the USE model.
    • fit(data, batch=1000, n_neighbors=5): Fits the NearestNeighbors model on the provided data (text chunks).
    • __call__(text, return_data=True): Finds the nearest neighbors for a given text query and returns the related text chunks.
    • get_text_embedding(texts, batch=1000): Computes the embeddings for the given texts using the USE model.
  4. Load and Initialize Recommender: The code initializes a SemanticSearch object called recommender and provides a function load_recommender(path, start_page=1) to load a PDF and fit the recommender on the PDF content.

  5. GPT-3 Text Generation: The generate_text(prompt, engine="text-davinci-003") and generate_answer(question) functions are responsible for interacting with the GPT-3 API and generating answers based on the provided question and PDF content.

  6. Web Interface: Gradio library creates a web interface with URL, PDF file, and question input fields, and an answer output. The function question_answer(url, file, question, api_key) manages PDF download, text processing, GPT-3 answer generation, and result display.