diff --git a/docs/2.developers/7.templates/.multimodal-rag/article.py b/docs/2.developers/7.templates/.multimodal-rag/article.py
index d78f19cf..abd4631c 100644
--- a/docs/2.developers/7.templates/.multimodal-rag/article.py
+++ b/docs/2.developers/7.templates/.multimodal-rag/article.py
@@ -26,12 +26,7 @@
 # # **Multimodal RAG for PDFs with Text, Images, and Charts**
 # + [markdown] id="TKIZcJebzBwZ"
-# ::true-img
-# ---
-# src: '/assets/content/showcases/multimodal-RAG/multimodalRAG-blog-banner.svg'
-# alt: "Multimodal RAG overview"
-# ---
-# ::
+# ![Multimodal RAG overview](/assets/content/showcases/multimodal-RAG/multimodalRAG-blog-banner.svg)
 # + [markdown] id="YQbScd7hUSrw"
 # Multimodal Retrieval-Augmented Generation (MM-RAG) systems are transforming the way you enhance Language Models and Generative AI. By incorporating a variety of data types within one application, these systems significantly expand their capabilities and applications.
@@ -119,12 +114,7 @@
 #
 # + [markdown] colab={"base_uri": "https://localhost:8080/", "height": 419} id="l_f-zWx3yFku" outputId="6a614d59-3ae6-41d2-8c7c-a4c1258c9f94"
-# ::true-img
-# ---
-# src: '/assets/content/showcases/multimodal-RAG/multimodalRAG-gpt4o.gif'
-# alt: "Multimodal RAG overview"
-# ---
-# ::
+# ![Multimodal RAG overview](/assets/content/showcases/multimodal-RAG/multimodalRAG-gpt4o.gif)
 # + [markdown] id="znH7QGu2-6qz"
 # ### Key Components of the Multimodal RAG Architecture
diff --git a/examples/notebooks/showcases/multimodal-rag.ipynb b/examples/notebooks/showcases/multimodal-rag.ipynb
index a8e785b8..ef7d33c4 100644
--- a/examples/notebooks/showcases/multimodal-rag.ipynb
+++ b/examples/notebooks/showcases/multimodal-rag.ipynb
@@ -69,12 +69,7 @@
    "id": "TKIZcJebzBwZ"
   },
   "source": [
-   "::true-img\n",
-   "---\n",
-   "src: '/assets/content/showcases/multimodal-RAG/multimodalRAG-blog-banner.svg'\n",
-   "alt: \"Multimodal RAG overview\"\n",
-   "---\n",
-   "::"
+   "![Multimodal RAG overview](https://pathway.com/assets/content/showcases/multimodal-RAG/multimodalRAG-blog-banner.svg)"
   ]
  },
 {
@@ -86,7 +81,7 @@
  "source": [
    "Multimodal Retrieval-Augmented Generation (MM-RAG) systems are transforming the way you enhance Language Models and Generative AI. By incorporating a variety of data types within one application, these systems significantly expand their capabilities and applications.\n",
    "\n",
-   "While traditional [RAG systems](https://pathway.com/blog/retrieval-augmented-generation-beginners-guide-rag-apps) primarily use and parse text, Multimodal RAG systems integrate multimedia elements such as images, audio, and video. This integration is beneficial even for use cases that might initially seem like pure text scenarios, such as handling charts, data, and information stored as images.\n",
+   "While traditional [RAG systems](/blog/retrieval-augmented-generation-beginners-guide-rag-apps) primarily use and parse text, Multimodal RAG systems integrate multimedia elements such as images, audio, and video. This integration is beneficial even for use cases that might initially seem like pure text scenarios, such as handling charts, data, and information stored as images.\n",
    "\n",
    "**By the end of this Multimodal RAG app template, you will:**\n",
    "\n",
@@ -98,7 +93,7 @@
    "\n",
    "\n",
    "\n",
-   "If you want to skip the explanations, you can directly find the code here: [Jump to Code Cell](#scrollTo=ptCZWfEz_cnc&line=1&uniqifier=1)."
+   "If you want to skip the explanations, you can directly find the code [here](#step-by-step-guide-for-multimodal-rag)."
   ]
  },
 {
@@ -123,7 +118,7 @@
  "source": [
    "### How is Multimodal RAG Different from existing RAG?\n",
    "\n",
-   "Currently, most [RAG applications](https://pathway.com/blog/retrieval-augmented-generation-beginners-guide-rag-apps) are mostly limited to text-based data. This is changing with new generative AI models like GPT-4o, Gemini Pro, Claude-3.5 Sonnet, and open-source alternatives like LLaVA, which understand both text and images. Multimodal RAG systems leverage these models to give more coherent outputs, especially for complex queries requiring diverse information formats. This approach significantly enhances performance, as demonstrated in the example below.\n",
+   "Currently, most [RAG applications](/blog/retrieval-augmented-generation-beginners-guide-rag-apps) are mostly limited to text-based data. This is changing with new generative AI models like GPT-4o, Gemini Pro, Claude-3.5 Sonnet, and open-source alternatives like LLaVA, which understand both text and images. Multimodal RAG systems leverage these models to give more coherent outputs, especially for complex queries requiring diverse information formats. This approach significantly enhances performance, as demonstrated in the example below.\n",
    "\n",
    "\n",
    "Combining this with advanced RAG techniques like adaptive RAG, reranking, and hybrid indexing further improves MM-RAG reliability."
@@ -142,12 +137,7 @@
    "outputId": "5e64ff49-f494-40c5-eea9-31e4d7218f5d"
   },
   "source": [
-   "::true-img\n",
-   "---\n",
-   "src: '/assets/content/showcases/multimodal-RAG/multimodalRAG-gpt4o_with_pathway_comparison.gif'\n",
-   "alt: \"Multimodal RAG overview\"\n",
-   "---\n",
-   "::"
+   "![Multimodal RAG overview](https://pathway.com/assets/content/showcases/multimodal-RAG/multimodalRAG-gpt4o_with_pathway_comparison.gif)"
   ]
  },
 {
@@ -198,7 +188,7 @@
  "source": [
    "## **What's the main difference between LlamaIndex and Pathway?**\n",
    "\n",
-   "Pathway offers an indexing solution that always provides the latest information to your LLM application: Pathway Vector Store preprocesses and indexes your data in real time, always giving up-to-date answers. LlamaIndex is a framework for writing LLM-enabled applications. Pathway and LlamaIndex are best [used together](https://pathway.com/developers/showcases/llamaindex-pathway). Pathway vector store is natively available in LlamaIndex."
+   "Pathway offers an indexing solution that always provides the latest information to your LLM application: Pathway Vector Store preprocesses and indexes your data in real time, always giving up-to-date answers. LlamaIndex is a framework for writing LLM-enabled applications. Pathway and LlamaIndex are best [used together](/developers/templates/llamaindex-pathway). Pathway vector store is natively available in LlamaIndex."
   ]
  },
 {
@@ -231,12 +221,7 @@
    "outputId": "6a614d59-3ae6-41d2-8c7c-a4c1258c9f94"
   },
   "source": [
-   "::true-img\n",
-   "---\n",
-   "src: '/assets/content/showcases/multimodal-RAG/multimodalRAG-gpt4o.gif'\n",
-   "alt: \"Multimodal RAG overview\"\n",
-   "---\n",
-   "::"
+   "![Multimodal RAG overview](https://pathway.com/assets/content/showcases/multimodal-RAG/multimodalRAG-gpt4o.gif)"
   ]
  },
 {
@@ -276,7 +261,7 @@
    "\n",
    "Here we use Open AI\u2019s popular Multimodal LLM, [**GPT-4o**](https://openai.com/index/hello-gpt-4o/). It\u2019s used at two key stages:\n",
    "1. **Parsing Process**: Tables are extracted as images, and GPT-4o then explains the content of these tables in detail. The explained content is saved with the document chunk into the index for easy searchability.\n",
-   " \n",
+   "\n",
    "2. **Answering Questions**: Questions are sent to the LLM with the relevant context, including parsed tables. This allows the generation of accurate responses based on the comprehensive multimodal context.\n",
    "\n"
  ]
 },
 {
@@ -288,13 +273,9 @@
   "id": "5rHNeRXI8t2C"
  },
  "source": [
-   "### Install Required Libraries\n",
-   "\n",
-   "In this cell, we install all the necessary libraries required for the project. These libraries include:\n",
+   "### Install Pathway\n",
   "\n",
-   "- **pathway[all]>=0.13.0**: The 'Pathway' package\n",
-   "- **python-dotenv==1.0.1**: Manages environment variables from a `.env` file.\n",
-   "- **mpmath==1.3.0**: A library for arbitrary-precision arithmetic."
+   "Install Pathway and all its optional packages."
  ]
 },
 {
@@ -307,7 +288,7 @@
  },
  "outputs": [],
  "source": [
-   "!pip install 'pathway[all]>=0.13.0' python-dotenv==1.0.1 mpmath==1.3.0"
+   "!pip install 'pathway[all]>=0.13.0'"
  ]
 },
 {
@@ -319,7 +300,7 @@
  "source": [
   "### Set Up OpenAI API Key\n",
   "\n",
-   "In this cell, we set the OpenAI API key as an environment variable. Replace the placeholder with your actual API key.\n"
+   "Set the OpenAI API key as an environment variable. Replace the placeholder with your actual API key.\n"
  ]
 },
 {
@@ -332,7 +313,7 @@
  },
  "outputs": [],
  "source": [
-   "OPENAI_API_KEY = \"Paste Your openAI API Key here\""
+   "OPENAI_API_KEY = \"Paste Your OpenAI API Key here\""
  ]
 },
 {
@@ -361,17 +342,16 @@
   "\n",
   "This cell sets up necessary imports and environment variables for using Pathway and related functionalities.\n",
   "\n",
-   "### Key Imports:\n",
+   "#### Key Imports:\n",
   "- **pathway**: Main library for document processing and question answering.\n",
-   "- **dotenv**: Loads environment variables from a `.env` file.\n",
   "- **logging**: Captures logs for debugging.\n",
   "\n",
-   "### Modules:\n",
-   "- **udfs.DiskCache, ExponentialBackoffRetryStrategy**: Modules for caching and retry strategies.\n",
-   "- **xpacks.llm**: Various tools for leveraging Large Language Models effectively.\n",
-   "- **llm.parsers.OpenParse**: The `OpenParse` class efficiently handles document parsing tasks, including text extraction and table parsing, providing a streamlined approach for document analysis and content extraction.\n",
-   "- **llm.question_answering.BaseRAGQuestionAnswerer**: Sets up a base model for question answering using RAG.\n",
-   "- **llm.vector_store.VectorStoreServer**: Handles document vector storage and retrieval.\n"
+   "#### Modules:\n",
+   "- **[udfs.DiskCache](/developers/api-docs/udfs#pathway.udfs.DiskCache), [udfs.ExponentialBackoffRetryStrategy](/developers/api-docs/udfs#pathway.udfs.ExponentialBackoffRetryStrategy)**: Modules for caching and retry strategies.\n",
+   "- **[xpacks.llm](/developers/user-guide/llm-xpack/overview)**: Various tools for leveraging Large Language Models effectively.\n",
+   "- **[llm.parsers.OpenParse](/developers/api-docs/pathway-xpacks-llm/parsers)**: The `OpenParse` class efficiently handles document parsing tasks, including text extraction and table parsing, providing a streamlined approach for document analysis and content extraction.\n",
+   "- **[llm.question_answering.BaseRAGQuestionAnswerer](/developers/api-docs/pathway-xpacks-llm/question_answering)**: Sets up a base model for question answering using RAG.\n",
+   "- **[llm.vector_store.VectorStoreServer](/developers/api-docs/pathway-xpacks-llm/vectorstore)**: Handles document vector storage and retrieval.\n"
  ]
 },
 {
@@ -384,19 +364,17 @@
    "height": 17
   },
   "id": "_2spSj2kbDfW",
+  "lines_to_next_cell": 2,
   "outputId": "d7a412a6-a295-4fcb-fb10-435d1e8b192d"
  },
  "outputs": [],
  "source": [
   "import logging\n",
   "\n",
-   "os.environ[\"TESSDATA_PREFIX\"] = (\n",
-   " \"/usr/share/tesseract/tessdata/\"\n",
-   ")\n",
+   "os.environ[\"TESSDATA_PREFIX\"] = \"/usr/share/tesseract/tessdata/\"\n",
   "import pathway as pw\n",
   "from pathway.udfs import DiskCache, ExponentialBackoffRetryStrategy\n",
-   "from pathway.xpacks.llm import embedders, llms, parsers, prompts, splitters\n",
-   "from pathway.xpacks.llm.parsers import OpenParse\n",
+   "from pathway.xpacks.llm import embedders, llms, parsers, prompts\n",
   "from pathway.xpacks.llm.question_answering import BaseRAGQuestionAnswerer\n",
   "from pathway.xpacks.llm.vector_store import VectorStoreServer"
  ]
 },
 {
@@ -405,22 +383,6 @@
  "cell_type": "code",
  "execution_count": null,
  "id": "22",
- "metadata": {
-  "id": "lZyfwPQ_d-SH",
-  "lines_to_next_cell": 2
- },
- "outputs": [],
- "source": [
-  "# To use advanced features with Pathway Scale, get your free license key from\n",
-  "# https://pathway.com/features and paste it below.\n",
-  "# To use Pathway Community, comment out the line below.\n",
-  "pw.set_license_key(\"demo-license-key-with-telemetry\")"
- ]
- },
- {
- "cell_type": "code",
- "execution_count": null,
- "id": "23",
  "metadata": {
   "id": "OszIyJyOd_X7"
  },
@@ -435,7 +397,7 @@
 },
 {
  "cell_type": "markdown",
- "id": "24",
+ "id": "23",
  "metadata": {
   "id": "_iGdygT2uYQ1"
  },
@@ -445,87 +407,84 @@
 },
 {
  "cell_type": "markdown",
- "id": "25",
+ "id": "24",
  "metadata": {
   "id": "ublsBY0GFAbk"
  },
  "source": [
   "#### **Create Data Directory**\n",
   "\n",
-   "Create a 'data' directory if it doesn't already exist. This is where the uploaded files will be stored.Then upload your pdf document.\n",
+   "Create a 'data' directory if it doesn't already exist. This is where the uploaded files will be stored. Then upload your pdf document.\n",
   "\n",
-   "You can also omit this cell if you are running locally on your system. Create a data folder in the current directory and upload the files. In that case please comment out this cell as this is for colab only."
+   "You can also omit this cell if you are running locally on your system. Create a data folder in the current directory and copy the files. In that case please comment out this cell as this is for colab only.\n",
+   "Create the `data` folder if it doesn't exist"
  ]
 },
 {
  "cell_type": "code",
  "execution_count": null,
- "id": "26",
- "metadata": {
-  "id": "W6LonnfM9UqW"
- },
+ "id": "25",
+ "metadata": {},
  "outputs": [],
  "source": [
-   "# Create the 'data' folder if it doesn't exist\n",
   "!mkdir -p data"
  ]
 },
 {
  "cell_type": "code",
  "execution_count": null,
- "id": "27",
+ "id": "26",
  "metadata": {
   "id": "Jaqzsw6r9V8u"
  },
  "outputs": [],
  "source": [
-   "%%capture --no-display\n",
   "# default file you can use to test\n",
   "# to use your own data via the Colab UI, click on the 'Files' tab in the left sidebar, go to data folder (that was created prior to this) then drag and drop your files there.\n",
   "\n",
-   "!wget -P ./data/ https://github.com/pathwaycom/llm-app/raw/main/examples/pipelines/gpt_4o_multimodal_rag/data/20230203_alphabet_10K.pdf"
+   "!wget -q -P ./data/ https://github.com/pathwaycom/llm-app/raw/main/examples/pipelines/gpt_4o_multimodal_rag/data/20230203_alphabet_10K.pdf"
  ]
 },
 {
  "cell_type": "markdown",
- "id": "28",
+ "id": "27",
  "metadata": {
   "id": "D7HFGv7ZFl_g"
  },
  "source": [
-   "#### **Read Document**\n",
+   "#### **Read Documents**\n",
   "\n",
-   "Read the document from the data folder. This cell assumes that the uploaded file is a sample document in the 'data' folder.\n"
+   "Read the documents from the data folder. This cell assumes that the uploaded files are in the `data` folder.\n"
  ]
 },
 {
  "cell_type": "code",
  "execution_count": null,
- "id": "29",
+ "id": "28",
  "metadata": {
   "id": "vegYHrXjeufm"
  },
  "outputs": [],
  "source": [
   "folder = pw.io.fs.read(\n",
-   "    path=\"./data/\",\n",
-   "    format=\"binary\",\n",
-   "    with_metadata=True,\n",
-   "    )\n",
+   "    path=\"./data/\",\n",
+   "    format=\"binary\",\n",
+   "    with_metadata=True,\n",
+   ")\n",
   "sources = [\n",
-   "    folder,\n",
-   "    ] # define the inputs (local folders & files, google drive, sharepoint, ...)\n",
+   "    folder,\n",
+   "] # define the inputs (local folders & files, google drive, sharepoint, ...)\n",
   "chat = llms.OpenAIChat(\n",
-   "    model=\"gpt-4o\",\n",
-   "    retry_strategy=ExponentialBackoffRetryStrategy(max_retries=6),\n",
-   "    cache_strategy=DiskCache(),\n",
-   "    temperature=0.0,\n",
-   "    )"
+   "    model=\"gpt-4o\",\n",
+   "    retry_strategy=ExponentialBackoffRetryStrategy(max_retries=6),\n",
+   "    cache_strategy=DiskCache(),\n",
+   "    temperature=0.0,\n",
+   ")"
  ]
 },
 {
  "cell_type": "markdown",
- "id": "30",
+ "id": "29",
  "metadata": {
   "id": "63WRqIkRF3k5"
  },
@@ -538,7 +497,7 @@
 {
  "cell_type": "code",
  "execution_count": null,
- "id": "31",
+ "id": "30",
  "metadata": {
   "id": "ICYNrImFe4u9",
   "lines_to_next_cell": 2
  },
@@ -554,7 +513,7 @@
 {
  "cell_type": "code",
  "execution_count": null,
- "id": "32",
+ "id": "31",
  "metadata": {
   "colab": {
    "base_uri": "https://localhost:8080/"
  },
@@ -583,7 +542,7 @@
 {
  "cell_type": "code",
  "execution_count": null,
- "id": "33",
+ "id": "32",
  "metadata": {
   "id": "znnkEunOlrA0"
  },
@@ -598,7 +557,7 @@
 },
 {
  "cell_type": "markdown",
- "id": "34",
+ "id": "33",
  "metadata": {
   "id": "SriYOxHiBn_6"
  },
@@ -609,7 +568,7 @@
 {
  "cell_type": "code",
  "execution_count": null,
- "id": "35",
+ "id": "34",
  "metadata": {
   "id": "063pnlvpqB0q",
   "lines_to_next_cell": 2
 },
@@ -621,24 +580,12 @@
 },
 {
  "cell_type": "markdown",
- "id": "36",
- "metadata": {
-  "id": "RhkXzt5cGBW4"
- },
- "source": [
-  "\n",
-  "\n",
-  "List documents processed by the server using the `requests` library. This is an alternative to using the curl command.\n"
- ]
- },
- {
- "cell_type": "markdown",
- "id": "37",
+ "id": "35",
  "metadata": {
   "id": "BYNQh0JPGKBA"
  },
  "source": [
-   "#### **Ask Questions and Get answers**\n",
+   "#### **Ask Questions and Get Answers**\n",
   "\n",
   "Query the server to get answers from the documents. This cell sends a prompt to the server and receives the response.\n",
   "\n",
@@ -648,7 +595,7 @@
 {
  "cell_type": "code",
  "execution_count": null,
- "id": "38",
+ "id": "36",
  "metadata": {
   "colab": {
    "base_uri": "https://localhost:8080/"
  },
@@ -664,14 +611,14 @@
 },
 {
  "cell_type": "markdown",
- "id": "39",
+ "id": "37",
  "metadata": {
   "id": "S3zsr-NGop8B"
  },
  "source": [
   "## **Conclusion**\n",
   "This is how you can easily implement a Multimodal RAG Pipeline using GPT-4o and Pathway. You used the [BaseRAGQuestionAnswerer](https://pathway.com/developers/api-docs/pathway-xpacks-llm/question_answering/#pathway.xpacks.llm.question_answering.BaseRAGQuestionAnswerer) class from [pathway.xpacks](https://pathway.com/developers/user-guide/llm-xpack/overview/), which integrates the foundational components for our RAG application, including data ingestion, LLM integration, database creation and querying, and serving the application on an endpoint. For more advanced RAG options, you can explore [rerankers](https://pathway.com/developers/api-docs/pathway-xpacks-llm/rerankers/#pathway.xpacks.llm.rerankers.CrossEncoderReranker) and the [adaptive RAG example](https://pathway.com/developers/showcases/adaptive-rag).\n",
-   "For implementing this example using open source LLMs, here\u2019s a [private RAG app template](https://pathway.com/developers/showcases/private-rag-ollama-mistral) that you can use as a starting point. It will help you run the entire application locally making it ideal for use-cases with sensitive data and eXplainable AI needs.\n",
+   "For implementing this example using open source LLMs, here\u2019s a [private RAG app template](https://pathway.com/developers/showcases/private-rag-ollama-mistral) that you can use as a starting point. It will help you run the entire application locally making it ideal for use-cases with sensitive data and explainable AI needs. You can do this within Docker as well by following the steps in [Pathway\u2019s LLM App templates](https://github.com/pathwaycom/llm-app/tree/main/examples/pipelines/gpt_4o_multimodal_rag) repository.\n",
   "\n",
   "\n",
   "To explore more app templates and advanced use cases, visit [Pathway App Templates](https://pathway.com/developers/showcases) or Pathway\u2019s [official blog](https://pathway.com/blog)."