Generalize RAG + PDF Chat feature #641

mishig25 · 2023-12-18T14:16:47Z

TLDR: implement PDF-chat feature

Update 1 here

Closes #609

When a user uploads a PDF:

Parse the PDF text, create embeddings, and save the embeddings in the files bucket (the same bucket used to store images for multimodal models)
On subsequent messages in that conversation, use the PDF embeddings for RAG
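
For readers skimming the thread, the two steps above can be sketched roughly as follows. This is a minimal sketch and not the code in this PR: `embed` is a toy bag-of-words stand-in for a real embedding model, and `chunkText`, `indexPdfText` and `retrieveContext` are hypothetical names chosen for illustration.

```typescript
// Sketch only. `embed` is a toy bag-of-words stand-in for a real embedding
// model, and every name below is hypothetical.
type SparseVector = Record<string, number>;
type EmbeddedChunk = { text: string; vector: SparseVector };

// Split extracted PDF text into sentence-like chunks.
function chunkText(text: string): string[] {
	const sentences: string[] = text.match(/[^.!?]+[.!?]*/g) || [];
	return sentences.map((s) => s.trim()).filter((s) => s.length > 0);
}

// Toy embedding: lowercase word counts instead of a dense model output.
function embed(text: string): SparseVector {
	const vector: SparseVector = Object.create(null);
	const words: string[] = text.toLowerCase().match(/[a-z0-9]+/g) || [];
	for (const word of words) {
		vector[word] = (vector[word] || 0) + 1;
	}
	return vector;
}

function cosineSimilarity(a: SparseVector, b: SparseVector): number {
	let dot = 0;
	let normA = 0;
	let normB = 0;
	for (const key of Object.keys(a)) {
		normA += a[key] * a[key];
		dot += a[key] * (b[key] || 0);
	}
	for (const key of Object.keys(b)) {
		normB += b[key] * b[key];
	}
	return normA === 0 || normB === 0 ? 0 : dot / Math.sqrt(normA * normB);
}

// Step 1 (on upload): embed every chunk once so the result can be stored.
function indexPdfText(text: string): EmbeddedChunk[] {
	return chunkText(text).map((chunk) => ({ text: chunk, vector: embed(chunk) }));
}

// Step 2 (on each later message): rank the stored chunks against the prompt.
function retrieveContext(index: EmbeddedChunk[], prompt: string, topK = 2): string[] {
	const query = embed(prompt);
	return index
		.map((chunk) => ({ text: chunk.text, score: cosineSimilarity(chunk.vector, query) }))
		.sort((x, y) => y.score - x.score)
		.slice(0, topK)
		.map((chunk) => chunk.text);
}
```

The point of the split is that the expensive part (embedding the document) runs once at upload time, while each message only embeds the prompt and ranks the stored chunks.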

Limitations

Only the first 20 pages of the PDF are parsed (this limit can be raised or lowered)
A conversation can currently have only one uploaded PDF. Uploading a new PDF overwrites the existing one, if any
When the user enables websearch, websearch RAG is used and PDF RAG is not.
Just like websearch RAG, once PDF RAG is enabled, every message in that conversation uses it. (In a subsequent PR, we need to use prompting and other techniques so that the tool is used only when it makes sense)
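
To make the limitations concrete, here is a small sketch of the rules they describe. It is an illustration under assumptions, not the PR's implementation: `ConversationState`, `pagesToParse`, `uploadPdf` and `selectRagSource` are hypothetical names; only the 20-page cap comes from the description.

```typescript
// Hypothetical names throughout; only the 20-page cap is from the PR description.
const N_MAX_PAGES = 20;

type RagSource = "websearch" | "pdfChat" | "none";

interface ConversationState {
	webSearchEnabled: boolean;
	pdfFileName?: string; // at most one PDF per conversation
}

// Only the first `maxPages` pages are parsed (page numbers are 1-based).
function pagesToParse(numPages: number, maxPages: number = N_MAX_PAGES): number[] {
	const count = Math.min(numPages, maxPages);
	const pages: number[] = [];
	for (let page = 1; page <= count; page++) {
		pages.push(page);
	}
	return pages;
}

// Uploading again replaces whatever PDF the conversation already had.
function uploadPdf(state: ConversationState, fileName: string): ConversationState {
	return { ...state, pdfFileName: fileName };
}

// Web search takes priority; otherwise PDF RAG applies to every message.
function selectRagSource(state: ConversationState): RagSource {
	if (state.webSearchEnabled) return "websearch";
	if (state.pdfFileName !== undefined) return "pdfChat";
	return "none";
}
```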

Testing locally

Install the new pdf-parse dependency with npm ci:

npm ci
npm run dev -- --open

Screen recording

Testing by uploading Mamba paper

Screen.Recording.2023-12-18.at.3.46.40.PM.mov

nsarrazin · 2023-12-18T15:00:41Z

🔥 So cool, will test it locally later, but just from looking at the demo, do you think there's an easy way to show an indicator of when a PDF is already uploaded and will be sent with the message ?

mishig25 · 2023-12-18T15:07:51Z

but just from looking at the demo, do you think there's an easy way to show an indicator of when a PDF is already uploaded and will be sent with the message ?

At the moment, there is a websearch-like box that indicates PDF RAG was used

nsarrazin · 2023-12-18T15:12:46Z

I meant more when the file is loaded and before the conversation is started, like for images:

I guess it would look a bit different since you can only have one PDF per conversation, but it would be nice to have an indication that a PDF will be used to answer the query 👀

mishig25 · 2023-12-18T15:15:01Z

I see. Let me think about it

gary149 · 2023-12-20T10:31:36Z

src/routes/conversation/[id]/upload-pdf/+server.ts

+	const loadingTask = pdfjsLib.getDocument({ data });
+	const pdf = await loadingTask.promise;
+
+	const N_MAX_PAGES = 20;


This seems a bit low, 100 or 200 pages maybe?

mishig25 · 2023-12-20

I see. 100 or 200 pages would be a bit slow to create embeddings for on CPU using the current transformers.js solution (I will provide actual benchmark numbers).

In this case, should I review/push for the community PR #646, which makes it possible to have an embedding endpoint for faster embedding creation? We can still use transformers.js embeddings (the current approach) for websearch, and use a TEI-powered embeddings endpoint for PDF embedding creation & possibly for assistants #639 (if users can upload documents). Wdyt?
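
As a rough illustration of the endpoint idea, a larger PDF's chunks could be sent to a remote embedding backend in batches. This is only a sketch under assumptions: `embedInBatches` and `EmbedBatch` are hypothetical names, and the backend is passed in as a function so nothing here depends on how #646 is actually wired up. For a TEI server, that function could POST `{ inputs: batch }` to the `/embed` route.

```typescript
// Hypothetical helper: `embedBatch` is whatever backend is configured.
// For a TEI server it could POST { inputs: batch } to the `/embed` route.
type EmbedBatch = (batch: string[]) => Promise<number[][]>;

async function embedInBatches(
	texts: string[],
	batchSize: number,
	embedBatch: EmbedBatch
): Promise<number[][]> {
	if (batchSize < 1) throw new Error("batchSize must be at least 1");
	const vectors: number[][] = [];
	for (let start = 0; start < texts.length; start += batchSize) {
		const batch = texts.slice(start, start + batchSize);
		const result = await embedBatch(batch);
		// One vector per input text is expected back, in the same order.
		if (result.length !== batch.length) {
			throw new Error("embedding backend returned an unexpected number of vectors");
		}
		vectors.push(...result);
	}
	return vectors;
}
```

Keeping the backend behind a function type means the same loop works for a local model or a remote endpoint; the batch size is a tuning knob, not a value taken from this PR.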

src/lib/buildPrompt.ts

gary149 · 2023-12-20T10:43:35Z

src/lib/components/UploadBtn.svelte

 	/>
-	<CarbonUpload class="mr-2 text-xs " /> Upload image
+	<CarbonUpload class="mr-2 text-xs " />
+	{#if uploadPdfStatus === PdfUploadStatus.Uploaded}


Yes as already said, you probably want to display uploaded files somewhere in the UI.

handled in #641 (comment)

nsarrazin

The feature itself works super well, I'm super impressed by what it can do tbh 🔥

There's just a couple things that are not super clear to me around the user experience that I think could be improved:

Why does it create an empty conversation on file upload ? Seems to me like it should show that a PDF has been uploaded client-side, and create the conversation with the PDF only when the first message is sent, the way we deal with images
Would be cool to have an indicator in a conversation that shows that a PDF is available in a specific conversation. Like something at the top of the conversation that says "${filename}.pdf is available in this conversation" or something
It's not super clear from the UI perspective what happens when you upload a PDF to a conversation that already has one. I think it silently replaces the old PDF with the new one, but it's a bit confusing. I think we could just make it so that PDFs can only be uploaded at the beginning of a conversation, wdyt?

I also left a couple of fixes for type checking as comment/suggestions in the PR

nsarrazin · 2023-12-21T12:24:55Z

src/lib/types/PdfChat.ts

+	context: string;
+}
+
+/* eslint-disable no-shadow */


Suggested change

/* eslint-disable no-shadow */

I think this can be removed by changing .eslintrc.cjs

- "no-shadow": ["error"],
+ "@typescript-eslint/no-shadow": "error",

nsarrazin · 2023-12-21T12:29:42Z

src/lib/components/OpenPdfSearchResults.svelte

@@ -0,0 +1,114 @@
+<script lang="ts">


Would it have been possible to reuse OpenWebSearchResults.svelte and just make it a generic OpenResults maybe ? Looking at the diff between the two it looks like the only difference is the button name ("PDF search" and "Web search") and the input type

src/routes/conversation/[id]/upload-pdf/+server.ts

nsarrazin · 2023-12-21T13:13:28Z

src/routes/+page.svelte

 </script>

 <svelte:head>
 	<title>{PUBLIC_APP_NAME}</title>
 </svelte:head>

 <ChatWindow
-	on:message={(ev) => createConversation(ev.detail)}
+	on:message={(ev) => createConversationWithMsg(ev.detail)}
+	on:uploadpdf={(ev) => createConversationWithPdf(ev.detail)}


I'm not sure why we create an empty chat with a document. It seems to me like it should be done like for images, where the files are "stored" in the front-end and the conversation is created when the first message is sent?

src/routes/conversation/[id]/upload-pdf/+server.ts

src/lib/components/chat/ChatWindow.svelte

src/lib/components/UploadBtn.svelte

nsarrazin · 2023-12-21T14:31:26Z

Small note, if I drag and drop a non-pdf file I get the following weird output

mishig25 · 2024-01-09T12:53:43Z

Updates:

Fixed the bug with uploading a non-image file here
Provided better UI/UX for the uploaded file (see attached video). Specifically: 1. the name of the uploaded PDF appears with a PDF icon; 2. the file name and icon show a "blinking" animation while the PDF is being uploaded & embeddings are being created; 3. on hover, an x button appears that lets you delete the uploaded PDF file
Added an env var (config) for enabling the pdf-chat feature, as here

Screen.Recording.2024-01-09.at.1.45.02.PM.mov

What I'm working on now:

Address nathan's comments here & here
Test creating embeddings on a large PDF file with TEI (Add embedding models configurable, from both transformers.js and TEI #646)

wdhorton · 2024-01-09T22:18:49Z

Thanks for working on this! One question I had is: what was your thought process for adding a new upload button for PDFs, versus using the existing drag-and-drop functionality that already exists for images?

mishig25 · 2024-01-10T09:27:22Z

@wdhorton currently the UI might still evolve. For now, the reason for reusing the same upload btn (instead of adding a new upload btn) is: having two different upload btns makes the UI look cluttered, especially on smaller screens

itaybar · 2024-01-10T11:44:25Z

I have a few questions:

Why are you limiting this feature to PDF and not CSV, TXT, etc.?
I'm not sure you have to use embeddings for PDFs (or at least make them optional rather than mandatory)
Why use Mongo as the vectorDB and calculate the vector similarity on the client side instead of using a real vectorDB, which can do it way more efficiently and faster than running it in JS? This could make the code cleaner and remove any limitation on content size
Is storing all the embeddings a good idea? This can make the DB blow up relatively fast.

mishig25 · 2024-01-12T10:40:20Z

Update: this PR is getting big. Unfortunately, there is no other option (I think). The specific points are:

In Generlize RAG #689, I've generalized RAG. What does that mean? It means two things: 1. server side; 2. frontend side. On the server side, RAG applications have to implement a RAG interface for consistency & better organization of the codebase (you can check out the directory src/lib/server/rag). On the frontend side, the OpenWebSearchResults svelte component is generalized into an OpenRAGResults svelte component that will show up on RAG-augmented messages (as suggested here).
Besides creating PDF embeddings through TEI for pdf-chat, we would need vectorDB support for multiple reasons:
1. Without a vectorDB, the pdf-chat session will lose the PDF embeddings (for instance, when you close your browser & re-open the same chat-ui conversation, the PDF embeddings will no longer be available)
2. Storing PDF embeddings in Mongo GridFS would slow down performance (as questioned here), & a large number of embeddings can cause a lot of latency on the server, since findSimilarSentences runs locally on the server. Therefore, we would need support for a vectorDB.
3. We would need a vectorDB for other features as well. For instance, we would need vectorDB + PDF RAG for the Assistants feature #639. There was also an internal Slack discussion here.
4. VectorDB support is a more general feature that can have multiple applications. For instance, PDF-chat is just a special case of vectorDB chat, since in PDF-chat one uses the PDF to populate the vectorDB & afterwards it becomes just chat with the vectorDB
5. For info, I'm checking openai/chatgpt-retrieval-plugin to see if we would need to follow a commonly used API for vectorDBs
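
To illustrate the server-side point, a common RAG interface with a registry might look roughly like the sketch below. Only `RAGs`, the `"pdfChat"` key and the `retrieveRagContext(conv, newPrompt, update)` call shape appear in this PR's diff; the types, fields and the keyword-matching body are assumptions made up for the example, not the contents of src/lib/server/rag.

```typescript
// Hypothetical shapes: only `RAGs`, "pdfChat" and `retrieveRagContext`
// appear in the PR; everything else is assumed for illustration.
interface Conversation {
	id: string;
	pdfChunks?: string[];
}

interface RagContext {
	type: string;
	context: string;
}

type UpdateFn = (status: string) => void;

interface RAG {
	type: string;
	retrieveRagContext(
		conv: Conversation,
		prompt: string,
		update: UpdateFn
	): Promise<RagContext | undefined>;
}

const pdfChat: RAG = {
	type: "pdfChat",
	async retrieveRagContext(conv, prompt, update) {
		// No PDF in this conversation: return nothing instead of an empty context.
		if (!conv.pdfChunks || conv.pdfChunks.length === 0) return undefined;
		update("Searching PDF");
		// Naive keyword match as a placeholder for embedding similarity.
		const words = prompt.toLowerCase().split(/\s+/).filter((w) => w.length > 3);
		const hits = conv.pdfChunks.filter((chunk) =>
			words.some((w) => chunk.toLowerCase().indexOf(w) !== -1)
		);
		return { type: "pdfChat", context: hits.join("\n") };
	},
};

// Every RAG application registers under a key and is called the same way.
const RAGs: Record<string, RAG> = { pdfChat };
```

With this shape, swapping the storage behind `retrieveRagContext` for a vectorDB would not change the callers, which is the main appeal of the interface.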

Should I open a PR for vectorDB support against this branch? wdyt @nsarrazin @gary149

itaybar · 2024-01-12T10:47:08Z

@itaybar, thanks a lot for your questions

Why are you limiting this feature to PDF and not CSV, TXT, etc.?

Yes, we will add support for other text files. Once this PR is done, supporting other text files would be trivial. (Might even include it as part of this PR)

I'm not sure you have to use embeddings for PDFs (or at least make them optional rather than mandatory)

Could you elaborate on that? What would the alternatives be?

Why use Mongo as the vectorDB and calculate the vector similarity on the client side instead of using a real vectorDB, which can do it way more efficiently and faster than running it in JS? This could make the code cleaner and remove any limitation on content size. Is storing all the embeddings a good idea? This can make the DB blow up relatively fast.

Indeed, this is a good point. I/we will add support for a vectorDB (likely as part of this PR)

Thanks for the quick response guys!
About the optional embedding for the PDF: correct me if I'm wrong, but you can just read the text from the PDF without using embeddings for the images, etc., and thereby remove the hard limits on file content

itaybar · 2024-01-16T14:04:56Z

@mishig25 What is your estimate for when this will be merged?

mishig25 · 2024-01-16T16:38:04Z

Can't give an exact date. But will do my best to merge it soon :)

zzen0008 · 2024-01-20T12:39:35Z

src/routes/conversation/[id]/+server.ts

+				pdfSearchResults = await RAGs["pdfChat"].retrieveRagContext(conv, newPrompt, update);
+			}
+
+			messages[messages.length - 1].ragContext = pdfSearchResults;


This causes a bug with web search.
messages[messages.length - 1].ragContext = webSearchResults is overwritten with undefined if the user has not uploaded a PDF document.

Submitted a PR to fix the issue #745
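
To spell out the failure mode, here is a minimal sketch with simplified, hypothetical types (`Message`, `RagContext`, `attachBuggy`, `attachGuarded`). It shows one possible guard and is not necessarily how #745 resolves it.

```typescript
// Simplified, hypothetical types for illustration only.
interface RagContext {
	type: "websearch" | "pdfChat";
	context: string;
}

interface Message {
	content: string;
	ragContext?: RagContext;
}

// Buggy pattern: an unconditional assignment wipes an earlier web search
// result whenever no PDF was uploaded (pdfSearchResults is undefined).
function attachBuggy(message: Message, pdfSearchResults?: RagContext): void {
	message.ragContext = pdfSearchResults;
}

// Guarded pattern: only overwrite when PDF retrieval produced a result.
function attachGuarded(message: Message, pdfSearchResults?: RagContext): void {
	if (pdfSearchResults !== undefined) {
		message.ragContext = pdfSearchResults;
	}
}
```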

johndpope · 2024-02-03T21:48:15Z

Merge. hire 10 more people to help.

mishig25 · 2024-02-04T10:40:59Z

Merge

Will merge soon

hire 10 more people to help.

Hiring 10 more people rarely results in 2x productivity (let alone 10x)

flexchar · 2024-02-06T15:10:32Z

It's a tough one to implement. I appreciate the work on this one.

I had a couple of thoughts on this I'd like to share.

First, I thought of a plugin-like system where each plugin is registered based on the file type and has a handler for processing/storing and a handler for retrieval. That would allow the community to scale without choking the HF developers, who I'm beyond impressed are able to deliver such a variety of products.

Alternatively, it could also be a third party API - much like OpenAI function calling works - so that the responsibility is NOT on you but on the end user who chooses to deploy. It's great for developers but would probably be a pain for those who just want to feel the power of deploying and have no use beyond that (inspired by the shutdown story of banana.dev).

I believe these would inherently fit better as the nature of open-source deployments is to customize. As such, there is an infinite number of use cases and solutions... PDFs, images, audio files, web search as input; ChromaDB, Pinecone, Qdrant, PGVector, Meilisearch as storage/retrieval to name a few.
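
As a rough sketch of the plugin idea (purely hypothetical, nothing like this exists in the PR): a registry keyed by MIME type, where each plugin brings one handler for processing/storing and one for retrieval. `FilePlugin`, `PluginRegistry` and `plainTextPlugin` are made-up names, and the substring match stands in for whatever storage backend a plugin would really use.

```typescript
// Purely illustrative: none of these names exist in chat-ui.
interface FilePlugin {
	name: string;
	mimeTypes: string[];
	// Turn an uploaded file's content into stored chunks.
	process(content: string): string[];
	// Pick the stored chunks relevant to a query.
	retrieve(chunks: string[], query: string): string[];
}

class PluginRegistry {
	private byMimeType: Record<string, FilePlugin> = Object.create(null);

	register(plugin: FilePlugin): void {
		for (const mimeType of plugin.mimeTypes) {
			this.byMimeType[mimeType] = plugin;
		}
	}

	resolve(mimeType: string): FilePlugin | undefined {
		return this.byMimeType[mimeType];
	}
}

// Example plugin: one chunk per non-empty line, case-insensitive substring search.
const plainTextPlugin: FilePlugin = {
	name: "plain-text",
	mimeTypes: ["text/plain", "text/csv"],
	process: (content) => content.split("\n").filter((line) => line.trim().length > 0),
	retrieve: (chunks, query) =>
		chunks.filter((chunk) => chunk.toLowerCase().indexOf(query.toLowerCase()) !== -1),
};
```

A deployment would register only the plugins it needs, and an unknown file type simply resolves to `undefined` so the caller can reject the upload.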

zubu007 · 2024-02-12T13:46:42Z

Hello. I have cloned the "chatPDF" branch to use the pdf upload feature. It works locally on my machine when I run npm run dev. However, when I do the same thing on my apache2 server, it shows an uploading PDF error. I am attaching the screenshot.

I have tried copying the exact .env.local file from my local directory to my server's directory but it still shows the same error. Should I open a new issue about it? What else can I provide to help you understand the error?

flexchar · 2024-02-12T17:27:43Z

Can you ideally skip apache2? Or check error logs from it to see if anything is being logged? Could be that some headers are lost or file being too big and blocked. What does your network devtools say on this upload response?

@zubu007

zubu007 · 2024-02-16T10:18:14Z

The error on the console shows a 403 Forbidden error, meaning it has something to do with authentication. I am using custom models with a separate endpoint rather than the default. However, the chat function with the custom model works as expected. When the pdf is being uploaded, it throws the 403. My question is: does the pdf fetch function use a different authentication?

Let me add further console logs in both my local machine and server to see the difference. I will let you know the results here.

windprak · 2024-02-19T10:04:36Z

I have the same problem as #693. It retrieves correctly when looking at the prompt and parameters, but the model answers as if it had gotten only the question.

zubu007 · 2024-03-04T10:43:02Z

Let me state a problem I am facing simply.
Using the chatPDF with the default HF model ("name": "mistralai/Mistral-7B-Instruct-v0.1"), it works. But I am trying to add a model to the .env.local file as an openai endpoint and use that model for the chatPDF. The chat function works with both models (default and ours), but the pdf-chat only works with the default one, with the same prompt, same pdf, same parameters.
I am adding pictures for better understanding.

For the HF default model, this was the prompt

And this was the response

For our own endpoint,

And this was the response

Is there something I am missing? It must be something simple, because when changing the model in the app, one works and the other doesn't. If you need more information about the error, let me know.

lmaosweqf1 · 2024-03-27T09:19:33Z

hey, whats the current status of the PR?

C-Loftus · 2024-04-05T20:45:58Z

Also very interested in this feature. Is there help needed for this? Wasn't sure the blocker or how to help.

mansur-abdirimov · 2024-06-08T16:13:57Z

@mishig25 Thanks for working on this! Very interested in this feature as well, thus curious if there are any updates on it?

ruizcrp · 2024-06-12T14:51:15Z

Hi @mishig25 and all, from my side I would also offer to help in my free time if I can, as this is quite an important feature. Reading the above, I think that an overview of the required steps is important, or someone who coordinates tasks. E.g. is something missing in PR #745? Or is a redesign of the entire RAG logic, including externalization of the API, required? Or is there maybe a meeting needed about some overarching architecture etc.?

antonkulaga · 2024-10-04T01:05:38Z

I am really confused, the current version of the chat doesn't seem to have file upload despite this PR being sent a long time ago. WHY???

johndpope · 2024-10-04T01:27:27Z

it's better to just have 1 employee solving the issue - especially when the company is valued at $4,500 million dollars...

bpawnzZ · 2024-10-21T17:59:23Z

why has this not been added?

mishig25 changed the title ~~Implemend PDF-chat feature~~ Implement PDF-chat feature Dec 18, 2023

mishig25 force-pushed the chatPDF branch from 50c0ddc to 73db81e Compare December 18, 2023 14:43

mishig25 marked this pull request as ready for review December 18, 2023 14:57

mishig25 requested review from nsarrazin and gary149 December 18, 2023 14:57

mishig25 mentioned this pull request Dec 18, 2023

Feature Request: Convert PDF to Markdown inside Chat UI #441

Closed

mikelfried mentioned this pull request Dec 19, 2023

Add embedding models configurable, from both transformers.js and TEI #646

Merged

gary149 reviewed Dec 20, 2023

nsarrazin reviewed Dec 21, 2023

nsarrazin added enhancement (New feature or request), front (This issue is related to the front-end of the app.), back (This issue is related to the Svelte backend or the DB) labels Dec 26, 2023

mishig25 force-pushed the chatPDF branch 2 times, most recently from ba5f9b1 to e9ffdab Compare January 9, 2024 11:12

mishig25 and others added 7 commits January 10, 2024 13:51

Implement PDF-chat feature

1785f3a

prettier

cf4eddf

correct usage of pdfjs-dist

b96bc8b

Updates from code reviews

9a0a1e6

Co-authored-by: Nathan Sarrazin <[email protected]>

fix file drag-n-drop

fd512dd

Better UI/UX feedback for uploaded pdf

9f29fa7

format

2ab33a2

Mishig and others added 2 commits January 12, 2024 11:14

Generlize RAG (#689)

8ffee0e

* Generlize RAG * wip * fix casting

fix typings

f8b2ec5

mishig25 force-pushed the chatPDF branch from ba8a2ef to f8b2ec5 Compare January 12, 2024 10:14

mishig25 changed the title ~~Implement PDF-chat feature~~ Generalize RAG + PDF Chat feature Jan 12, 2024

mishig25 mentioned this pull request Jan 15, 2024

Its not generating answer from uploaded pdf #693

Closed

zzen0008 reviewed Jan 20, 2024

mishig25 marked this pull request as draft January 22, 2024 08:51

zubu007 mentioned this pull request Feb 5, 2024

Where are the image and pdf upload features when running on locally using this repo? #774

Closed

This was referenced Jul 8, 2024

Add PDF to Markdown feature #442

Closed

Custom chatbot which includes sources such as pdf,databases and a specific website only. #471

Closed

Upload an Image file #217

Closed
