Skip to content
New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Add support for searching in PDFs #668

Merged
merged 12 commits into from
Jan 8, 2025
Merged

Add support for searching in PDFs #668

merged 12 commits into from
Jan 8, 2025

Conversation

alfredgrip
Copy link
Contributor

Adds support for searching in our documents. Both ones fetched from GitHub which are stored in Prisma, and those from MinIO (in the "public" bucket).

I chose to do a distinction between governing documents and meeting documents to enable filtering the results to separate these two, even though they share very much.

Currently: logic around MinIO is wrapped in fileHandler, which for some functions requires authorization from users. Since the requests to MinIO here isn't done by a user, a user is faked. This doesn't feel like a nice way to solve this. Maybe we could have a global "read only server" user, but that solution IMO doesn't belong in this PR.

@alfredgrip alfredgrip changed the title Search pdf Add support for searching in PDFs Jan 6, 2025
@alfredgrip
Copy link
Contributor Author

Closes #491

@alfredgrip alfredgrip linked an issue Jan 6, 2025 that may be closed by this pull request
@danieladugyan
Copy link
Member

Shouldn't this return at least one result - our "stadgar"? I tried running pdftotext manually on stadgar.pdf and the output text contains the word "upplösas"
image

@danieladugyan
Copy link
Member

Shouldn't this return at least one result - our "stadgar"? I tried running pdftotext manually on stadgar.pdf and the output text contains the word "upplösas".

Nevermind, I don't think stadgar was indexed on my machine. I'll look into it some more while reviewing.

@danieladugyan
Copy link
Member

This is not relevant for this PR, but in the future we'll be able to highlight specific parts of the PDF in Chrome which I think would be really cool.

@alfredgrip
Copy link
Contributor Author

Shouldn't this return at least one result - our "stadgar"? I tried running pdftotext manually on stadgar.pdf and the output text contains the word "upplösas".

Nevermind, I don't think stadgar was indexed on my machine. I'll look into it some more while reviewing.

A tip when debugging is to open meilisearch in the browser on localhost:7700 . Lets you easily see what is indexed

@danieladugyan
Copy link
Member

Shouldn't this return at least one result - our "stadgar"? I tried running pdftotext manually on stadgar.pdf and the output text contains the word "upplösas".

Nevermind, I don't think stadgar was indexed on my machine. I'll look into it some more while reviewing.

I figured out why stadgar and reglemente are not indexed - they are not stored in the database; instead, they're just hard-coded into the source code. I would probably ignore that for now since that's just stupid...we really need to store references to all of our documents in the database.

Copy link
Member

@danieladugyan danieladugyan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I made some suggestions, but I think they are mostly pretty straight-forward.

Looking forward to merging this, it's really cool that you got it working! 🌟

src/lib/components/search/DocumentSearchResult.svelte Outdated Show resolved Hide resolved
src/lib/components/search/SearchResultList.svelte Outdated Show resolved Hide resolved
src/lib/search/syncDocuments.ts Outdated Show resolved Hide resolved
src/lib/search/syncDocuments.ts Outdated Show resolved Hide resolved
src/lib/search/syncDocuments.ts Outdated Show resolved Hide resolved
Copy link
Member

@danieladugyan danieladugyan left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nice one! I fixed the types causing CI to fail (and the one in DocumentSearchResult) so let's merge 👍

@alfredgrip alfredgrip merged commit 4be88c8 into main Jan 8, 2025
3 checks passed
@alfredgrip alfredgrip deleted the search-pdf branch January 8, 2025 21:46
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

Search feature for meeting documents
2 participants