Search feature for meeting documents #491

alfredgrip · 2024-09-19T12:06:21Z

It would be fun and useful to have a fuzzy search feature for meeting documents. Sometimes you want to find a specific motion but can't remember which meeting it was brought up at. Since basically all our documents are LaTeX PDFs, there probably exists some tool that allows indexing and searching them.

alfredgrip · 2024-09-20T08:50:59Z

Idea so far: for all PDFs, generate text files with their content using Poppler which provides pdftotext. Connect PDFs with the text files and provide some search functionality for the text files, but return PDF results.

https://poppler.freedesktop.org/
https://www.npmjs.com/package/node-poppler
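
A minimal sketch of this idea in shell, assuming pdftotext from Poppler is installed. The function names index_pdfs and search_index and the directory layout (one text file per PDF, same base name) are hypothetical and only for illustration. Note that grep does a plain case-insensitive substring match, so this is not fuzzy search yet.

```shell
#!/bin/bash
# Hypothetical layout: every PDF in pdf_dir gets a text file in index_dir
# with the same base name, so a text hit can be mapped back to its PDF.

# Convert every PDF in pdf_dir to a text file in index_dir (requires pdftotext).
index_pdfs() {
    local pdf_dir="$1" index_dir="$2" pdf
    mkdir -p "$index_dir"
    for pdf in "$pdf_dir"/*.pdf; do
        [ -e "$pdf" ] || continue   # skip the unexpanded glob when there are no PDFs
        pdftotext "$pdf" "$index_dir/$(basename "$pdf" .pdf).txt"
    done
}

# Print the PDF file name for every indexed text file that contains the query.
search_index() {
    local index_dir="$1" query="$2" txt
    grep -ilF -- "$query" "$index_dir"/*.txt 2> /dev/null | while IFS= read -r txt; do
        echo "$(basename "$txt" .txt).pdf"
    done
}
```

Usage would be something like `index_pdfs ./documents ./index` once per upload batch, then `search_index ./index "motion"` per query.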

alfredgrip · 2024-09-20T08:54:50Z

@01ste02 I vaguely remember you had a script that checked how many times your name was mentioned in guild documents, how did that work?

01ste02 · 2024-09-21T06:18:41Z

I found an old message on Discord containing this script:

#!/bin/bash
# Usage: ./filename.sh "Name" NumberOfLatestBoardMeeting
# Both arguments are optional and may be given in either order.

PERSON="Axel Svensson"
LATEST_MEETING=26

if [ $# -eq 1 ]; then
    if [[ "$1" =~ ^-?[0-9]+$ ]]; then
        LATEST_MEETING=$1
    else
        PERSON=$1
    fi
elif [ $# -eq 2 ]; then
    if [[ "$1" =~ ^-?[0-9]+$ ]]; then
        LATEST_MEETING=$1
        PERSON=$2
    else
        PERSON=$1
        LATEST_MEETING=$2
    fi
fi

cd /tmp || exit 1

# Download each protocol (and its "_2" variant, if any) and dump the first page to text.
for fix in ".pdf" "_2.pdf"; do
    for i in $(seq -f "%02g" 1 "${LATEST_MEETING}"); do
        #echo "https://minio.api.dsek.se/documents/public/2023/S${i}/protokoll_S${i}_2023${fix}"
        wget -q "https://minio.api.dsek.se/documents/public/2023/S${i}/protokoll_S${i}_2023${fix}" 2> /dev/null
        pdftotext -l 1 "/tmp/protokoll_S${i}_2023${fix}" 2> /dev/null
        rm -f "/tmp/protokoll_S${i}_2023${fix}"
    done
done

# Only the text files remain in /tmp at this point, so the glob matches those.
echo "${PERSON} has attended approximately $(grep -lw -- "${PERSON}" /tmp/protokoll_S* 2> /dev/null | wc -l) board meetings"
echo "${PERSON} probably attended these meetings:"
for i in $(seq -f "%02g" 1 "${LATEST_MEETING}"); do
    # grep -q succeeds if any file for this meeting mentions the name.
    if grep -qw -- "${PERSON}" /tmp/protokoll_S${i}* 2> /dev/null; then
        echo "S${i}"
    fi
done
echo ""
echo "Which means that ${PERSON} probably did not attend:"
for i in $(seq -f "%02g" 1 "${LATEST_MEETING}"); do
    if ! grep -qw -- "${PERSON}" /tmp/protokoll_S${i}* 2> /dev/null; then
        echo "S${i}"
    fi
done
rm -f /tmp/protokoll_S*

In essence, it downloads all protocols from that year and dumps them to text using pdftotext. The plaintext is then searched. Beware that this script was built during a board meeting where I was extra bored and wanted to procrastinate, so the quality is not too great. :)

If you want to build a search for the website, an easy solution is to dump all PDFs we upload to text, search through those texts, and display the files containing the search string. This won't really get you "this page, this line, this column" unless you do some magic when you dump to plaintext. If you prepend the page and row from the PDF to each line of plaintext, you could probably display that along with the file in the results.
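
The page-and-row idea could be sketched like this. It relies on pdftotext ending each page with a form feed character, which then shows up at the start of the next page's first line. The function name annotate_pages and the "page:line:" prefix format are made up for illustration, and the line numbers refer to lines of the extracted text, which only approximate rows in the PDF.

```shell
#!/bin/bash
# Prefix every non-empty line of pdftotext output (read from stdin) with "page:line:".
annotate_pages() {
    awk '
        BEGIN { page = 1; line = 0 }
        {
            # A leading form feed means this line starts a new page.
            while (substr($0, 1, 1) == "\f") {
                page++
                line = 0
                $0 = substr($0, 2)
            }
            line++
            # Blank lines are counted but not printed.
            if ($0 != "") print page ":" line ":" $0
        }
    '
}
```

For example, `pdftotext protokoll_S01_2023.pdf - | annotate_pages > protokoll_S01_2023.txt` would produce a text file where every grep hit already carries its page and line.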

Since the PDFs are user-uploaded, we just need to make sure that each file is actually a PDF, so that we do not essentially execute arbitrary code that is uploaded as a .pdf.
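
A cheap first check, sketched here as an assumption rather than a full validation, is to look for the "%PDF-" signature at the start of the file before handing it to pdftotext. The function name is_pdf is hypothetical. This only rejects files that obviously are not PDFs; it does not prove the rest of the file is well formed.

```shell
#!/bin/bash
# Succeed only if the file exists and starts with the PDF signature "%PDF-".
is_pdf() {
    [ "$(head -c 5 "$1" 2> /dev/null)" = "%PDF-" ]
}
```

It could be used as a guard, for example `is_pdf "$upload" && pdftotext "$upload" "$upload.txt"`.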

So basically I had already built what you suggested here:

> Idea so far: for all PDFs, generate text files with their content using Poppler which provides pdftotext. Connect PDFs with the text files and provide some search functionality for the text files, but return PDF results.
>
> https://poppler.freedesktop.org/
> https://www.npmjs.com/package/node-poppler

alfredgrip · 2024-12-10T14:50:18Z

@danieladugyan mentioned that Apache Tika is a good alternative to this as well (https://tika.apache.org/).
I tried it out and got varying results. Dsek motions written with our new LaTeX templates worked very well, but when I tried it on our stadgar (bylaws) it didn't work well at all. Only the headings were returned, with no text from the different chapters.
The command I ran to try this was:
curl -X PUT --data-binary @stadgar.pdf http://localhost:9998/tika --header "Content-type: application/pdf"
It handled other documents, like protocols, pretty well.

Poppler still manages to extract all text from our stadgar and all documents that I have tried. It seems to be a sharper tool, but it's a shame that it doesn't provide a server.

github-project-automation bot added this to Web Sep 19, 2024

github-project-automation bot moved this to 🆕 New in Web Sep 19, 2024

alfredgrip added the enhancement New feature or request label Sep 19, 2024

danieladugyan moved this from 🆕 New to 🎯 Todo in Web Oct 24, 2024

danieladugyan moved this from 🎯 Todo to 🏗 In Progress in Web Oct 24, 2024

danieladugyan assigned alfredgrip Oct 24, 2024

alfredgrip mentioned this issue Dec 10, 2024

Add a pdf to text service #636

Merged

alfredgrip mentioned this issue Jan 6, 2025

Add support for searching in PDFs #668

Merged

alfredgrip linked a pull request Jan 6, 2025 that will close this issue

Add support for searching in PDFs #668

Merged

alfredgrip closed this as completed in #668 Jan 8, 2025

github-project-automation bot moved this from 🏗 In Progress to ✅ Done in Web Jan 8, 2025
