Search feature for meeting documents #491

Closed
alfredgrip opened this issue Sep 19, 2024 · 4 comments · Fixed by #668

@alfredgrip
Contributor

It would be fun and useful to have a fuzzy search feature for meeting documents. Sometimes you want to find a specific motion but can't remember which meeting it was brought up in. Since basically all our documents are LaTeX PDFs, there probably exists some tool that allows indexing and searching them.

@github-project-automation github-project-automation bot moved this to 🆕 New in Web Sep 19, 2024
@alfredgrip alfredgrip added the enhancement New feature or request label Sep 19, 2024
@alfredgrip
Contributor Author

Idea so far: for all PDFs, generate text files with their content using Poppler which provides pdftotext. Connect PDFs with the text files and provide some search functionality for the text files, but return PDF results.

https://poppler.freedesktop.org/
https://www.npmjs.com/package/node-poppler
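
A minimal sketch of that idea in shell, under some assumptions: the function names `index_pdfs` and `search_pdfs` are made up for illustration, each text file is stored next to its PDF with the same base name, and the search is a plain case-insensitive substring match rather than a real fuzzy search.

```shell
#!/bin/bash
# Sketch only: function names and the "text file next to PDF" layout are assumptions.

# Convert every PDF in a directory to a sibling .txt file using Poppler's pdftotext.
index_pdfs() {
    local dir=$1 pdf
    for pdf in "$dir"/*.pdf; do
        [ -e "$pdf" ] || continue           # the glob matched nothing
        pdftotext "$pdf" "${pdf%.pdf}.txt"  # e.g. motion.pdf -> motion.txt
    done
}

# Search the text files, but print the path of the corresponding PDF for each hit.
search_pdfs() {
    local dir=$1 query=$2 txt
    grep -liF -- "$query" "$dir"/*.txt 2> /dev/null | while IFS= read -r txt; do
        printf '%s\n' "${txt%.txt}.pdf"
    done
}
```

Running `index_pdfs ./documents` once and then `search_pdfs ./documents "some motion"` would list the PDFs whose extracted text contains the query. On the website the mapping would more likely live in a database than in file names.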

@alfredgrip
Contributor Author

@01ste02 I vaguely remember you had a script that checked how many times your name was mentioned in guild documents, how did that work?

@01ste02

01ste02 commented Sep 21, 2024

I found an old message on Discord containing this script:

#!/bin/bash
# Usage: ./filename.sh "Name" NumberOfLatestBoardMeeting
# The two arguments are optional and may be given in either order.
PERSON="Axel Svensson"
LATEST_MEETING=26

# A purely numeric argument is the meeting number; anything else is the name.
for arg in "$@"; do
    if [[ "$arg" =~ ^[0-9]+$ ]]; then
        LATEST_MEETING=$arg
    else
        PERSON=$arg
    fi
done

cd /tmp || exit 1

# Download each protocol (and its optional "_2" variant), convert the first
# page to text, then delete the PDF again.
for fix in ".pdf" "_2.pdf"; do
    for i in $(seq -f "%02g" 1 "${LATEST_MEETING}"); do
        file="protokoll_S${i}_2023${fix}"
        wget -q "https://minio.api.dsek.se/documents/public/2023/S${i}/${file}" 2> /dev/null
        pdftotext -l 1 "/tmp/${file}" 2> /dev/null
        rm -f "/tmp/${file}"
    done
done

# Count the text files that mention the person as a whole word.
echo "${PERSON} has attended approx. $(grep -lw "${PERSON}" /tmp/protokoll_S* 2> /dev/null | wc -l) board meetings"
echo "${PERSON} probably attended these meetings:"
for i in $(seq -f "%02g" 1 "${LATEST_MEETING}"); do
    # grep -q succeeds if any file for this meeting matches (including the "_2" variant).
    if grep -qw "${PERSON}" "/tmp/protokoll_S${i}"* 2> /dev/null; then
        echo "S${i}"
    fi
done
echo ""
echo "Which means ${PERSON} probably did not attend:"
for i in $(seq -f "%02g" 1 "${LATEST_MEETING}"); do
    if ! grep -qw "${PERSON}" "/tmp/protokoll_S${i}"* 2> /dev/null; then
        echo "S${i}"
    fi
done
rm -f /tmp/protokoll_S* 2> /dev/null

In essence, it downloads all protocols from that year and converts the first page of each to text using pdftotext (the -l 1 flag). The plain text is then searched. Beware that this script was built during a board meeting where I was extra bored and wanted to procrastinate, so the quality is not too great. :)

If you want to build a search for the website, an easy solution is to dump every PDF we upload to text, search through those texts, and display the files containing the search string. This won't give you "this page, this line, this column" unless you do some extra work when dumping to plain text. If you prepend the page and row from the PDF to each line of plain text, you could probably display that along with the file in the results.
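
As a rough sketch of that prefixing step: by default pdftotext separates pages with a form feed character, so a small awk filter can count those to track the page. The function name `number_lines` and the `page:line:text` output format are assumptions for illustration, not something the project uses.

```shell
#!/bin/bash
# Sketch only: number_lines and the "page:line:text" format are assumptions.

# Read pdftotext output on stdin and prefix every non-empty line with
# "page:line:", counting form feeds (\f) as page breaks.
number_lines() {
    awk 'BEGIN { page = 1; line = 0 }
    {
        # A form feed at the start of a line means a new page began here.
        while (sub(/^\f/, "")) { page++; line = 0 }
        line++
        if ($0 != "") print page ":" line ":" $0
    }'
}
```

It could be used as `pdftotext protokoll.pdf - | number_lines > protokoll.txt`, after which a grep hit in the text file already carries the page and line it came from. Column information would need more work.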

Since the pdfs are user-uploaded, we just need to make sure that the file is actually a pdf so that we do not essentially execute arbitrary code that is uploaded as a .pdf...
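
One cheap first check, sketched here under the assumption that a signature test is enough as a first gate: PDF files start with the bytes `%PDF-`, so anything without that header can be rejected before it reaches the text extraction. The helper name `is_pdf` is hypothetical, and this only inspects the header; it does not prove the rest of the file is a well-formed or harmless PDF.

```shell
#!/bin/bash
# Sketch only: is_pdf is a hypothetical helper that checks the file signature.

# Succeed if the file starts with the PDF magic bytes "%PDF-".
is_pdf() {
    [ "$(head -c 5 -- "$1" 2> /dev/null)" = "%PDF-" ]
}
```

For example, `is_pdf upload.pdf && pdftotext upload.pdf` would skip files that merely carry a .pdf extension.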

So basically I had built what you suggested in:

Idea so far: for all PDFs, generate text files with their content using Poppler which provides pdftotext. Connect PDFs with the text files and provide some search functionality for the text files, but return PDF results.

https://poppler.freedesktop.org/ https://www.npmjs.com/package/node-poppler

@danieladugyan danieladugyan moved this from 🆕 New to 🎯 Todo in Web Oct 24, 2024
@danieladugyan danieladugyan moved this from 🎯 Todo to 🏗 In Progress in Web Oct 24, 2024
@alfredgrip
Contributor Author

@danieladugyan mentioned that Apache Tika is a good alternative to this as well (https://tika.apache.org/).
I tried it out and got varying results. Dsek motions written with our new LaTeX templates worked very well, but for our stadgar (statutes) it didn't work well at all: only the headings were returned, with no text from the chapters.
The command I ran to try this was:
curl -X PUT --data-binary @stadgar.pdf http://localhost:9998/tika --header "Content-type: application/pdf"
It handled other documents, like protocols, pretty well.

Poppler still manages to extract all the text from our stadgar and from every other document I have tried. It seems to be the sharper tool, but it's a shame that it doesn't provide a server.

@alfredgrip alfredgrip linked a pull request Jan 6, 2025 that will close this issue
@github-project-automation github-project-automation bot moved this from 🏗 In Progress to ✅ Done in Web Jan 8, 2025

2 participants