Search feature for meeting documents #491
Comments
Idea so far: for all PDFs, generate text files with their content using Poppler (https://poppler.freedesktop.org/), which provides pdftotext.
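A minimal sketch of that idea, assuming pdftotext from Poppler is installed. The function name `dump_pdfs_to_text` and the directory layout are made up for illustration, and output files are named after the PDF's basename, so two PDFs with the same name in different folders would overwrite each other.

```bash
# Convert every PDF under src_dir into a .txt file in out_dir.
# dump_pdfs_to_text is a hypothetical helper name, not part of Poppler.
dump_pdfs_to_text() {
    local src_dir="$1" out_dir="$2"
    local pdf name
    mkdir -p "$out_dir"
    find "$src_dir" -type f -name '*.pdf' | while IFS= read -r pdf; do
        name=$(basename "$pdf" .pdf)
        # pdftotext takes the input PDF and an optional output text file.
        pdftotext "$pdf" "$out_dir/$name.txt" || echo "failed to convert: $pdf" >&2
    done
}
```

Usage would be something like `dump_pdfs_to_text ./documents ./text-dumps`, where both paths are placeholders.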
@01ste02 I vaguely remember you had a script that checked how many times your name was mentioned in guild documents, how did that work?
I found an old message on Discord containing this script:

#!/bin/bash
PERSON="Axel Svensson"
LATEST_MEETING=26
# Usage: ./filename.sh "Name" NumberOfLatestBoardMeeting
if [ $# -eq 1 ]
then
    if [[ "$1" =~ ^-?[0-9]+$ ]]; then
        LATEST_MEETING=$1
    else
        PERSON=$1
    fi
elif [ $# -eq 2 ]
then
    if [[ "$1" =~ ^-?[0-9]+$ ]]; then
        LATEST_MEETING=$1
        PERSON=$2
    else
        PERSON=$1
        LATEST_MEETING=$2
    fi
fi
cd /tmp || exit 1
for fix in ".pdf" "_2.pdf"; do
    for i in $(seq -f "%02g" 1 "${LATEST_MEETING}"); do
        #echo "https://minio.api.dsek.se/documents/public/2023/S${i}/protokoll_S${i}_2023${fix}"
        # Download the minutes, convert only the first page (-l 1) to text, then delete the PDF.
        wget -q "https://minio.api.dsek.se/documents/public/2023/S${i}/protokoll_S${i}_2023${fix}" 2> /dev/null
        pdftotext -l 1 "/tmp/protokoll_S${i}_2023${fix}" 2> /dev/null
        rm "/tmp/protokoll_S${i}_2023${fix}" 2> /dev/null
    done
done
# Counts matching text files, so a meeting with both a ".pdf" and a "_2.pdf" version counts twice.
echo "${PERSON} has attended about $(grep -lrnw "${PERSON}" /tmp/protokoll_S* 2> /dev/null | wc -l) board meetings"
echo "${PERSON} probably attended these meetings:"
for i in $(seq -f "%02g" 1 "${LATEST_MEETING}"); do
    # grep -q succeeds if any of the meeting's text files contains the name as a whole word.
    if grep -qw "${PERSON}" "/tmp/protokoll_S${i}"* 2> /dev/null; then
        echo "S${i}"
    fi
done
echo ""
echo "Which means that ${PERSON} probably did not attend:"
for i in $(seq -f "%02g" 1 "${LATEST_MEETING}"); do
    if ! grep -qw "${PERSON}" "/tmp/protokoll_S${i}"* 2> /dev/null; then
        echo "S${i}"
    fi
done
rm -f /tmp/protokoll_S* 2> /dev/null

In essence, it downloads all protocols (meeting minutes) from that year and dumps them to text using pdftotext. The plaintext is then searched through. Beware that this script was built during a board meeting where I was extra bored and wanted to procrastinate, so the quality is not too great... :)

If you want to build a search for the website, an easy solution is to dump all PDFs we upload to text, search through those texts, and display the files containing the search string. This won't really get you "this page, this line, this column" unless you do some magic when you dump to plaintext. If you prepend the page and row from the PDF to each line of plaintext, you could probably display that along with the file in the results.

Since the PDFs are user-uploaded, we just need to make sure that the file is actually a PDF, so that we do not essentially execute arbitrary code that is uploaded as a .pdf...

So basically I had built what you suggested in:
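The "prepend the page and row" idea could be sketched roughly like this. It relies on pdftotext separating pages with a form feed character, and the helper names `prefix_page_and_line` and `search_text_dump` are hypothetical. The line number here is the line within the extracted text of a page, which may not match the visual row in the PDF exactly.

```bash
# Read a pdftotext dump on stdin and print each line as "page:line:text".
prefix_page_and_line() {
    awk '
        BEGIN { page = 1; line = 0 }
        {
            # A leading form feed means this line starts a new page.
            while (sub(/^\f/, "")) { page++; line = 0 }
            line++
            printf "%d:%d:%s\n", page, line, $0
        }
    '
}

# Case-insensitive, fixed-string search in one dump, with page and line shown.
search_text_dump() {
    local query="$1" file="$2"
    prefix_page_and_line < "$file" | grep -i -F -- "$query"
}
```

Calling `search_text_dump "budget" protokoll_S01_2023.txt` (file name only as an example) would print matching lines with their page and line in front.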
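For the "make sure it is actually a PDF" part, one cheap first check is to look at the first bytes of the upload, since PDF files start with `%PDF-`. This is only a sketch with a made-up function name (`is_pdf`), and it only checks the header: a malformed or malicious file can still begin with those bytes, so it is a first filter and not a guarantee.

```bash
# Succeeds if the file exists and starts with the PDF header "%PDF-".
# is_pdf is a hypothetical helper name.
is_pdf() {
    local file="$1"
    # tr strips NUL bytes so the shell does not warn about them.
    [ -f "$file" ] && [ "$(head -c 5 "$file" | tr -d '\0')" = "%PDF-" ]
}
```

It could be used as `is_pdf upload.pdf && pdftotext upload.pdf`, where `upload.pdf` is a placeholder.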
@danieladugyan mentioned that Apache Tika is a good alternative to this as well (https://tika.apache.org/). Poppler still manages to extract all text from our statutes (stadgar) and all other documents that I have tried. Poppler seems to be the sharper tool, but it's a shame that it doesn't provide a server.
It would be fun and useful to have a fuzzy search feature for meeting documents. Sometimes you might want to find a specific motion but can't remember which meeting it was brought up at. Since basically all our documents are LaTeX PDFs, there probably exists some tool that allows for indexing and searching among them.
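As a rough starting point, here is a loose search over already extracted text files. It is not true fuzzy matching: it only lists files that contain every given word, case-insensitively and in any order. The name `loose_search` and the assumption that dumps live as `.txt` files in one directory are mine.

```bash
# List .txt files under dir that contain all of the given words
# (case-insensitive, any order). Not real fuzzy matching.
loose_search() {
    local dir="$1"
    shift
    local file word
    find "$dir" -type f -name '*.txt' | sort | while IFS= read -r file; do
        for word in "$@"; do
            # Skip to the next file as soon as one word is missing.
            grep -q -i -F -- "$word" "$file" || continue 2
        done
        printf '%s\n' "$file"
    done
}
```

For example, `loose_search ./text-dumps budget motion` would list the dumps that mention both words, which helps when you remember the topic but not the meeting.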