Thilo #4
Hi Thilo, thank you for these first steps. A couple remarks:
And suggestions:
It's very cool to see this in action! I love the HTML diff view, that's a great way to check the outputs.
Hi Vita, thank you very much for testing and for your feedback! I have now updated the code with two additional parsing models and some cosmetics (see Notes:
Next steps: Before starting another optimization iteration, I think it is time to integrate the parsing model into a full data processing pipeline, including a manual refinement step using the HTML diffs for each PDF. This will show whether the parsing model is already good enough or, if not, where we need to improve. What do you think?
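For illustration, the per-PDF HTML diff used for manual refinement could be produced roughly as follows. This is only a minimal sketch built on Python's standard-library `difflib.HtmlDiff`; the function name `render_html_diff`, the column labels, and the sample texts are hypothetical, and the diff code actually used in this PR may work differently.

```python
import difflib


def render_html_diff(reference: str, parsed: str) -> str:
    """Return a standalone HTML page comparing two texts side by side.

    `reference` stands for a manually corrected text and `parsed` for the
    parsing model's output; both names are assumptions for this sketch.
    """
    differ = difflib.HtmlDiff(wrapcolumn=80)
    return differ.make_file(
        reference.splitlines(),
        parsed.splitlines(),
        fromdesc="reference",
        todesc="parsed",
    )


if __name__ == "__main__":
    # Made-up sample texts, not taken from any real PDF.
    reference_text = "Art. 1 Zweck\nDieses Gesetz regelt die Grundlagen.\n"
    parsed_text = "Art. 1 Zweck\nDieses Gesetz regelt die Grundlage.\n"
    html = render_html_diff(reference_text, parsed_text)
    with open("diff.html", "w", encoding="utf-8") as handle:
        handle.write(html)
```

Opening the resulting `diff.html` in a browser shows changed lines highlighted, which is the kind of view a reviewer could use to correct each parsed document by hand.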
… & re-run experiments
More PDF examples for testing parsing
…arsing for 30 PDF files of different cantons
Hi Vita, I optimized the llama_parse_markdown_model.py pipeline and ran the structure parsing for 30 PDF files from different cantons.
I updated the README with the following information about the results of the 30 runs:
Parsing Models
Notes: Currently, only Model 2 (LlamaParse and manual parsing) is fully functional when running with the evaluation notebook:
Model 2 reads data from intermediate_results files:
sample-outputs files:
MLflow artifacts:
Hi @magnetilo, thank you very much for this update and testing all the sample documents. I'm in the process of comparing the results with the original PDFs. To summarise:
These things seem fixable to me, either by tweaking the prompt or in subsequent preprocessing. I haven't analysed all cantons yet but these are the notes I have so far:
I will continue with the remaining cantons.
@vitawasalreadytaken, thank you for the detailed reporting. I'm happy that the approach seems ok.
Hi Thilo, thank you for the commentary. I haven't tried modifying the code yet but I did finish analysing the outputs for the remaining cantons. For future reference, here are my notes. There aren't any terribly surprising issues that weren't already present in other cantons. So my conclusion is that this LlamaParse-based approach will work for all cantons – maybe with the exception of those that publish their drafts in the form of tables. Those will be the hardest, but I believe it's still solvable.
First iteration for the PDF structure export.
Intermediate Conclusion
Both extraction and evaluation are quite tricky.
So far, structure extraction with ChatGPT file upload (see file research/structure-extraction/scripts/chatgpt_file_upload.ipynb) is not stable. Every run gives different results, some really good, others not so good. The PDF from Kanton ZH could not be parsed at all so far.
Other extraction methods should be investigated, e.g.:
Notes:
demokratis-ml/tree/thilo/research/structure-extraction/scripts/mlruns.
Maybe this is not very nice because of the many files, but it is convenient, so that everybody can check the ML runs using: