Thilo #4 (Open)
magnetilo wants to merge 15 commits into main
Conversation

magnetilo

First iteration for the PDF structure export.

Intermediate Conclusion
Both extraction and evaluation are quite tricky.
So far, the structure extraction with ChatGPT file upload (see file
research/structure-extraction/scripts/chatgpt_file_upload.ipynb)
does not work reliably. Every run gives different results, some really good,
others not so good. The PDF from Kanton ZH could never be parsed so far.
Other extraction methods should be investigated.

Notes:

  • I added the MLflow runs to the git repository at demokratis-ml/tree/thilo/research/structure-extraction/scripts/mlruns.
    This may not be very nice because of the many files, but it is convenient: everybody can check the runs (see also the sketch after this list) using:
mlflow ui
  • I had some dependency issues when installing the venv from pyproject.toml. For the time being, I installed my packages from requirements.txt in a mamba environment.
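
For convenience, the committed runs can also be inspected programmatically. This is only a sketch and assumes the committed mlruns/ directory is read as a local file store:

```python
# Minimal sketch: read the committed mlruns/ directory as a local file store.
# Assumes the working directory is research/structure-extraction/scripts.
from mlflow.tracking import MlflowClient

client = MlflowClient(tracking_uri="file:./mlruns")
for experiment in client.search_experiments():
    for run in client.search_runs(experiment_ids=[experiment.experiment_id]):
        print(experiment.name, run.info.run_id, run.data.metrics)
```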

@vitawasalreadytaken
Contributor

Hi Thilo, thank you for these first steps. A couple remarks:

  1. This project uses uv for package management. I will mention this in the Readme. Maybe that's why the dependency management felt unfamiliar. To run the code from this PR, I installed your requirements with uv pip install -r requirements.txt – this is not the usual way uv is used but it worked fine for running this PR.
  2. With cd research/structure-extraction/scripts && uv run mlflow ui I was able to view your committed mlflow runs and metrics, but not any artifacts.
  3. When I wanted to run the notebook myself, I had to:
    • cd research/structure-extraction/scripts
    • mkdir tmp
    • rename/remove the committed mlruns/ directory, otherwise mlflow was throwing permission errors about /Users/thiloweber.
  4. I could then run the notebook and generate the artifacts into a fresh mlruns/ directory ✅

And suggestions:

  1. Would it be possible to place the model (chatgpt_file_upload_model.py) into a standard Python file? It's pretty much impossible to do code review on Jupyter notebooks in GitHub, so at least some of the code would be easier to access in the PR 🙂
  2. I see the prompt suggests a JSON structure that's maybe inspired by my first sample prompts. But for production we discussed the new JSON schema documented in "Design a simple JSON schema for structure extraction via LLMs" (#3). Let's use that one instead.
  3. Would it make sense to use the recently introduced strict mode in the API? https://openai.com/index/introducing-structured-outputs-in-the-api/
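
For illustration, here is a minimal sketch of what strict structured outputs could look like with the Chat Completions API. The model name and the schema below are placeholders, not the schema from #3:

```python
# Minimal sketch of OpenAI structured outputs with strict mode.
# The schema below is a placeholder for illustration, not the schema defined in #3.
from openai import OpenAI

client = OpenAI()
document_text = "..."  # placeholder: text extracted from a PDF

response = client.chat.completions.create(
    model="gpt-4o-2024-08-06",
    messages=[
        {"role": "system", "content": "Extract the document structure as JSON."},
        {"role": "user", "content": document_text},
    ],
    response_format={
        "type": "json_schema",
        "json_schema": {
            "name": "document_structure",
            "strict": True,
            "schema": {
                "type": "object",
                "properties": {
                    "nodes": {
                        "type": "array",
                        "items": {
                            "type": "object",
                            "properties": {
                                "type": {"type": "string"},
                                "content": {"type": "string"},
                            },
                            "required": ["type", "content"],
                            "additionalProperties": False,
                        },
                    },
                },
                "required": ["nodes"],
                "additionalProperties": False,
            },
        },
    },
)
structured_json = response.choices[0].message.content  # JSON string matching the schema
```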

It's very cool to see this in action! I love the HTML diff view, that's a great way to check the outputs.

@magnetilo
Author

Hi Vita, thank you very much for testing and for your feedback!

I have now updated the code with two additional parsing models and some cosmetics (see research/structure-extraction/README.md for details).

Notes:

  • Regarding your problems with MLflow: they should now be resolved. As described in research/structure-extraction/README.md, we should now start a local tracking server using cd research && mlflow ui for logging the experiments. This creates relative artifact paths. I hope you can now see all my experiments, including the artifacts (see the sketch after this list).
  • I moved each model into a separate Python file (e.g., chatgpt_file_upload_model.py), so you can view them on GitHub. The evaluation of the models is still done in a single Jupyter notebook: research/structure-extraction/scripts/run_model_evaluation.ipynb.
  • I adapted the models to output the desired JSON structure described in the schema. However, I think the schema still leaves a little room for different interpretations (e.g., ChatGPT outputs nested lists within one "list" node, with no additional "list" node for the second-level list...).
  • To date, structured outputs via a submitted JSON schema are not available with the OpenAI Assistants API in combination with file search. However, I used structured outputs in the third approach, which uses the OpenAI completions API.
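
For reference, a minimal sketch of how a model run could log to that local tracking server. The experiment name, metric value, and artifact path are illustrative examples only:

```python
# Minimal sketch: log to the local tracking server started with `cd research && mlflow ui`.
# The experiment name, metric value, and artifact path are examples, not the real ones.
import mlflow

mlflow.set_tracking_uri("http://127.0.0.1:5000")  # default address of `mlflow ui`
mlflow.set_experiment("structure-extraction")

with mlflow.start_run():
    mlflow.log_metric("avg_percnt_missing_chars", 0.02)
    mlflow.log_artifact("tmp/example_output.json")  # example path; stored relative to the server's artifact root
```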

Next steps:
As you can see from the README, there is still room for improvement. However, I think the second model (2 - LlamaParse and manual parsing) shows quite stable results.

So before starting another optimization iteration, I think it is now time to integrate the parsing model into a full data processing pipeline, including a manual refinement step using the HTML diffs for each PDF. This will show whether the parsing model is already good enough or, if not, where we need to improve.

What do you think?

@magnetilo
Author

Hi Vita,

I optimized the llama_parse_markdown_model.py pipeline and ran the structure parsing on 30 PDF files from different cantons.

  • Generally, the runs are quite fast (about 2 minutes) and the results look quite good.
  • I ran all PDFs twice, which used about 5000 LlamaParse credits. As I understand it, I have 7000 free credits per week and after that it costs $3 per 1000 credits, so everything should have been free so far...

I updated the README with the following information about the results of the 30 runs:

Parsing Models

Notes: Currently, only Model 2 (LlamaParse and manual parsing) is fully functional when running with the evaluation notebook:

scripts/run_model_evaluation.ipynb

Model 2 reads data from the sample-documents directory, saves intermediate results to intermediate_results, and writes outputs to sample-outputs, retaining the same file structure in all directories.

intermediate_results files:

  • ..._llamaparse.pkl: LlamaParse output in its document structure, saved for caching.
  • ..._llamaparse.md: LlamaParse markdown output (for analysis purposes).
  • ..._footnotes.md: Markdown after replacing footnotes in the text (for analysis purposes).

sample-outputs files:

  • ..._json_schema.json: Parsed file (output).
  • ..._pypdf2_diff.html: HTML diff file (for analysis purposes).

MLflow artifacts:

  • evaluation_all_files_....csv: CSV listing all files with avg_percnt_missing_chars, percnt_added_chars, and valid_schema for each run (useful for a quick overview of all runs; see the sketch below).
  • model files
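
As an illustration, a quick way to summarise such a CSV after downloading it from MLflow. The file name is a placeholder; the column names follow the description above:

```python
# Minimal sketch: summarise an evaluation CSV downloaded from MLflow.
# The file name is a placeholder; column names follow the description above.
import pandas as pd

df = pd.read_csv("evaluation_all_files_example.csv")  # placeholder file name
print(df[["avg_percnt_missing_chars", "percnt_added_chars"]].mean())
print(df["valid_schema"].value_counts())
```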

@vitawasalreadytaken
Contributor

Hi @magnetilo, thank you very much for this update and testing all the sample documents. I'm in the process of comparing the results with the original PDFs. To summarise:

  • The model generally works pretty well, even with very diverse documents 👍
  • The most common issues seem to be:
    • Inaccurate hierarchy of headings, sections, etc. This was also the problem with all other tools I've tried, and I guess we'll just have to fix the hierarchy in manual reviews.
    • Footnotes – the content sometimes gets lost.
    • Separating labels from their list items – sometimes this doesn't work for deeply nested lists.
    • Occasional English-language LLM output mixed into document content.

These things seem fixable to me, either by tweaking the prompt or in subsequent preprocessing.

I haven't analysed all cantons yet but these are the notes I have so far:

  • AG:
    • Multi-page table - a very difficult document. The first page is parsed very well, but then the table doesn't continue.
  • AI: Parsed very well overall.
    • The section hierarchy isn't nested perfectly.
    • There are unexpected "labels" for some headings.
  • BE: Good result overall.
    • Same issue with unexpected heading "labels" and imperfect nesting.
    • In doc#51795, footnotes are not converted from their markdown code ([^xxx]) into JSON nodes. (Will be easy to fix; see the sketch after these notes.)
    • On page 3 of doc#51795, footnote content is lost!
    • doc#51795 contains some LLM commentary: "Here is the document content formatted as markdown:"
  • CH (federal): Parsed very well.
    • doc#53136 contains some LLM description of an image: "The image appears to be a blank white page. There is no visible text, markings, or content of any kind that I can discern or describe. [...]"
  • FR: Very good.
    • List items under Art 22a/2 don't have labels separated in the JSON.
  • GE:
    • Heading labels such as "Chapitre I" or "Art. 1" are not separated in the JSON. Maybe because the prompt doesn't tell LlamaParse that the document is in French?
    • Chapitre II/Art. 9/1: list items don't have labels separated in the JSON. (Similarly as for FR. Maybe this happens for deeply nested lists?)
  • GR: Very good.
    • Note: extra formatting like bold type and strike-through is preserved in Markdown inside the JSON nodes. This is good because we will want to preserve this formatting somehow. Not sure yet if it's going to be Markdown or JSON or some other markup DSL.
  • JU: Generally good.
    • The section headings in the left margin are a bit confusing for the model
    • Some footnote contents are lost (similar to BE)
  • LU: Very good.
    • Similarly to some previous cases, nested list labels in §25a/3 and §28/1 are not separated in JSON.
  • NE: Very good.
    • Unusual document: the actual draft only starts on page 39 (line 4856 in the JSON). Despite this, the draft parsing doesn't seem affected and works comparably to other French-language documents.
  • NW: Pretty good.
    • In doc#51865, instead of footnote content we get this from the LLM: [^[Note: The footnote content is not provided in the image]]. This document is difficult because footnotes appear at the very end, not on the page where they're referenced.
    • The structure is modelled quite well despite the somewhat confusing formatting of the original PDF.
    • Footnote content is lost in doc#51870.
    • Doc#51876 has both footnote problems.
  • OW: Both documents are problematic because they are formatted as large tables.
    • A node type of "table" is invented here, which isn't allowed by our JSON schema yet, but it will probably make sense to add it.
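
On the footnote conversion mentioned for BE above: a minimal sketch of how the markdown footnote definitions could be detected. The "footnote" node shape here is hypothetical, not taken from the schema in #3:

```python
# Minimal sketch: detect markdown footnote definitions such as "[^1]: footnote text".
# The "footnote" node shape below is hypothetical, not the schema from #3.
import re

DEFINITION = re.compile(r"^\[\^([^\]]+)\]:\s*(.+)$", re.MULTILINE)

def extract_footnotes(markdown: str) -> list[dict]:
    """Return one hypothetical footnote node per definition found in the markdown."""
    return [
        {"type": "footnote", "label": label, "content": content.strip()}
        for label, content in DEFINITION.findall(markdown)
    ]

assert extract_footnotes("Text[^1]\n\n[^1]: Eine Fussnote") == [
    {"type": "footnote", "label": "1", "content": "Eine Fussnote"}
]
```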

I will continue with the remaining cantons.

@magnetilo
Author

@vitawasalreadytaken, thank you for the detailed reporting. I'm happy that the approach seems okay.
Here are a few answers to your points:

  • Language: yes, I'm currently submitting "de" as the language everywhere. Submitting "fr" for French documents, or dropping the language parameter entirely, might help with the minor problems in French documents.
  • List items with labels like "Art 22a/2", "§25a/3", ...: this might be solved relatively easily in the regular expression that parses markdown to JSON (see the sketch after this list). There is a test for this function, so these examples could be added to the tests.
  • Other list parsing problems: other problems with parsing the correct list structures might be more difficult to correct; the custom markdown-to-JSON parsing quickly gets complex here.
  • Footnotes at the very end of a document: I could check whether this case can be handled easily in the "parsing footnotes" function.
  • English commentary: I haven't managed to stop LlamaParse from making such comments, even though I explicitly tell it in the prompt not to.
  • "table" label for the JSON structure: yes, I added this label explicitly in order to parse small tables in a document into markdown-formatted tables in separate JSON elements.
  • Other formatting like strikethrough, bold, ...: I explicitly tell the model in the prompt to preserve this formatting; otherwise, strikethrough text sometimes gets lost. I quite like this formatting.
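
On splitting the list-item labels, a minimal sketch of the kind of pattern that could do this. It is purely illustrative and not the actual regular expression used in the markdown-to-JSON parser:

```python
# Minimal sketch: split a leading label such as "Art. 22a", "§ 25a", "1." or "a)"
# from a list item. Illustrative only; not the actual regex from the parser.
import re

LABEL = re.compile(r"^\s*((?:Art\.?|§)\s*\d+\w*|\d+\.|[a-z]\))\s+(.*)$")

def split_label(item: str) -> tuple[str | None, str]:
    """Return (label, text); label is None when no label is recognised."""
    match = LABEL.match(item)
    if match:
        return match.group(1), match.group(2)
    return None, item.strip()

assert split_label("Art. 22a Geltungsbereich") == ("Art. 22a", "Geltungsbereich")
assert split_label("a) erste Variante") == ("a)", "erste Variante")
```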

@vitawasalreadytaken
Contributor

Hi Thilo, thank you for the commentary. I haven't tried modifying the code yet but I did finish analysing the outputs for the remaining cantons. For future reference, here are my notes. There aren't any terribly surprising issues that weren't already present in other cantons. So my conclusion is that this LlamaParse-based approach will work for all cantons – maybe with the exception of those that publish their drafts in the form of tables. Those will be the hardest, but I believe it's still solvable.

  • SG: Very good
    • Only the usual imperfections with splitting nested lists.
  • SH: Pretty good
    • Again, "Here is the document content formatted as markdown:" is inserted into the document 🙂
    • Usual issues with splitting list item labels
    • There is a minor inaccuracy with some list items: in the document the labels are just "1", "2", etc. but in JSON they are "1.", "2." etc. (periods are added).
  • SO:
    • The most complex document: there is the report first, and then multiple (!) drafts, each followed by its own synoptic table. This begins on page 14 (line 1439 in the JSON).
    • Parsing is okay, though with all the issues seen elsewhere (missing footnote content, inaccurate lists, LLM messages in English etc.)
    • The hardest problem with a document like this would be selecting only the relevant parts to extract. Then the parsing issues will be fixable.
  • SZ: Pretty good
    • Footnote content is missing; in the original document the footnotes are at the very end
    • Page headers and footers (page numbers) are not removed from this document
    • Otherwise, standard issues
  • TI: Very good, even though it's in Italian
    • Usual issues, mainly with lists
    • There are no footnotes in this particular document
  • UR:
    • Similar to AG, this is a multi-page table. Only the first page is parsed as a table, the rest of the document isn't very comprehensible once parsed.
    • It seems they use red colour to indicate newly inserted text. This information is lost in the extraction. Side note: conveying meaning through colour like this probably isn't accessible to visually impaired people either.
  • ZG: Very good
    • Usual minor issues, as observed in other documents
  • ZH:
    • Very complex document again: a report followed by the draft in the form of a synoptic table
    • The table structure falls apart and isn't comprehensible after parsing
