A Comprehensive Benchmark for Document Parsing and Evaluation

jakep-allenai/OmniDocBench

 
 


OmniDocBench

OmniDocBench is a benchmark for evaluating diverse document parsing in real-world scenarios, featuring the following characteristics:

  • Diverse Document Types: This benchmark includes 981 PDF pages, covering 9 document types, 4 layout types, and 3 language types. It encompasses a wide range of content, including academic papers, financial reports, newspapers, textbooks, and handwritten notes.
  • Rich Annotation Information: It contains localization information for 15 block-level (such as text paragraphs, headings, tables, etc., totaling over 20k) and 4 span-level (such as text lines, inline formulas, subscripts, etc., totaling over 80k) document elements. Each element's region includes recognition results (text annotations, LaTeX annotations for formulas, and both LaTeX and HTML annotations for tables). OmniDocBench also provides annotations for the reading order of document components. Additionally, it includes various attribute tags at the page and block levels, with annotations for 5 page attribute tags, 3 text attribute tags, and 6 table attribute tags.
  • High Annotation Quality: Annotations were produced through manual screening, intelligent annotation, and manual annotation, followed by comprehensive quality checks by experts and large models.
  • Supporting Evaluation Code: It includes end-to-end and single-module evaluation code to ensure fairness and accuracy in assessments.

OmniDocBench is designed for Document Parsing, featuring rich annotations for evaluation across several dimensions:

  • End-to-end evaluation
  • Layout detection
  • Table recognition
  • Formula recognition
  • Text OCR

Currently supported metrics include:

  • Normalized Edit Distance
  • BLEU
  • METEOR
  • TEDS
  • COCODet (mAP, mAR, etc.)
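
As an illustration of the first metric, here is a minimal sketch of a normalized edit distance. It assumes the Levenshtein distance divided by the length of the longer string; the normalization used by the evaluation code in this repository may differ, and the function names are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance counting insertions, deletions, and substitutions."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_edit_distance(pred: str, gt: str) -> float:
    """Assumption: normalize by the longer string, so the result lies in [0, 1]."""
    longest = max(len(pred), len(gt))
    if longest == 0:
        return 0.0  # two empty strings are identical
    return levenshtein(pred, gt) / longest
```

Lower is better: 0.0 means the prediction equals the ground truth, and 1.0 means nothing matches.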

Benchmark Introduction

This benchmark includes 981 PDF pages, covering 9 document types, 4 layout types, and 3 language types. OmniDocBench features rich annotations, containing 15 block-level annotations (text paragraphs, headings, tables, etc.) and 4 span-level annotations (text lines, inline formulas, subscripts, etc.). All text-related annotation boxes include text recognition annotations, formulas contain LaTeX annotations, and tables include both LaTeX and HTML annotations. OmniDocBench also provides reading order annotations for document components. Additionally, it includes various attribute tags at the page and block levels, with annotations for 5 page attribute tags, 3 text attribute tags, and 6 table attribute tags.

Dataset Format

The dataset format is JSON, with the following structure and field explanations:

[{
    "layout_dets": [    // List of page elements
        {
            "category_type": "text_block",  // Category name
            "poly": [
                136.0, // Position information, coordinates for top-left, top-right, bottom-right, bottom-left corners (x,y)
                781.0,
                340.0,
                781.0,
                340.0,
                806.0,
                136.0,
                806.0
            ],
            "ignore": false,        // Whether to ignore during evaluation
            "order": 0,             // Reading order
            "anno_id": 0,           // Special annotation ID, unique for each layout box
            "text": "xxx",          // Optional field, Text OCR results are written here
            "latex": "$xxx$",       // Optional field, LaTeX for formulas and tables is written here
            "html": "xxx",          // Optional field, HTML for tables is written here
            "attribute": {"xxx": "xxx"},         // Classification attributes for layout, detailed below
            "line_with_spans": [   // Span level annotation boxes
                {
                    "category_type": "text_span",
                    "poly": [...],
                    "ignore": false,
                    "text": "xxx",   
                    "latex": "$xxx$",
                 },
                 ...
            ],
            "merge_list": [    // Only present in annotation boxes with merge relationships, merge logic depends on whether single line break separated paragraphs exist, like list types
                {
                    "category_type": "text_block", 
                    "poly": [...],
                    ...   // Same fields as block level annotations
                    "line_with_spans": [...]
                    ...
                 },
                 ...
            ]
        },
        ...
    ],
    "page_info": {         
        "page_no": 0,            // Page number
        "height": 1684,          // Page height
        "width": 1200,           // Page width
        "image_path": "xx/xx/",  // Annotated page filename
        "page_attribute": {"xxx": "xxx"}     // Page attribute labels
    },
    "extra": {
        "relation": [ // Related annotations
            {  
                "source_anno_id": 1,
                "target_anno_id": 2, 
                "relation": "parent_son"  // Relationship label between figure/table and their corresponding caption/footnote categories
            },
            {  
                "source_anno_id": 5,
                "target_anno_id": 6,
                "relation_type": "truncated"  // Paragraph truncation relationship label due to layout reasons, will be concatenated and evaluated as one paragraph during evaluation
            },
        ]
    }
},
...
]
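
The structure above can be read with the standard json module. The following is a minimal sketch, not code from this repository: the helper names are hypothetical and the sample page is hand-written for illustration. Note that the // comments in the listing are explanations for the reader; a standard JSON parser rejects them.

```python
import json


def poly_to_bbox(poly):
    """Convert an 8-value polygon [x1, y1, ..., x4, y4] to [left, top, right, bottom]."""
    xs, ys = poly[0::2], poly[1::2]
    return [min(xs), min(ys), max(xs), max(ys)]


def blocks_in_reading_order(page):
    """Return the block-level elements that are not ignored, sorted by `order`."""
    blocks = [b for b in page["layout_dets"] if not b.get("ignore", False)]
    return sorted(blocks, key=lambda b: b.get("order", 0))


# Hand-written sample in the documented format (values are illustrative only).
sample_json = """
[{
  "layout_dets": [
    {"category_type": "text_block",
     "poly": [136.0, 781.0, 340.0, 781.0, 340.0, 806.0, 136.0, 806.0],
     "ignore": false, "order": 1, "anno_id": 0, "text": "second"},
    {"category_type": "title",
     "poly": [136.0, 700.0, 340.0, 700.0, 340.0, 730.0, 136.0, 730.0],
     "ignore": false, "order": 0, "anno_id": 1, "text": "first"},
    {"category_type": "abandon",
     "poly": [0.0, 0.0, 10.0, 0.0, 10.0, 10.0, 0.0, 10.0],
     "ignore": true, "order": 2, "anno_id": 2, "text": "skipped"}
  ],
  "page_info": {"page_no": 0, "height": 1684, "width": 1200,
                "image_path": "page_0.jpg", "page_attribute": {}},
  "extra": {"relation": []}
}]
"""

pages = json.loads(sample_json)
ordered = blocks_in_reading_order(pages[0])
texts = [b["text"] for b in ordered]          # text of each block, in reading order
bbox = poly_to_bbox(ordered[1]["poly"])       # axis-aligned box of the second block
```

Taking the minimum and maximum over all four corners keeps the conversion correct even if a polygon's corners are not listed in the documented order.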
Evaluation Categories

Evaluation categories include:

# Block level annotation boxes
'title'               # Title
'text_block'          # Paragraph level plain text
'figure'              # Figure type
'figure_caption'      # Figure description/title
'figure_footnote'     # Figure notes
'table'               # Table body
'table_caption'       # Table description/title
'table_footnote'      # Table notes
'equation_isolated'   # Display formula
'equation_caption'    # Formula number
'header'              # Header
'footer'              # Footer
'page_number'         # Page number
'page_footnote'       # Page notes
'abandon'             # Other discarded content (e.g. irrelevant information in middle of page)
'code_txt'            # Code block
'code_txt_caption'    # Code block description
'reference'           # References

# Span level annotation boxes
'text_span'           # Span level plain text
'equation_ignore'     # Formula to be ignored
'equation_inline'     # Inline formula
'footnote_mark'       # Document superscripts/subscripts

Attribute Labels

Page classification attributes include:

'data_source': #PDF type classification
    academic_literature  # Academic literature
    PPT2PDF # PPT to PDF
    book # Black and white books and textbooks
    colorful_textbook # Colorful textbooks with images
    exam_paper # Exam papers
    note # Handwritten notes
    magazine # Magazines
    research_report # Research reports and financial reports
    newspaper # Newspapers

'language': #Language type
    en # English
    simplified_chinese # Simplified Chinese
    en_ch_mixed # English-Chinese mixed

'layout': #Page layout type
    single_column # Single column
    double_column # Double column
    three_column # Three column
    1andmore_column # One mixed with multiple columns, common in literature
    other_layout # Other layouts

'watermark': # Whether contains watermark
    true  
    false

'fuzzy_scan': # Whether blurry scanned
    true  
    false

'colorful_backgroud': # Whether contains colorful background, content to be recognized has more than two background colors
    true  
    false

Block level attribute - Table related attributes:

'table_layout': # Table orientation
    vertical # Vertical table
    horizontal # Horizontal table

'with_span': # Merged cells
    False
    True

'line': # Table borders
    full_line # Full borders
    less_line # Partial borders
    fewer_line # Three-line borders
    wireless_line # No borders

'language': # Table language
    table_en # English table
    table_simplified_chinese # Simplified Chinese table
    table_en_ch_mixed # English-Chinese mixed table

'include_equation': # Whether table contains formulas
    False
    True

'include_backgroud': # Whether table contains background color
    False
    True

'table_vertical': # Whether table is rotated 90 or 270 degrees
    False
    True

Block level attribute - Text paragraph related attributes:

'text_language': # Text language
    text_en  # English
    text_simplified_chinese # Simplified Chinese
    text_en_ch_mixed  # English-Chinese mixed

'text_background':  # Text background color
    white # Default value, white background
    single_colored # Single background color other than white
    multi_colored  # Multiple background colors

'text_rotate': # Text rotation classification within paragraphs
    normal # Default value, horizontal text, no rotation
    rotate90  # Rotation angle, 90 degrees clockwise
    rotate180 # 180 degrees clockwise
    rotate270 # 270 degrees clockwise
    horizontal # Text is normal but layout is vertical

Block level attribute - Formula related attributes:

'formula_type': # Formula type
    print  # Print
    handwriting # Handwriting

Evaluation

OmniDocBench has developed an evaluation methodology based on document component segmentation and matching. It provides corresponding metric calculations for four major modules: text, tables, formulas, and reading order. In addition to overall accuracy results, the evaluation also provides fine-grained evaluation results by page and attributes, precisely identifying pain points in model document parsing.
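
To make the idea of attribute-level results concrete, the sketch below averages per-page scores within each value of one page attribute. It is an illustration under assumptions, not the repository's aggregation code: the function name, the input dictionaries, and the scores are all hypothetical.

```python
from collections import defaultdict


def mean_score_by_attribute(page_attributes, page_scores, attribute):
    """Average the per-page scores separately for each value of `attribute`."""
    totals = defaultdict(lambda: [0.0, 0])  # value -> [sum of scores, page count]
    for page_id, score in page_scores.items():
        value = page_attributes[page_id].get(attribute, "unknown")
        totals[value][0] += score
        totals[value][1] += 1
    return {value: total / count for value, (total, count) in totals.items()}


# Illustrative inputs: page attribute labels and made-up edit distances per page.
page_attributes = {
    "p1.jpg": {"language": "en"},
    "p2.jpg": {"language": "en"},
    "p3.jpg": {"language": "simplified_chinese"},
}
page_scores = {"p1.jpg": 0.10, "p2.jpg": 0.30, "p3.jpg": 0.40}

by_language = mean_score_by_attribute(page_attributes, page_scores, "language")
```

Grouping in this way shows whether a model's errors concentrate on one language, layout, or scan quality rather than being spread evenly.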

Environment Setup and Running

To set up the environment, simply run the following commands in the project directory:

conda create -n omnidocbench python=3.8
conda activate omnidocbench
pip install -r requirements.txt

All evaluation inputs are configured through config files. We provide templates for each task under the configs directory, and we will explain the contents of the config files in detail in the following sections.

After configuring the config file, simply pass it as a parameter and run the following code to perform the evaluation:

python pdf_validation.py --config <config_path>

End-to-End Evaluation

End-to-end evaluation assesses the model's accuracy in parsing PDF page content. The evaluation uses the model's Markdown output of the entire PDF page parsing results as the prediction.

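Comprehensive evaluation of document parsing algorithms on OmniDocBench: performance metrics for text, formula, table, and reading order extraction, with overall scores derived from ground truth comparisons.

| Method Type | Methods | Text Edit↓ (EN) | Text Edit↓ (ZH) | Formula Edit↓ (EN) | Formula Edit↓ (ZH) | Formula CDM↑ (EN) | Formula CDM↑ (ZH) | Table TEDS↑ (EN) | Table TEDS↑ (ZH) | Table Edit↓ (EN) | Table Edit↓ (ZH) | Read Order Edit↓ (EN) | Read Order Edit↓ (ZH) | Overall Edit↓ (EN) | Overall Edit↓ (ZH) |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| Pipeline Tools | MinerU-0.9.3 | 0.058 | 0.211 | 0.278 | 0.577 | 66.9 | 49.5 | 79.4 | 62.7 | 0.305 | 0.461 | 0.079 | 0.288 | 0.180 | 0.384 |
| Pipeline Tools | Marker-0.2.17 | 0.141 | 0.303 | 0.667 | 0.868 | 18.4 | 12.7 | 54.0 | 45.8 | 0.718 | 0.763 | 0.138 | 0.306 | 0.416 | 0.560 |
| Pipeline Tools | Mathpix | 0.101 | 0.358 | 0.306 | 0.454 | 71.4 | 72.7 | 77.9 | 68.2 | 0.322 | 0.416 | 0.105 | 0.275 | 0.209 | 0.376 |
| Expert VLMs | GOT-OCR | 0.187 | 0.315 | 0.360 | 0.528 | 81.8 | 51.4 | 53.5 | 48.0 | 0.521 | 0.594 | 0.141 | 0.28 | 0.302 | 0.429 |
| Expert VLMs | Nougat | 0.365 | 0.998 | 0.488 | 0.941 | 17.4 | 16.9 | 40.3 | 0.0 | 0.622 | 1.000 | 0.382 | 0.954 | 0.464 | 0.973 |
| General VLMs | GPT4o | 0.144 | 0.409 | 0.425 | 0.606 | 76.4 | 48.2 | 72.75 | 63.7 | 0.363 | 0.474 | 0.128 | 0.251 | 0.265 | 0.435 |
| General VLMs | Qwen2-VL-72B | 0.252 | 0.251 | 0.468 | 0.572 | 54.9 | 60.9 | 59.9 | 66.8 | 0.591 | 0.587 | 0.255 | 0.223 | 0.392 | 0.408 |
| General VLMs | InternVL2-Llama3-76B | 0.353 | 0.290 | 0.543 | 0.701 | 69.8 | 49.6 | 63.8 | 61.1 | 0.616 | 0.638 | 0.317 | 0.228 | 0.457 | 0.464 |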

More detailed attribute-level evaluation results are shown in the paper.

End-to-End Evaluation Method - end2end

End-to-end evaluation consists of two approaches:

  • end2end: This method uses OmniDocBench's JSON files as Ground Truth. For config file reference, see: end2end
  • md2md: This method uses OmniDocBench's markdown format as Ground Truth. Details will be discussed in the next section markdown-to-markdown evaluation.

We recommend using the end2end evaluation approach since it preserves the category and attribute information of samples, enabling special category ignore operations and attribute-level result output.

The end2end evaluation can assess the four dimensions listed below. An example of end2end evaluation results is provided in result.

  • Text paragraphs
  • Display formulas
  • Tables
  • Reading order

Field explanations for end2end.yaml

The configuration of end2end.yaml is as follows:

end2end_eval:          # Specify task name, common for end-to-end evaluation
  metrics:             # Configure metrics to use
    text_block:        # Configuration for text paragraphs
      metric:
        - Edit_dist    # Normalized Edit Distance
        - BLEU         
        - METEOR
    display_formula:   # Configuration for display formulas
      metric:
        - Edit_dist
        - CDM          # Only supports exporting format required for CDM evaluation, stored in results
    table:             # Configuration for tables
      metric:
        - TEDS
        - Edit_dist
    reading_order:     # Configuration for reading order
      metric:
        - Edit_dist
  dataset:                                       # Dataset configuration
    dataset_name: end2end_dataset                # Dataset name, no need to modify
    ground_truth:
      data_path: ./demo_data/omnidocbench_demo/OmniDocBench_demo.json  # Path to OmniDocBench
    prediction:
      data_path: ./demo_data/end2end            # Folder path for model's PDF page parsing markdown results
    match_method: quick_match                    # Matching method, options: no_split/simple_match/quick_match
    filter:                                      # Page-level filtering
      language: english                          # Page attributes and corresponding tags to evaluate

The data_path under prediction is the folder path containing the model's PDF page parsing results. The folder contains markdown files for each page, with filenames matching the image names but replacing the .jpg extension with .md.
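
The naming rule above can be checked before running an evaluation. The sketch below is a hypothetical helper, not part of this repository: it derives the expected markdown filename from each page's image_path and reports pages that have no prediction file. It swaps whatever image extension is present for .md, which covers the .jpg case described above.

```python
import os


def expected_markdown_name(image_path):
    """Map an image path such as 'xx/page_001.jpg' to 'page_001.md'."""
    stem, _extension = os.path.splitext(os.path.basename(image_path))
    return stem + ".md"


def missing_predictions(pages, prediction_files):
    """List the expected markdown filenames that are absent from prediction_files."""
    available = set(prediction_files)
    expected = (expected_markdown_name(p["page_info"]["image_path"]) for p in pages)
    return [name for name in expected if name not in available]


# Illustrative pages; only the fields used by the check are filled in.
pages = [
    {"page_info": {"image_path": "images/page_001.jpg"}},
    {"page_info": {"image_path": "images/page_002.jpg"}},
]
missing = missing_predictions(pages, ["page_001.md"])
```

In practice the second argument would be the directory listing of the prediction folder, for example os.listdir('./demo_data/end2end').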

In addition to the supported metrics, the system also supports exporting formats required for CDM evaluation. Simply configure the CDM field in the metrics section to format the output for CDM input and store it in result.

For end-to-end evaluation, the config allows selecting different matching methods. There are three matching approaches:

  • no_split: Does not split or match text blocks, but rather combines them into a single markdown for calculation. This method will not output attribute-level results or reading order results.
  • simple_match: Performs only paragraph segmentation using double line breaks, then directly matches one-to-one with GT without any truncation or merging.
  • quick_match: Builds on paragraph segmentation by adding truncation and merging operations to reduce the impact of paragraph segmentation differences on final results, using Adjacency Search Match for truncation and merging.

We recommend using quick_match for better matching results. However, if the model's paragraph segmentation is accurate, simple_match can be used for faster evaluation. The matching method is configured through the match_method field under dataset in the config.
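
The segmentation step of simple_match can be sketched as follows. Splitting on double line breaks comes from the description above; the pairing step is an assumption made for illustration, using greedy similarity from the standard difflib module. It is not the repository's matching code and does not reproduce the Adjacency Search Match used by quick_match.

```python
import re
from difflib import SequenceMatcher


def split_paragraphs(markdown):
    """Split markdown into paragraphs on blank lines (double line breaks)."""
    parts = re.split(r"\n\s*\n", markdown.strip())
    return [part.strip() for part in parts if part.strip()]


def greedy_match(gt_paragraphs, pred_paragraphs):
    """Pair each ground-truth paragraph with its most similar unused prediction.

    Assumption: greedy one-to-one pairing by difflib similarity ratio.
    Ground-truth paragraphs left without a candidate are paired with None.
    """
    unused = list(range(len(pred_paragraphs)))
    pairs = []
    for gt in gt_paragraphs:
        if not unused:
            pairs.append((gt, None))
            continue
        best = max(unused,
                   key=lambda i: SequenceMatcher(None, gt, pred_paragraphs[i]).ratio())
        unused.remove(best)
        pairs.append((gt, pred_paragraphs[best]))
    return pairs


paragraphs = split_paragraphs("A title\n\nFirst paragraph.\n\n\nSecond paragraph.\n")
pairs = greedy_match(["hello world", "foo bar"], ["foo bar!", "hello world"])
```

A one-to-one pairing like this is sensitive to segmentation differences: if the model merges two ground-truth paragraphs into one, one of them is left without a good partner, which is the case quick_match is described as handling through truncation and merging.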

The filter field restricts evaluation to a subset of the dataset. For example, setting language: english under filter evaluates only pages in English. See the Attribute Labels section for the other page attributes. Comment out the filter fields to evaluate the full dataset.
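
Conceptually, page-level filtering keeps the pages whose page_attribute matches every requested tag. The sketch below illustrates that idea with a hypothetical helper; it is not the repository's implementation, and the tag values are illustrative, so use the strings that actually appear in your copy of the JSON.

```python
def filter_pages(pages, **wanted):
    """Keep pages whose page_attribute equals every requested key/value pair."""
    selected = []
    for page in pages:
        attributes = page["page_info"].get("page_attribute", {})
        if all(attributes.get(key) == value for key, value in wanted.items()):
            selected.append(page)
    return selected


# Illustrative pages; only the fields used by the filter are filled in.
pages = [
    {"page_info": {"page_no": 0,
                   "page_attribute": {"language": "en", "layout": "single_column"}}},
    {"page_info": {"page_no": 1,
                   "page_attribute": {"language": "simplified_chinese", "layout": "double_column"}}},
    {"page_info": {"page_no": 2,
                   "page_attribute": {"language": "en", "layout": "double_column"}}},
]

english_pages = filter_pages(pages, language="en")
english_double = filter_pages(pages, language="en", layout="double_column")
```

Calling filter_pages with no keyword arguments returns every page, which corresponds to commenting out the filter fields.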

End-to-end Evaluation Method - md2md

The markdown-to-markdown evaluation uses the model's markdown output of the entire PDF page parsing as the Prediction, and OmniDocBench's markdown format as the Ground Truth. Please refer to the config file: md2md. We recommend using the end2end approach from the previous section to evaluate with OmniDocBench, as it preserves rich attribute annotations and ignore logic. However, we still provide the md2md evaluation method to align with existing evaluation approaches.

The md2md evaluation can assess four dimensions:

  • Text paragraphs
  • Display formulas
  • Tables
  • Reading order

Field explanations for md2md.yaml

The configuration of md2md.yaml is as follows:

end2end_eval:          # Specify task name, common for end-to-end evaluation
  metrics:             # Configure metrics to use
    text_block:        # Configuration for text paragraphs
      metric:
        - Edit_dist    # Normalized Edit Distance
        - BLEU         
        - METEOR
    display_formula:   # Configuration for display formulas
      metric:
        - Edit_dist
        - CDM          # Only supports exporting format required for CDM evaluation, stored in results
    table:             # Configuration for tables
      metric:
        - TEDS
        - Edit_dist
    reading_order:     # Configuration for reading order
      metric:
        - Edit_dist
  dataset:                                               # Dataset configuration
    dataset_name: md2md_dataset                          # Dataset name, no need to modify
    ground_truth:                                        # Configuration for ground truth dataset
      data_path: ./demo_data/omnidocbench_demo/mds       # Path to OmniDocBench markdown folder
      page_info: ./demo_data/omnidocbench_demo/OmniDocBench_demo.json          # Path to OmniDocBench JSON file, mainly used to get page-level attributes
    prediction:                                          # Configuration for model predictions
      data_path: ./demo_data/end2end                     # Folder path for model's PDF page parsing markdown results
    match_method: quick_match                            # Matching method, options: no_split/simple_match/quick_match
    filter:                                              # Page-level filtering
      language: english                                  # Page attributes and corresponding tags to evaluate

The data_path under prediction is the folder path for the model's PDF page parsing results, which contains markdown files corresponding to each page. The filenames match the image names, with only the .jpg extension replaced with .md.

The data_path under ground_truth is the path to OmniDocBench's markdown folder, with filenames corresponding one-to-one with the model's PDF page parsing markdown results. The page_info path under ground_truth is the path to OmniDocBench's JSON file, mainly used to obtain page-level attributes. If page-level attribute evaluation results are not needed, this field can be commented out. However, without configuring the page_info field under ground_truth, the filter related functionality cannot be used.

For explanations of the other fields in the config, please refer to the End-to-End Evaluation Method - end2end section.

Formula Recognition Evaluation

OmniDocBench contains bounding box information for formulas on each PDF page along with corresponding formula recognition annotations, making it suitable as a benchmark for formula recognition evaluation. Formulas include display formulas (equation_isolated) and inline formulas (equation_inline). Currently, this repo provides examples for evaluating display formulas.

| Models | CDM | ExpRate@CDM | BLEU | Norm Edit |
|---|---|---|---|---|
| GOT-OCR | 74.1 | 28.0 | 55.07 | 0.290 |
| Mathpix | 86.6 | 2.8 | 66.56 | 0.322 |
| Pix2Tex | 73.9 | 39.5 | 46.00 | 0.337 |
| UniMERNet-B | 85.0 | 60.2 | 60.84 | 0.238 |
| GPT4o | 86.8 | 65.5 | 45.17 | 0.282 |
| InternVL2-Llama3-76B | 67.4 | 54.5 | 47.63 | 0.308 |
| Qwen2-VL-72B | 83.8 | 55.4 | 53.71 | 0.285 |

Component-level formula recognition evaluation on OmniDocBench formula subset.

Formula recognition evaluation can be configured according to formula_recognition.

Field explanations for formula_recognition.yaml

The configuration of formula_recognition.yaml is as follows:

recogition_eval:      # Specify task name, common for all recognition-related tasks
  metrics:            # Configure metrics to use
    - Edit_dist       # Normalized Edit Distance
    - CDM             # Only supports exporting formats required for CDM evaluation, stored in results
  dataset:                                                                   # Dataset configuration
    dataset_name: omnidocbench_single_module_dataset                         # Dataset name, no need to modify if following specified input format
    ground_truth:                                                            # Ground truth dataset configuration  
      data_path: ./demo_data/recognition/OmniDocBench_demo_formula.json      # JSON file containing both ground truth and model prediction results
      data_key: latex                                                        # Field name storing Ground Truth, for OmniDocBench, formula recognition results are stored in latex field
      category_filter: ['equation_isolated']                                 # Categories used for evaluation, in formula recognition, the category_name is equation_isolated
    prediction:                                                              # Model prediction configuration
      data_key: pred                                                         # Field name storing model prediction results, this is user-defined
    category_type: formula                                                   # category_type is mainly used for selecting data preprocessing strategy, options: formula/text

For the metrics section, in addition to the supported metrics, it also supports exporting formats required for CDM evaluation. Simply configure the CDM field in metrics to organize the output into CDM input format, which will be stored in result.

For the dataset section, the file at ground_truth data_path keeps the OmniDocBench format, with one custom field added to each formula sample to store the model's prediction. The name of that field is set through data_key under prediction in dataset, for example pred. For more details about OmniDocBench's file structure, please refer to the Dataset Format section. The input format for model results can be found in OmniDocBench_demo_formula, which follows this format:

[{
    "layout_dets": [    // List of page elements
        {
            "category_type": "equation_isolated",  // OmniDocBench category name
            "poly": [    // OmniDocBench position info, coordinates for top-left, top-right, bottom-right, bottom-left corners (x,y)
                136.0, 
                781.0,
                340.0,
                781.0,
                340.0,
                806.0,
                136.0,
                806.0
            ],
            ...   // Other OmniDocBench fields
            "latex": "$xxx$",  // LaTeX formula will be written here
            "pred": "$xxx$"    // !! Model prediction result stored here, user-defined new field at same level as ground truth
        },
        ...
    ],
    "page_info": {...},        // OmniDocBench page information
    "extra": {...}             // OmniDocBench annotation relationship information
},
...
]
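
Before running the evaluation, it can help to confirm that every sample of the evaluated category carries a prediction. The following is a minimal sketch with a hypothetical helper name; it assumes the prediction field is called pred, as in the config above, and that an empty string counts as missing.

```python
def prediction_coverage(pages, category, pred_key="pred"):
    """Return (filled, total): annotations of `category` with and without regard to predictions."""
    total = 0
    filled = 0
    for page in pages:
        for anno in page["layout_dets"]:
            if anno.get("category_type") != category:
                continue
            total += 1
            if anno.get(pred_key):  # missing key and empty string both count as unfilled
                filled += 1
    return filled, total


# Illustrative data in the documented format; LaTeX strings are placeholders.
pages = [
    {"layout_dets": [
        {"category_type": "equation_isolated", "latex": "$a+b$", "pred": "$a+b$"},
        {"category_type": "equation_isolated", "latex": "$c$"},
        {"category_type": "text_block", "text": "plain text"},
    ]},
]

coverage = prediction_coverage(pages, "equation_isolated")
```

A result where filled is smaller than total points to annotations the inference script skipped or failed on.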

Here is a model inference script for reference:

import os
import json
from PIL import Image

def poly2bbox(poly):
    L = poly[0]
    U = poly[1]
    R = poly[2]
    D = poly[5]
    L, R = min(L, R), max(L, R)
    U, D = min(U, D), max(U, D)
    bbox = [L, U, R, D]
    return bbox

question = "<image>\nPlease convert this cropped image directly into latex."

# NOTE: `model` is not defined in this script; load your own model before running.
with open('./demo_data/omnidocbench_demo/OmniDocBench_demo.json', 'r', encoding='utf-8') as f:
    samples = json.load(f)

for sample in samples:
    img_name = os.path.basename(sample['page_info']['image_path'])
    img_path = os.path.join('./Docparse/images', img_name)

    if not os.path.exists(img_path):   # Check before opening, since Image.open fails on a missing file
        print('No exist: ', img_name)
        continue
    img = Image.open(img_path)

    for i, anno in enumerate(sample['layout_dets']):
        if anno['category_type'] != 'equation_isolated':   # Keep only the equation_isolated category for evaluation
            continue

        bbox = poly2bbox(anno['poly'])
        im = img.crop(bbox).convert('RGB')
        response = model.chat(im, question)  # Modify the way the image is passed in according to the model
        anno['pred'] = response              # Directly add a new field to store the model's inference results under the corresponding annotation

with open('./demo_data/recognition/OmniDocBench_demo_formula.json', 'w', encoding='utf-8') as f:
    json.dump(samples, f, ensure_ascii=False)

OCR Text Recognition Evaluation

OmniDocBench contains bounding box information and corresponding text recognition annotations for all text in each PDF page, making it suitable as a benchmark for OCR evaluation. The text annotations include both block-level and span-level annotations, both of which can be used for evaluation. This repo currently provides an example of block-level evaluation, which evaluates OCR at the text paragraph level.

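Component-level evaluation on OmniDocBench OCR subset: results grouped by text attributes using the edit distance metric.

| Model Type | Model | Language: EN | Language: ZH | Language: Mixed | Text background: White | Text background: Single | Text background: Multi | Text Rotate: Normal | Text Rotate: Rotate90 | Text Rotate: Rotate270 | Text Rotate: Horizontal |
|---|---|---|---|---|---|---|---|---|---|---|---|
| Expert Vision Models | PaddleOCR | 0.071 | 0.055 | 0.118 | 0.060 | 0.038 | 0.0848 | 0.060 | 0.015 | 0.285 | 0.021 |
| Expert Vision Models | Tesseract OCR | 0.179 | 0.553 | 0.553 | 0.453 | 0.463 | 0.394 | 0.448 | 0.369 | 0.979 | 0.982 |
| Expert Vision Models | Surya | 0.057 | 0.123 | 0.164 | 0.093 | 0.186 | 0.235 | 0.104 | 0.634 | 0.767 | 0.255 |
| Expert Vision Models | GOT-OCR | 0.041 | 0.112 | 0.135 | 0.092 | 0.052 | 0.155 | 0.091 | 0.562 | 0.966 | 0.097 |
| Expert Vision Models | Mathpix | 0.033 | 0.240 | 0.261 | 0.185 | 0.121 | 0.166 | 0.180 | 0.038 | 0.185 | 0.638 |
| Vision Language Models | Qwen2-VL-72B | 0.072 | 0.274 | 0.286 | 0.234 | 0.155 | 0.148 | 0.223 | 0.273 | 0.721 | 0.067 |
| Vision Language Models | InternVL2-Llama3-76B | 0.074 | 0.155 | 0.242 | 0.113 | 0.352 | 0.269 | 0.132 | 0.610 | 0.907 | 0.595 |
| Vision Language Models | GPT4o | 0.020 | 0.224 | 0.125 | 0.167 | 0.140 | 0.220 | 0.168 | 0.115 | 0.718 | 0.132 |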

OCR text recognition evaluation can be configured according to ocr.

Field explanations for ocr.yaml

The configuration of ocr.yaml is as follows:

recogition_eval:      # Specify task name, common for all recognition-related tasks
  metrics:            # Configure metrics to use
    - Edit_dist       # Normalized Edit Distance
    - BLEU
    - METEOR
  dataset:                                                                   # Dataset configuration
    dataset_name: omnidocbench_single_module_dataset                         # Dataset name, no need to modify if following the specified input format
    ground_truth:                                                            # Ground truth dataset configuration
      data_path: ./demo_data/recognition/OmniDocBench_demo_text_ocr.json     # JSON file containing both ground truth and model prediction results
      data_key: text                                                         # Field name storing Ground Truth, for OmniDocBench, text recognition results are stored in the text field, all block level annotations containing text field will participate in evaluation
    prediction:                                                              # Model prediction configuration
      data_key: pred                                                         # Field name storing model prediction results, this is user-defined
    category_type: text                                                      # category_type is mainly used for selecting data preprocessing strategy, options: formula/text

For the dataset section, the input ground_truth data_path follows the same data format as OmniDocBench, with just a new custom field added under samples containing the text field to store the model's prediction results. The field storing prediction information is specified through the data_key under the prediction field in dataset, for example pred. The input format of the dataset can be referenced in OmniDocBench_demo_text_ocr, and the meanings of various fields can be found in the examples provided in the Formula Recognition Evaluation section.

Here is a reference model inference script for your consideration:

import os
import json
from PIL import Image

def poly2bbox(poly):
    L = poly[0]
    U = poly[1]
    R = poly[2]
    D = poly[5]
    L, R = min(L, R), max(L, R)
    U, D = min(U, D), max(U, D)
    bbox = [L, U, R, D]
    return bbox

question = "<image>\nPlease recognize the text in this cropped image."  # Example OCR prompt, adjust for your model

# NOTE: `model` is not defined in this script; load your own model before running.
with open('./demo_data/omnidocbench_demo/OmniDocBench_demo.json', 'r', encoding='utf-8') as f:
    samples = json.load(f)

for sample in samples:
    img_name = os.path.basename(sample['page_info']['image_path'])
    img_path = os.path.join('./Docparse/images', img_name)

    if not os.path.exists(img_path):   # Check before opening, since Image.open fails on a missing file
        print('No exist: ', img_name)
        continue
    img = Image.open(img_path)

    for i, anno in enumerate(sample['layout_dets']):
        if not anno.get('text'):             # Skip annotations without a text field; only annotations with text are sent to the model
            continue

        bbox = poly2bbox(anno['poly'])
        im = img.crop(bbox).convert('RGB')
        response = model.chat(im, question)  # Modify the way the image is passed in according to the model
        anno['pred'] = response              # Directly add a new field to store the model's inference results under the corresponding annotation

with open('./demo_data/recognition/OmniDocBench_demo_text_ocr.json', 'w', encoding='utf-8') as f:
    json.dump(samples, f, ensure_ascii=False)

Table Recognition Evaluation

OmniDocBench contains bounding box information for tables on each PDF page along with corresponding table recognition annotations, making it suitable as a benchmark for table recognition evaluation. The table annotations are available in both HTML and LaTeX formats, with this repository currently providing examples for HTML format evaluation.

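| Model Type | Model | Language: EN | Language: ZH | Language: Mixed | Table Frame Type: Full | Table Frame Type: Omission | Table Frame Type: Three | Table Frame Type: Zero | Special Situation: Merge Cell(+/-) | Special Situation: Formula(+/-) | Special Situation: Colorful(+/-) | Special Situation: Rotate(+/-) | Overall |
|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
| OCR-based Models | PaddleOCR | 76.8 | 71.8 | 80.1 | 67.9 | 74.3 | 81.1 | 74.5 | 70.6/75.2 | 71.3/74.1 | 72.7/74.0 | 23.3/74.6 | 73.6 |
| OCR-based Models | RapidTable | 80.0 | 83.2 | 91.2 | 83.0 | 79.7 | 83.4 | 78.4 | 77.1/85.4 | 76.7/83.9 | 77.6/84.9 | 25.2/83.7 | 82.5 |
| Expert VLMs | StructEqTable | 72.0 | 72.6 | 81.7 | 68.8 | 64.3 | 80.7 | 85.0 | 65.1/76.8 | 69.4/73.5 | 66.8/75.7 | 44.1/73.3 | 72.7 |
| Expert VLMs | GOT-OCR | 72.2 | 75.5 | 85.4 | 73.1 | 72.7 | 78.2 | 75.7 | 65.0/80.2 | 64.3/77.3 | 70.8/76.9 | 8.5/76.3 | 74.9 |
| General VLMs | Qwen2-VL-7B | 70.2 | 70.7 | 82.4 | 70.2 | 62.8 | 74.5 | 80.3 | 60.8/76.5 | 63.8/72.6 | 71.4/70.8 | 20.0/72.1 | 71.0 |
| General VLMs | InternVL2-8B | 70.9 | 71.5 | 77.4 | 69.5 | 69.2 | 74.8 | 75.8 | 58.7/78.4 | 62.4/73.6 | 68.2/73.1 | 20.4/72.6 | 71.5 |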

Component-level Table Recognition evaluation on OmniDocBench table subset. (+/-) means with/without special situation.

Table recognition evaluation can be configured according to table_recognition.

For tables predicted in LaTeX format, the evaluation code automatically converts LaTeX to HTML with the latexml tool before evaluation, so users need to install latexml beforehand.

Field explanations for table_recognition.yaml

The configuration of table_recognition.yaml is as follows:

recogition_eval:      # Specify task name, common for all recognition-related tasks
  metrics:            # Configure metrics to use
    - TEDS            # Tree Edit Distance based Similarity
    - Edit_dist       # Normalized Edit Distance
  dataset:                                                                   # Dataset configuration
    dataset_name: omnidocbench_single_module_dataset                         # Dataset name, no need to modify if following specified input format
    ground_truth:                                                            # Configuration for ground truth dataset
      data_path: ./demo_data/recognition/OmniDocBench_demo_table.json        # JSON file containing both ground truth and model prediction results
      data_key: html                                                         # Field name storing Ground Truth, for OmniDocBench, table recognition results are stored in html and latex fields, change to latex when evaluating latex format tables
      category_filter: table                                                 # Category for evaluation, in table recognition, the category_name is table
    prediction:                                                              # Configuration for model prediction results
      data_key: pred                                                         # Field name storing model prediction results, this is user-defined
    category_type: table                                                     # category_type is mainly used for data preprocessing strategy selection
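
To make the Edit_dist metric concrete, here is a minimal sketch of a normalized edit distance: the Levenshtein distance divided by the length of the longer string. The function names are illustrative, and the benchmark's own implementation may preprocess or normalize differently.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a                      # keep the inner row as short as possible
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1,          # deletion
                            curr[j - 1] + 1,      # insertion
                            prev[j - 1] + cost))  # substitution
        prev = curr
    return prev[-1]

def normalized_edit_distance(pred: str, gt: str) -> float:
    """0.0 means identical strings; 1.0 is the maximum possible distance."""
    longest = max(len(pred), len(gt))
    if longest == 0:
        return 0.0
    return levenshtein(pred, gt) / longest

print(normalized_edit_distance("<td>a</td>", "<td>b</td>"))  # 0.1
```

Lower is better for this metric, whereas TEDS is a similarity score where higher is better.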

For the dataset section, the data format in the ground_truth's data_path remains consistent with OmniDocBench, with only a custom field added under the corresponding table sample to store the model's prediction result. The field storing prediction information is specified through data_key under the prediction field in dataset, such as pred. For more details about OmniDocBench's file structure, please refer to the "Dataset Introduction" section. The input format for model results can be found in OmniDocBench_demo_table, which follows this format:

[{
    "layout_dets": [    // List of page elements
        {
            "category_type": "table",  // OmniDocBench category name
            "poly": [    // OmniDocBench position info: x,y coordinates for top-left, top-right, bottom-right, bottom-left corners
                136.0, 
                781.0,
                340.0,
                781.0,
                340.0,
                806.0,
                136.0,
                806.0
            ],
            ...   // Other OmniDocBench fields
            "latex": "$xxx$",  // Table LaTeX annotation goes here
            "html": "$xxx$",  // Table HTML annotation goes here
            "pred": "$xxx$"    // !! Model prediction result stored here, user-defined new field at same level as ground truth
        },
        ...
    ],
    "page_info": {...},        // OmniDocBench page information
    "extra": {...}             // OmniDocBench annotation relationship information
},
...
]

Here is a model inference script for reference:

import os
import json
from PIL import Image

def poly2bbox(poly):
    L = poly[0]
    U = poly[1]
    R = poly[2]
    D = poly[5]
    L, R = min(L, R), max(L, R)
    U, D = min(U, D), max(U, D)
    bbox = [L, U, R, D]
    return bbox

question = "<image>\nPlease convert this cropped image directly into html format of table."

# NOTE: `model` is a placeholder that this script does not define; load your own model before running.

with open('./demo_data/omnidocbench_demo/OmniDocBench_demo.json', 'r', encoding='utf-8') as f:
    samples = json.load(f)

for sample in samples:
    img_name = os.path.basename(sample['page_info']['image_path'])
    img_path = os.path.join('./demo_data/omnidocbench_demo/images', img_name)

    if not os.path.exists(img_path):   # Check before opening; Image.open raises on a missing file
        print('No exist: ', img_name)
        continue

    img = Image.open(img_path)

    for anno in sample['layout_dets']:
        if anno['category_type'] != 'table':   # Keep only table elements for evaluation
            continue

        bbox = poly2bbox(anno['poly'])
        im = img.crop(bbox).convert('RGB')
        response = model.chat(im, question)  # Need to modify the way the image is passed in depending on the model
        anno['pred'] = response              # Directly add a new field to store the model's inference result at the same level as the ground truth

with open('./demo_data/recognition/OmniDocBench_demo_table.json', 'w', encoding='utf-8') as f:
    json.dump(samples, f, ensure_ascii=False)
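
Before running the evaluation, it can help to confirm that every table element actually received a prediction. The sketch below assumes the file layout shown above (a list of pages, each with a layout_dets list) and counts table elements that lack the pred field. The function name is hypothetical.

```python
def count_missing_predictions(samples, pred_key="pred", category="table"):
    """Return (total, missing) for elements of the given category."""
    total = missing = 0
    for sample in samples:
        for anno in sample.get("layout_dets", []):
            if anno.get("category_type") != category:
                continue
            total += 1
            if not anno.get(pred_key):   # field absent or prediction empty
                missing += 1
    return total, missing

if __name__ == "__main__":
    demo = [{"layout_dets": [
        {"category_type": "table", "pred": "<table></table>"},
        {"category_type": "table"},
        {"category_type": "text_block"},
    ]}]
    print(count_missing_predictions(demo))   # (2, 1)
```

For a real run, load the written JSON file with json.load and pass the resulting list in place of the demo data.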

Layout Detection

OmniDocBench contains bounding box information for all document components on each PDF page, making it suitable as a benchmark for layout detection task evaluation.

Component-level layout detection evaluation on OmniDocBench layout subset: mAP results by PDF page type.
| Model | Book | Slides | Research Report | Textbook | Exam Paper | Magazine | Academic Literature | Notes | Newspaper | Average mAP |
|---|---|---|---|---|---|---|---|---|---|---|
| DiT-L | 43.44 | 13.72 | 45.85 | 15.45 | 3.40 | 29.23 | 66.13 | 0.21 | 23.65 | 26.90 |
| LayoutLMv3 | 42.12 | 13.63 | 43.22 | 21.00 | 5.48 | 31.81 | 64.66 | 0.80 | 30.84 | 28.84 |
| DOCX-Chain | 30.86 | 11.71 | 39.62 | 19.23 | 10.67 | 23.00 | 41.60 | 1.80 | 16.96 | 21.27 |
| DocLayout-YOLO | 43.71 | 48.71 | 72.83 | 42.67 | 35.40 | 51.44 | 66.84 | 9.54 | 57.54 | 48.71 |

For the layout detection config file, refer to layout_detection; for the prediction data format, refer to detection_prediction.

The field explanation of layout_detection.yaml

Here is the configuration file for layout_detection.yaml:

detection_eval:   # Specify task name, common for all detection-related tasks
  metrics:
    - COCODet     # Detection task related metrics, mainly mAP, mAR etc.
  dataset: 
    dataset_name: detection_dataset_simple_format       # Dataset name, no need to modify if following specified input format
    ground_truth:
      data_path: ./demo_data/omnidocbench_demo/OmniDocBench_demo.json               # Path to OmniDocBench JSON file
    prediction:
      data_path: ./demo_data/detection/detection_prediction.json                    # Path to model prediction result JSON file
    filter:                                             # Page level filtering
      data_source: exam_paper                           # Page attributes and corresponding tags to be evaluated
  categories:
    eval_cat:                # Categories participating in final evaluation
      block_level:           # Block level categories, see OmniDocBench evaluation set introduction for details
        - title              # Title
        - text               # Text  
        - abandon            # Includes headers, footers, page numbers, and page annotations
        - figure             # Image
        - figure_caption     # Image caption
        - table              # Table
        - table_caption      # Table caption
        - table_footnote     # Table footnote
        - isolate_formula    # Display formula (this is a layout display formula, lower priority than 14)
        - formula_caption    # Display formula label
    gt_cat_mapping:          # Mapping table from ground truth to final evaluation categories, key is ground truth category, value is final evaluation category name
      figure_footnote: figure_footnote
      figure_caption: figure_caption 
      page_number: abandon 
      header: abandon 
      page_footnote: abandon
      table_footnote: table_footnote 
      code_txt: figure 
      equation_caption: formula_caption 
      equation_isolated: isolate_formula
      table: table 
      reference: text 
      table_caption: table_caption 
      figure: figure 
      title: title 
      text_block: text 
      footer: abandon
    pred_cat_mapping:       # Mapping table from prediction to final evaluation categories, key is prediction category, value is final evaluation category name
      title : title
      plain text: text
      abandon: abandon
      figure: figure
      figure_caption: figure_caption
      table: table
      table_caption: table_caption
      table_footnote: table_footnote
      isolate_formula: isolate_formula
      formula_caption: formula_caption
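
The interplay of eval_cat, gt_cat_mapping, and pred_cat_mapping can be illustrated with a short sketch: each raw category name is translated through its mapping, and anything that does not land in eval_cat is dropped. This illustrates the idea under that assumption and is not the evaluation code itself; remap_categories is a hypothetical helper.

```python
def remap_categories(items, mapping, eval_cat):
    """Translate raw category names and keep only evaluated categories.

    items:    list of (raw_category, bbox) pairs
    mapping:  raw category name -> evaluation category name
    eval_cat: set of category names taking part in the evaluation
    """
    kept = []
    for raw_cat, bbox in items:
        cat = mapping.get(raw_cat)       # None if the category is unmapped
        if cat in eval_cat:
            kept.append((cat, bbox))
    return kept

eval_cat = {"title", "text", "abandon", "table"}
gt_cat_mapping = {"text_block": "text", "header": "abandon", "table": "table"}
pred_cat_mapping = {"plain text": "text", "abandon": "abandon", "table": "table"}

gt = [("text_block", [0, 0, 10, 10]), ("header", [0, 20, 10, 25]), ("figure", [0, 30, 10, 40])]
pred = [("plain text", [0, 0, 10, 11]), ("table", [5, 5, 8, 8])]

print(remap_categories(gt, gt_cat_mapping, eval_cat))
# [('text', [0, 0, 10, 10]), ('abandon', [0, 20, 10, 25])]
print(remap_categories(pred, pred_cat_mapping, eval_cat))
# [('text', [0, 0, 10, 11]), ('table', [5, 5, 8, 8])]
```

After remapping, ground truth and predictions share one vocabulary, so a detector with its own label set (for example "plain text") can be compared against OmniDocBench labels (for example "text_block").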

The filter field can be used to filter the dataset. For example, setting the filter field under dataset to data_source: exam_paper will filter for pages with data type exam_paper. For more page attributes, please refer to the "Evaluation Set Introduction" section. If you want to evaluate the full dataset, comment out the filter related fields.
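
As a sketch of what page-level filtering amounts to, the snippet below keeps only pages whose attributes match the requested values. It assumes page attributes are stored under page_info and page_attribute in each page entry; the helper name is hypothetical, and the evaluation code may implement filtering differently.

```python
def filter_pages(samples, page_filter):
    """Keep pages whose page attributes match every key/value in page_filter."""
    if not page_filter:                      # no filter: evaluate everything
        return list(samples)
    kept = []
    for sample in samples:
        # Assumed location of page attributes within a page entry
        attrs = sample.get("page_info", {}).get("page_attribute", {})
        if all(attrs.get(key) == value for key, value in page_filter.items()):
            kept.append(sample)
    return kept

pages = [
    {"page_info": {"page_attribute": {"data_source": "exam_paper"}}},
    {"page_info": {"page_attribute": {"data_source": "book"}}},
]
print(len(filter_pages(pages, {"data_source": "exam_paper"})))  # 1
print(len(filter_pages(pages, None)))                           # 2
```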

The data_path under the prediction section in dataset takes the model's prediction as input, with the following data format:

{
    "results": [
        {
            "image_name": "docstructbench_llm-raw-scihub-o.O-adsc.201190003.pdf_6",                     // image name
            "bbox": [53.892921447753906, 909.8675537109375, 808.5555419921875, 1006.2714233398438],     // bounding box coordinates, representing x,y coordinates of top-left and bottom-right corners
            "category_id": 1,                                                                           // category ID number
            "score": 0.9446213841438293                                                                 // confidence score
        }, 
        ...                                                                                             // all bounding boxes are flattened in a single list
    ],
    "categories": {"0": "title", "1": "plain text", "2": "abandon", ...}                                // mapping between category IDs and category names
}
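
A detector's raw output usually needs a small conversion step to reach this format. The sketch below assembles the results list and the categories map from in-memory detections. The input tuple layout and the function name are assumptions made for illustration.

```python
import json

def build_prediction_file(detections, category_names):
    """detections: iterable of (image_name, [x1, y1, x2, y2], category_id, score)."""
    results = [
        {
            "image_name": image_name,
            "bbox": [float(v) for v in bbox],   # top-left and bottom-right corners
            "category_id": int(category_id),
            "score": float(score),
        }
        for image_name, bbox, category_id, score in detections
    ]
    # JSON object keys are strings, matching the "0", "1", ... keys in the format above
    categories = {str(i): name for i, name in enumerate(category_names)}
    return {"results": results, "categories": categories}

payload = build_prediction_file(
    [("page_1", [10, 20, 110, 220], 1, 0.94)],
    ["title", "plain text", "abandon"],
)
print(json.dumps(payload, indent=2))
```

The resulting dictionary can be written with json.dump to the path given as data_path under prediction.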

Formula Detection

OmniDocBench contains bounding box information for each formula on each PDF page, making it suitable as a benchmark for formula detection task evaluation.

The format for formula detection is essentially the same as layout detection. Formulas include both inline and display formulas. In this section, we provide a config example that can evaluate detection results for both display formulas and inline formulas simultaneously. Formula detection can be configured according to formula_detection.

The field explanation of formula_detection.yaml

Here is the configuration file for formula_detection.yaml:

detection_eval:   # Specify task name, common for all detection-related tasks
  metrics:
    - COCODet     # Detection task related metrics, mainly mAP, mAR etc.
  dataset: 
    dataset_name: detection_dataset_simple_format       # Dataset name, no need to modify if following specified input format
    ground_truth:
      data_path: ./demo_data/omnidocbench_demo/OmniDocBench_demo.json               # Path to OmniDocBench JSON file
    prediction:
      data_path: ./demo_data/detection/detection_prediction.json                     # Path to model prediction JSON file
    filter:                                             # Page-level filtering
      data_source: exam_paper                           # Page attributes and corresponding tags to evaluate
  categories:
    eval_cat:                                  # Categories participating in final evaluation
      block_level:                             # Block level categories, see OmniDocBench dataset intro for details
        - isolate_formula                      # Display formula
      span_level:                              # Span level categories, see OmniDocBench dataset intro for details
        - inline_formula                       # Inline formula
    gt_cat_mapping:                            # Mapping table from ground truth to final evaluation categories, key is ground truth category, value is final evaluation category name
      equation_isolated: isolate_formula
      equation_inline: inline_formula
    pred_cat_mapping:                          # Mapping table from prediction to final evaluation categories, key is prediction category, value is final evaluation category name
      interline_formula: isolate_formula
      inline_formula: inline_formula

Please refer to the Layout Detection section for parameter explanations and dataset format. The main difference between formula detection and layout detection is that under the eval_cat category that participates in the final evaluation, a span_level category inline_formula has been added. Both span_level and block_level categories will participate together in the evaluation.
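
To illustrate how block-level and span-level categories can be evaluated together, the sketch below gathers both kinds of formula boxes from one ground-truth page. It assumes span-level annotations are nested inside each block under a line_with_spans list; that key, like the function name, is an assumption to be checked against the dataset introduction.

```python
def collect_formula_boxes(sample, spans_key="line_with_spans"):
    """Return (category, poly) pairs for display and inline formulas on one page."""
    boxes = []
    for block in sample.get("layout_dets", []):
        # Block level: display formulas
        if block.get("category_type") == "equation_isolated":
            boxes.append(("isolate_formula", block["poly"]))
        # Span level: inline formulas nested inside a block (assumed key)
        for span in block.get(spans_key, []):
            if span.get("category_type") == "equation_inline":
                boxes.append(("inline_formula", span["poly"]))
    return boxes

page = {"layout_dets": [
    {"category_type": "equation_isolated", "poly": [0, 0, 9, 0, 9, 5, 0, 5]},
    {"category_type": "text_block", "poly": [0, 10, 9, 10, 9, 20, 0, 20],
     "line_with_spans": [
         {"category_type": "text_span", "poly": [0, 10, 4, 10, 4, 12, 0, 12]},
         {"category_type": "equation_inline", "poly": [5, 10, 9, 10, 9, 12, 5, 12]},
     ]},
]}
print(collect_formula_boxes(page))
```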

Tools

We provide several tools in the tools directory:

  • json2md for converting OmniDocBench from JSON format to Markdown format;
  • visualization for visualizing OmniDocBench JSON files;
  • The model_infer folder provides some model inference scripts for reference, including:
    • mathpix_img2md.py for calling mathpix API to convert images to Markdown format;
    • internvl2_test_img2md.py for using InternVL2 model to convert images to Markdown format, please use after configuring the InternVL2 model environment;
    • GOT_img2md.py for using GOT-OCR model to convert images to Markdown format, please use after configuring the GOT-OCR model environment;
    • Qwen2VL_img2md.py for using QwenVL model to convert images to Markdown format, please use after configuring the QwenVL model environment;

TODO

  • Integration of match_full algorithm
  • Optimization of matching post-processing for model-specific output formats
  • Addition of Unicode mapping table for special characters

Known Issues

  • Some models occasionally produce non-standard output formats (e.g., recognizing multi-column text as tables, or formulas as Unicode text), leading to matching failures. This can be mitigated by post-processing the model output.
  • Because symbol recognition capabilities vary across models, some symbols are recognized inconsistently (e.g., list identifiers). Currently, only Chinese and English text are included in text evaluation. A Unicode mapping table will be added later to address this.
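
As a rough illustration of what symbol normalization could look like, the sketch below combines Unicode NFKC normalization with a small custom mapping for list markers. The mapping entries and the function name are hypothetical; this is not the Unicode mapping table planned for the benchmark.

```python
import unicodedata

# Hypothetical mapping: fold common list markers onto a single symbol
SYMBOL_MAP = {"\u2022": "-", "\u2023": "-", "\u25cf": "-", "\u00b7": "-"}

def normalize_symbols(text: str) -> str:
    """Apply NFKC normalization, then the custom symbol mapping."""
    text = unicodedata.normalize("NFKC", text)
    return "".join(SYMBOL_MAP.get(ch, ch) for ch in text)

print(normalize_symbols("\u2022 \ufb01rst item"))   # - first item
print(normalize_symbols("\uff21\uff11"))            # A1
```

NFKC folds compatibility characters such as ligatures and full-width letters onto their plain forms, while the explicit table handles symbols that have no compatibility decomposition.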

We welcome everyone to use the OmniDocBench dataset and provide valuable feedback and suggestions to help us continuously improve the dataset quality and evaluation tools. For any comments or suggestions, please feel free to open an issue and we will respond promptly. If you have evaluation scheme optimizations, you can submit a PR and we will review and update in a timely manner.

Acknowledgement

  • Thanks to Abaka AI for supporting the dataset annotation.
  • PubTabNet, for the TEDS metric calculation.
  • latexml, the LaTeX to HTML conversion tool.
  • Tester, the Markdown table to HTML conversion tool.

Copyright Statement

The PDFs are collected from public online channels and community user contributions. Content that is not allowed for distribution has been removed. The dataset is for research purposes only and not for commercial use. If there are any copyright concerns, please contact [email protected].

Citation

@misc{ouyang2024omnidocbenchbenchmarkingdiversepdf,
      title={OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations}, 
      author={Linke Ouyang and Yuan Qu and Hongbin Zhou and Jiawei Zhu and Rui Zhang and Qunshu Lin and Bin Wang and Zhiyuan Zhao and Man Jiang and Xiaomeng Zhao and Jin Shi and Fan Wu and Pei Chu and Minghao Liu and Zhenxiang Li and Chao Xu and Bo Zhang and Botian Shi and Zhongying Tu and Conghui He},
      year={2024},
      eprint={2412.07626},
      archivePrefix={arXiv},
      primaryClass={cs.CV},
      url={https://arxiv.org/abs/2412.07626}, 
}
