OmniDocBench is a benchmark for evaluating diverse document parsing in real-world scenarios, featuring the following characteristics:
- Diverse Document Types: This benchmark includes 981 PDF pages, covering 9 document types, 4 layout types, and 3 language types. It encompasses a wide range of content, including academic papers, financial reports, newspapers, textbooks, and handwritten notes.
- Rich Annotation Information: It contains localization information for 15 block-level (such as text paragraphs, headings, tables, etc., totaling over 20k) and 4 span-level (such as text lines, inline formulas, subscripts, etc., totaling over 80k) document elements. Each element's region includes recognition results (text annotations, LaTeX annotations for formulas, and both LaTeX and HTML annotations for tables). OmniDocBench also provides annotations for the reading order of document components. Additionally, it includes various attribute tags at the page and block levels, with annotations for 5 page attribute tags, 3 text attribute tags, and 6 table attribute tags.
- High Annotation Quality: Annotations were produced through manual screening, intelligent annotation, and manual annotation, followed by quality checks from both experts and large models.
- Supporting Evaluation Code: It includes end-to-end and single-module evaluation code to ensure fairness and accuracy in assessments.
OmniDocBench is designed for Document Parsing, featuring rich annotations for evaluation across several dimensions:
- End-to-end evaluation
- Layout detection
- Table recognition
- Formula recognition
- Text OCR
Currently supported metrics include:
- Normalized Edit Distance
- BLEU
- METEOR
- TEDS
- COCODet (mAP, mAR, etc.)
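As an illustration of the first metric in the list above, here is a minimal sketch of a normalized edit distance. It assumes normalization by the length of the longer string; the repository's own implementation may normalize or preprocess text differently, and the function names here are hypothetical.

```python
def levenshtein(a: str, b: str) -> int:
    """Classic dynamic-programming edit distance (insert, delete, substitute)."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(previous[j] + 1,          # deletion
                               current[j - 1] + 1,       # insertion
                               previous[j - 1] + cost))  # substitution
        previous = current
    return previous[-1]


def normalized_edit_distance(pred: str, gt: str) -> float:
    """Edit distance divided by the longer length (assumed normalization); 0.0 means identical."""
    longest = max(len(pred), len(gt))
    if longest == 0:
        return 0.0
    return levenshtein(pred, gt) / longest
```

Lower values are better, which matches the ↓ markers used for the Edit columns in the result tables.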
This benchmark includes 981 PDF pages, covering 9 document types, 4 layout types, and 3 language types. OmniDocBench features rich annotations, containing 15 block-level annotations (text paragraphs, headings, tables, etc.) and 4 span-level annotations (text lines, inline formulas, subscripts, etc.). All text-related annotation boxes include text recognition annotations, formulas contain LaTeX annotations, and tables include both LaTeX and HTML annotations. OmniDocBench also provides reading order annotations for document components. Additionally, it includes various attribute tags at the page and block levels, with annotations for 5 page attribute tags, 3 text attribute tags, and 6 table attribute tags.
Dataset Format
The dataset format is JSON, with the following structure and field explanations:
[{
"layout_dets": [ // List of page elements
{
"category_type": "text_block", // Category name
"poly": [
136.0, // Position information, coordinates for top-left, top-right, bottom-right, bottom-left corners (x,y)
781.0,
340.0,
781.0,
340.0,
806.0,
136.0,
806.0
],
"ignore": false, // Whether to ignore during evaluation
"order": 0, // Reading order
"anno_id": 0, // Special annotation ID, unique for each layout box
"text": "xxx", // Optional field, Text OCR results are written here
"latex": "$xxx$", // Optional field, LaTeX for formulas and tables is written here
"html": "xxx", // Optional field, HTML for tables is written here
"attribute": {"xxx": "xxx"}, // Classification attributes for layout, detailed below
"line_with_spans": [ // Span level annotation boxes
{
"category_type": "text_span",
"poly": [...],
"ignore": false,
"text": "xxx",
"latex": "$xxx$",
},
...
],
"merge_list": [ // Only present in annotation boxes with merge relationships, merge logic depends on whether single line break separated paragraphs exist, like list types
{
"category_type": "text_block",
"poly": [...],
... // Same fields as block level annotations
"line_with_spans": [...]
...
},
...
]
},
...
],
"page_info": {
"page_no": 0, // Page number
"height": 1684, // Page height
"width": 1200, // Page width
"image_path": "xx/xx/", // Annotated page filename
"page_attribute": {"xxx": "xxx"} // Page attribute labels
},
"extra": {
"relation": [ // Related annotations
{
"source_anno_id": 1,
"target_anno_id": 2,
"relation_type": "parent_son" // Relationship label between figure/table and their corresponding caption/footnote categories
},
{
"source_anno_id": 5,
"target_anno_id": 6,
"relation_type": "truncated" // Paragraph truncation relationship label due to layout reasons, will be concatenated and evaluated as one paragraph during evaluation
},
]
}
},
...
]
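To make the structure above concrete, the following sketch loads a page record and summarizes it. The embedded sample is hand-written to mimic the documented fields and is not real benchmark data; `summarize_page` is a hypothetical helper, not part of the repository.

```python
import json
from collections import Counter

# Tiny hand-written sample that mimics the documented structure (not real benchmark data).
SAMPLE_JSON = """
[{
  "layout_dets": [
    {"category_type": "title", "poly": [10, 10, 200, 10, 200, 40, 10, 40],
     "ignore": false, "order": 0, "anno_id": 0, "text": "A title"},
    {"category_type": "text_block", "poly": [10, 50, 200, 50, 200, 120, 10, 120],
     "ignore": false, "order": 1, "anno_id": 1, "text": "Some paragraph."},
    {"category_type": "abandon", "poly": [10, 130, 200, 130, 200, 150, 10, 150],
     "ignore": true, "order": null, "anno_id": 2}
  ],
  "page_info": {"page_no": 0, "height": 1684, "width": 1200,
                "image_path": "page_0.jpg",
                "page_attribute": {"data_source": "book"}},
  "extra": {"relation": []}
}]
"""


def summarize_page(sample: dict) -> dict:
    """Count block categories and list non-ignored blocks in reading order."""
    blocks = sample["layout_dets"]
    counts = Counter(block["category_type"] for block in blocks)
    # Keep blocks that take part in evaluation and carry a reading order.
    kept = [b for b in blocks
            if not b.get("ignore", False) and b.get("order") is not None]
    kept.sort(key=lambda b: b["order"])
    return {
        "page_no": sample["page_info"]["page_no"],
        "category_counts": dict(counts),
        "reading_order": [b["anno_id"] for b in kept],
    }


pages = json.loads(SAMPLE_JSON)
summary = summarize_page(pages[0])
print(summary)
```

To run this on the real annotations, replace `SAMPLE_JSON` with the contents of the OmniDocBench JSON file.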
Evaluation Categories
Evaluation categories include:
# Block level annotation boxes
'title', # Title
'text_block', # Paragraph level plain text
'figure', # Figure type
'figure_caption', # Figure description/title
'figure_footnote', # Figure notes
'table', # Table body
'table_caption', # Table description/title
'table_footnote', # Table notes
'equation_isolated', # Display formula
'equation_caption', # Formula number
'header', # Header
'footer', # Footer
'page_number', # Page number
'page_footnote', # Page notes
'abandon', # Other discarded content (e.g. irrelevant information in middle of page)
'code_txt', # Code block
'code_txt_caption', # Code block description
'reference', # References
# Span level annotation boxes
'text_span', # Span level plain text
'equation_ignore', # Formula to be ignored
'equation_inline', # Inline formula
'footnote_mark', # Document superscripts/subscripts
Attribute Labels
Page classification attributes include:
'data_source': #PDF type classification
academic_literature # Academic literature
PPT2PDF # PPT to PDF
book # Black and white books and textbooks
colorful_textbook # Colorful textbooks with images
exam_paper # Exam papers
note # Handwritten notes
magazine # Magazines
research_report # Research reports and financial reports
newspaper # Newspapers
'language': #Language type
en # English
simplified_chinese # Simplified Chinese
en_ch_mixed # English-Chinese mixed
'layout': #Page layout type
single_column # Single column
double_column # Double column
three_column # Three column
1andmore_column # One mixed with multiple columns, common in literature
other_layout # Other layouts
'watermark': # Whether contains watermark
true
false
'fuzzy_scan': # Whether blurry scanned
true
false
'colorful_backgroud': # Whether contains colorful background, content to be recognized has more than two background colors
true
false
Block level attribute - Table related attributes:
'table_layout': # Table orientation
vertical # Vertical table
horizontal # Horizontal table
'with_span': # Merged cells
False
True
'line': # Table borders
full_line # Full borders
less_line # Partial borders
fewer_line # Three-line borders
wireless_line # No borders
'language': # Table language
table_en # English table
table_simplified_chinese # Simplified Chinese table
table_en_ch_mixed # English-Chinese mixed table
'include_equation': # Whether table contains formulas
False
True
'include_backgroud': # Whether table contains background color
False
True
'table_vertical': # Whether table is rotated 90 or 270 degrees
False
True
Block level attribute - Text paragraph related attributes:
'text_language': # Text language
text_en # English
text_simplified_chinese # Simplified Chinese
text_en_ch_mixed # English-Chinese mixed
'text_background': # Text background color
white # Default value, white background
single_colored # Single background color other than white
multi_colored # Multiple background colors
'text_rotate': # Text rotation classification within paragraphs
normal # Default value, horizontal text, no rotation
rotate90 # Rotation angle, 90 degrees clockwise
rotate180 # 180 degrees clockwise
rotate270 # 270 degrees clockwise
horizontal # Text is normal but layout is vertical
Block level attribute - Formula related attributes:
'formula_type': # Formula type
print # Print
handwriting # Handwriting
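The page-level attribute labels above can be used to select subsets of pages. The sketch below assumes each page record stores its labels under `page_info.page_attribute`, as in the dataset format; `filter_pages` and the sample records are hypothetical and only illustrate the idea.

```python
def filter_pages(samples, **required):
    """Keep pages whose page_attribute matches every key/value in `required`."""
    selected = []
    for sample in samples:
        attrs = sample.get("page_info", {}).get("page_attribute", {})
        if all(attrs.get(key) == value for key, value in required.items()):
            selected.append(sample)
    return selected


# Hand-written records for illustration only.
samples = [
    {"page_info": {"page_no": 0,
                   "page_attribute": {"data_source": "exam_paper", "layout": "single_column"}}},
    {"page_info": {"page_no": 1,
                   "page_attribute": {"data_source": "newspaper", "layout": "three_column"}}},
    {"page_info": {"page_no": 2,
                   "page_attribute": {"data_source": "exam_paper", "layout": "double_column"}}},
]

exam_pages = filter_pages(samples, data_source="exam_paper")
print([p["page_info"]["page_no"] for p in exam_pages])
```

Passing several keyword arguments, for example `data_source="exam_paper", layout="double_column"`, narrows the selection to pages matching all of them.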
OmniDocBench has developed an evaluation methodology based on document component segmentation and matching. It provides corresponding metric calculations for four major modules: text, tables, formulas, and reading order. In addition to overall accuracy results, the evaluation also provides fine-grained evaluation results by page and attributes, precisely identifying pain points in model document parsing.
To set up the environment, simply run the following commands in the project directory:
conda create -n omnidocbench python=3.8
conda activate omnidocbench
pip install -r requirements.txt
All evaluation inputs are configured through config files. We provide templates for each task under the configs directory, and we will explain the contents of the config files in detail in the following sections.
After configuring the config file, simply pass it as a parameter and run the following code to perform the evaluation:
python pdf_validation.py --config <config_path>
End-to-end evaluation assesses the model's accuracy in parsing PDF page content. The evaluation uses the model's Markdown output of the entire PDF page parsing results as the prediction.
Method Type | Methods | Text Edit↓ (EN) | Text Edit↓ (ZH) | Formula Edit↓ (EN) | Formula Edit↓ (ZH) | Formula CDM↑ (EN) | Formula CDM↑ (ZH) | Table TEDS↑ (EN) | Table TEDS↑ (ZH) | Table Edit↓ (EN) | Table Edit↓ (ZH) | Read Order Edit↓ (EN) | Read Order Edit↓ (ZH) | Overall Edit↓ (EN) | Overall Edit↓ (ZH) |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|---|
Pipeline Tools | MinerU-0.9.3 | 0.058 | 0.211 | 0.278 | 0.577 | 66.9 | 49.5 | 79.4 | 62.7 | 0.305 | 0.461 | 0.079 | 0.288 | 0.180 | 0.384 |
Pipeline Tools | Marker-0.2.17 | 0.141 | 0.303 | 0.667 | 0.868 | 18.4 | 12.7 | 54.0 | 45.8 | 0.718 | 0.763 | 0.138 | 0.306 | 0.416 | 0.560 |
Pipeline Tools | Mathpix | 0.101 | 0.358 | 0.306 | 0.454 | 71.4 | 72.7 | 77.9 | 68.2 | 0.322 | 0.416 | 0.105 | 0.275 | 0.209 | 0.376 |
Expert VLMs | GOT-OCR | 0.187 | 0.315 | 0.360 | 0.528 | 81.8 | 51.4 | 53.5 | 48.0 | 0.521 | 0.594 | 0.141 | 0.28 | 0.302 | 0.429 |
Expert VLMs | Nougat | 0.365 | 0.998 | 0.488 | 0.941 | 17.4 | 16.9 | 40.3 | 0.0 | 0.622 | 1.000 | 0.382 | 0.954 | 0.464 | 0.973 |
General VLMs | GPT4o | 0.144 | 0.409 | 0.425 | 0.606 | 76.4 | 48.2 | 72.75 | 63.7 | 0.363 | 0.474 | 0.128 | 0.251 | 0.265 | 0.435 |
General VLMs | Qwen2-VL-72B | 0.252 | 0.251 | 0.468 | 0.572 | 54.9 | 60.9 | 59.9 | 66.8 | 0.591 | 0.587 | 0.255 | 0.223 | 0.392 | 0.408 |
General VLMs | InternVL2-Llama3-76B | 0.353 | 0.290 | 0.543 | 0.701 | 69.8 | 49.6 | 63.8 | 61.1 | 0.616 | 0.638 | 0.317 | 0.228 | 0.457 | 0.464 |
More detailed attribute-level evaluation results are shown in the paper.
End-to-end evaluation consists of two approaches:
- end2end: This method uses OmniDocBench's JSON files as Ground Truth. For config file reference, see: end2end
- md2md: This method uses OmniDocBench's markdown format as Ground Truth. Details will be discussed in the next section markdown-to-markdown evaluation.
We recommend using the end2end evaluation approach since it preserves the category and attribute information of samples, enabling special category ignore operations and attribute-level result output.
The end2end evaluation can assess four dimensions. We provide an example of end2end evaluation results in result, including:
- Text paragraphs
- Display formulas
- Tables
- Reading order
Field explanations for end2end.yaml
The configuration of end2end.yaml is as follows:
end2end_eval:  # Specify task name, common for end-to-end evaluation
  metrics:  # Configure metrics to use
    text_block:  # Configuration for text paragraphs
      metric:
        - Edit_dist  # Normalized Edit Distance
        - BLEU
        - METEOR
    display_formula:  # Configuration for display formulas
      metric:
        - Edit_dist
        - CDM  # Only supports exporting format required for CDM evaluation, stored in results
    table:  # Configuration for tables
      metric:
        - TEDS
        - Edit_dist
    reading_order:  # Configuration for reading order
      metric:
        - Edit_dist
  dataset:  # Dataset configuration
    dataset_name: end2end_dataset  # Dataset name, no need to modify
    ground_truth:
      data_path: ./demo_data/omnidocbench_demo/OmniDocBench_demo.json  # Path to OmniDocBench
    prediction:
      data_path: ./demo_data/end2end  # Folder path for model's PDF page parsing markdown results
    match_method: quick_match  # Matching method, options: no_split/simple_match/quick_match
    filter:  # Page-level filtering
      language: english  # Page attributes and corresponding tags to evaluate
The data_path under prediction is the folder path containing the model's PDF page parsing results. The folder contains markdown files for each page, with filenames matching the image names but replacing the .jpg extension with .md.
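The filename convention described above can be sketched as follows. `prediction_path` is a hypothetical helper, and the example image name is made up for illustration; only the rule of swapping the image extension for .md comes from the text.

```python
from pathlib import Path


def prediction_path(image_path: str, prediction_dir: str) -> Path:
    """Map a page image name to the markdown file expected in the prediction folder."""
    # Drop any directory part and swap the image extension for .md.
    md_name = Path(image_path).with_suffix(".md").name
    return Path(prediction_dir) / md_name


expected = prediction_path("images/page_001.jpg", "./demo_data/end2end")
print(expected)
```

Checking that every page image has a matching markdown file before running the evaluation can help catch naming mismatches early.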
In addition to the supported metrics, the system also supports exporting formats required for CDM evaluation. Simply configure the CDM field in the metrics section to format the output for CDM input and store it in result.
For end-to-end evaluation, the config allows selecting different matching methods. There are three matching approaches:
- no_split: Does not split or match text blocks, but rather combines them into a single markdown for calculation. This method will not output attribute-level results or reading order results.
- simple_match: Performs only paragraph segmentation using double line breaks, then directly matches one-to-one with GT without any truncation or merging.
- quick_match: Builds on paragraph segmentation by adding truncation and merging operations to reduce the impact of paragraph segmentation differences on final results, using Adjacency Search Match for truncation and merging.
We recommend using quick_match for better matching results. However, if the model's paragraph segmentation is accurate, simple_match can be used for faster evaluation. The matching method is configured through the match_method field under dataset in the config.
The filter field allows filtering the dataset. For example, setting filter to language: english under dataset will evaluate only pages in English. See the Dataset Introduction section for more page attributes. Comment out the filter fields to evaluate the full dataset.
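The paragraph segmentation step that simple_match is described as using (splitting on double line breaks) might look like the sketch below. This is an illustration under that description only, not the repository's code, and it does not cover the truncation and merging that quick_match adds; `split_paragraphs` is a hypothetical name.

```python
import re
from typing import List


def split_paragraphs(markdown: str) -> List[str]:
    """Split page-level markdown into paragraphs on blank lines (double line breaks)."""
    parts = re.split(r"\n\s*\n", markdown.strip())
    return [part.strip() for part in parts if part.strip()]


page_md = "# Title\n\nFirst paragraph\nstill first.\n\n\nSecond."
paragraphs = split_paragraphs(page_md)
print(paragraphs)
```

A single line break stays inside a paragraph, so a model that emits one line break between paragraphs would produce fewer, longer segments; this is the kind of segmentation difference that quick_match is meant to absorb.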
The markdown-to-markdown evaluation uses the model's markdown output of the entire PDF page parsing as the Prediction, and OmniDocBench's markdown format as the Ground Truth. Please refer to the config file: md2md. We recommend using the end2end approach from the previous section to evaluate with OmniDocBench, as it preserves rich attribute annotations and ignore logic. However, we still provide the md2md evaluation method to align with existing evaluation approaches.
The md2md evaluation can assess four dimensions:
- Text paragraphs
- Display formulas
- Tables
- Reading order
Field explanations for md2md.yaml
The configuration of md2md.yaml is as follows:
end2end_eval:  # Specify task name, common for end-to-end evaluation
  metrics:  # Configure metrics to use
    text_block:  # Configuration for text paragraphs
      metric:
        - Edit_dist  # Normalized Edit Distance
        - BLEU
        - METEOR
    display_formula:  # Configuration for display formulas
      metric:
        - Edit_dist
        - CDM  # Only supports exporting format required for CDM evaluation, stored in results
    table:  # Configuration for tables
      metric:
        - TEDS
        - Edit_dist
    reading_order:  # Configuration for reading order
      metric:
        - Edit_dist
  dataset:  # Dataset configuration
    dataset_name: md2md_dataset  # Dataset name, no need to modify
    ground_truth:  # Configuration for ground truth dataset
      data_path: ./demo_data/omnidocbench_demo/mds  # Path to OmniDocBench markdown folder
      page_info: ./demo_data/omnidocbench_demo/OmniDocBench_demo.json  # Path to OmniDocBench JSON file, mainly used to get page-level attributes
    prediction:  # Configuration for model predictions
      data_path: ./demo_data/end2end  # Folder path for model's PDF page parsing markdown results
    match_method: quick_match  # Matching method, options: no_split/simple_match/quick_match
    filter:  # Page-level filtering
      language: english  # Page attributes and corresponding tags to evaluate
The data_path under prediction is the folder path for the model's PDF page parsing results, which contains markdown files corresponding to each page. The filenames match the image names, with only the .jpg extension replaced with .md.
The data_path under ground_truth is the path to OmniDocBench's markdown folder, with filenames corresponding one-to-one with the model's PDF page parsing markdown results. The page_info path under ground_truth is the path to OmniDocBench's JSON file, mainly used to obtain page-level attributes. If page-level attribute evaluation results are not needed, this field can be commented out. However, without configuring the page_info field under ground_truth, the filter related functionality cannot be used.
For explanations of other fields in the config, please refer to the End-to-end Evaluation - end2end section.
OmniDocBench contains bounding box information for formulas on each PDF page along with corresponding formula recognition annotations, making it suitable as a benchmark for formula recognition evaluation. Formulas include display formulas (equation_isolated) and inline formulas (equation_inline). Currently, this repo provides examples for evaluating display formulas.
Models | CDM | ExpRate@CDM | BLEU | Norm Edit |
---|---|---|---|---|
GOT-OCR | 74.1 | 28.0 | 55.07 | 0.290 |
Mathpix | 86.6 | 2.8 | 66.56 | 0.322 |
Pix2Tex | 73.9 | 39.5 | 46.00 | 0.337 |
UniMERNet-B | 85.0 | 60.2 | 60.84 | 0.238 |
GPT4o | 86.8 | 65.5 | 45.17 | 0.282 |
InternVL2-Llama3-76B | 67.4 | 54.5 | 47.63 | 0.308 |
Qwen2-VL-72B | 83.8 | 55.4 | 53.71 | 0.285 |
Component-level formula recognition evaluation on OmniDocBench formula subset.
Formula recognition evaluation can be configured according to formula_recognition.
Field explanations for formula_recognition.yaml
The configuration of formula_recognition.yaml is as follows:
recogition_eval:  # Specify task name, common for all recognition-related tasks
  metrics:  # Configure metrics to use
    - Edit_dist  # Normalized Edit Distance
    - CDM  # Only supports exporting formats required for CDM evaluation, stored in results
  dataset:  # Dataset configuration
    dataset_name: omnidocbench_single_module_dataset  # Dataset name, no need to modify if following specified input format
    ground_truth:  # Ground truth dataset configuration
      data_path: ./demo_data/recognition/OmniDocBench_demo_formula.json  # JSON file containing both ground truth and model prediction results
      data_key: latex  # Field name storing Ground Truth, for OmniDocBench, formula recognition results are stored in latex field
      category_filter: ['equation_isolated']  # Categories used for evaluation, in formula recognition, the category_name is equation_isolated
    prediction:  # Model prediction configuration
      data_key: pred  # Field name storing model prediction results, this is user-defined
    category_type: formula  # category_type is mainly used for selecting data preprocessing strategy, options: formula/text
For the metrics section, in addition to the supported metrics, it also supports exporting formats required for CDM evaluation. Simply configure the CDM field in metrics to organize the output into CDM input format, which will be stored in result.
For the dataset section, the data format in the ground_truth data_path remains consistent with OmniDocBench, with just a custom field added under the corresponding formula sample to store the model's prediction results. The field storing prediction information is specified through the data_key under the prediction field in dataset, such as pred. For more details about OmniDocBench's file structure, please refer to the "Dataset Introduction" section. The input format for model results can be found in OmniDocBench_demo_formula, which follows this format:
[{
"layout_dets": [ // List of page elements
{
"category_type": "equation_isolated", // OmniDocBench category name
"poly": [ // OmniDocBench position info, coordinates for top-left, top-right, bottom-right, bottom-left corners (x,y)
136.0,
781.0,
340.0,
781.0,
340.0,
806.0,
136.0,
806.0
],
... // Other OmniDocBench fields
"latex": "$xxx$", // LaTeX formula will be written here
"pred": "$xxx$", // !! Model prediction result stored here, user-defined new field at same level as ground truth
...
},
...
],
"page_info": {...}, // OmniDocBench page information
"extra": {...} // OmniDocBench annotation relationship information
},
...
]
Here is a model inference script for reference:
import os
import json
from PIL import Image


def poly2bbox(poly):
    """Convert an 8-value polygon (x, y per corner) to [left, upper, right, lower]."""
    L = poly[0]
    U = poly[1]
    R = poly[2]
    D = poly[5]
    L, R = min(L, R), max(L, R)
    U, D = min(U, D), max(U, D)
    bbox = [L, U, R, D]
    return bbox


question = "<image>\nPlease convert this cropped image directly into latex."

with open('./demo_data/omnidocbench_demo/OmniDocBench_demo.json', 'r', encoding='utf-8') as f:
    samples = json.load(f)

for sample in samples:
    img_name = os.path.basename(sample['page_info']['image_path'])
    img_path = os.path.join('./Docparse/images', img_name)
    # Check that the image exists before trying to open it
    if not os.path.exists(img_path):
        print('No exist: ', img_name)
        continue
    img = Image.open(img_path)
    for i, anno in enumerate(sample['layout_dets']):
        if anno['category_type'] != 'equation_isolated':  # Keep only the equation_isolated category for evaluation
            continue
        bbox = poly2bbox(anno['poly'])
        im = img.crop(bbox).convert('RGB')
        # `model` is a placeholder for your own loaded model; modify how the image is passed in according to the model
        response = model.chat(im, question)
        anno['pred'] = response  # Add a new field to store the model's inference result under the corresponding annotation

with open('./demo_data/recognition/OmniDocBench_demo_formula.json', 'w', encoding='utf-8') as f:
    json.dump(samples, f, ensure_ascii=False)
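The poly2bbox helper in the script above reads only three of the four corners, which is fine for axis-aligned boxes. For rotated or skewed quadrilaterals, a variant that takes the minimum and maximum over all four corners may be safer. The sketch below is a suggestion under that assumption, not the repository's implementation, and `poly_to_bbox` is a hypothetical name.

```python
from typing import List, Sequence


def poly_to_bbox(poly: Sequence[float]) -> List[float]:
    """Return [left, upper, right, lower] enclosing all four (x, y) corners."""
    if len(poly) != 8:
        raise ValueError("expected 8 values: four (x, y) corners")
    xs = poly[0::2]  # x coordinates of the four corners
    ys = poly[1::2]  # y coordinates of the four corners
    return [min(xs), min(ys), max(xs), max(ys)]


axis_aligned = poly_to_bbox([136.0, 781.0, 340.0, 781.0, 340.0, 806.0, 136.0, 806.0])
skewed = poly_to_bbox([10, 5, 50, 0, 55, 40, 12, 45])
print(axis_aligned, skewed)
```

The returned list uses the same (left, upper, right, lower) order that PIL's `Image.crop` expects.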
OmniDocBench contains bounding box information and corresponding text recognition annotations for all text in each PDF page, making it suitable as a benchmark for OCR evaluation. The text annotations include both block-level and span-level annotations, both of which can be used for evaluation. This repo currently provides an example of block-level evaluation, which evaluates OCR at the text paragraph level.
Model Type | Model | Language: EN | Language: ZH | Language: Mixed | Text background: White | Text background: Single | Text background: Multi | Text Rotate: Normal | Text Rotate: Rotate90 | Text Rotate: Rotate270 | Text Rotate: Horizontal |
---|---|---|---|---|---|---|---|---|---|---|---|
Expert Vision Models | PaddleOCR | 0.071 | 0.055 | 0.118 | 0.060 | 0.038 | 0.0848 | 0.060 | 0.015 | 0.285 | 0.021 |
Expert Vision Models | Tesseract OCR | 0.179 | 0.553 | 0.553 | 0.453 | 0.463 | 0.394 | 0.448 | 0.369 | 0.979 | 0.982 |
Expert Vision Models | Surya | 0.057 | 0.123 | 0.164 | 0.093 | 0.186 | 0.235 | 0.104 | 0.634 | 0.767 | 0.255 |
Expert Vision Models | GOT-OCR | 0.041 | 0.112 | 0.135 | 0.092 | 0.052 | 0.155 | 0.091 | 0.562 | 0.966 | 0.097 |
Expert Vision Models | Mathpix | 0.033 | 0.240 | 0.261 | 0.185 | 0.121 | 0.166 | 0.180 | 0.038 | 0.185 | 0.638 |
Vision Language Models | Qwen2-VL-72B | 0.072 | 0.274 | 0.286 | 0.234 | 0.155 | 0.148 | 0.223 | 0.273 | 0.721 | 0.067 |
Vision Language Models | InternVL2-Llama3-76B | 0.074 | 0.155 | 0.242 | 0.113 | 0.352 | 0.269 | 0.132 | 0.610 | 0.907 | 0.595 |
Vision Language Models | GPT4o | 0.020 | 0.224 | 0.125 | 0.167 | 0.140 | 0.220 | 0.168 | 0.115 | 0.718 | 0.132 |
OCR text recognition evaluation can be configured according to ocr.
Field explanations for ocr.yaml
The configuration of ocr.yaml is as follows:
recogition_eval:  # Specify task name, common for all recognition-related tasks
  metrics:  # Configure metrics to use
    - Edit_dist  # Normalized Edit Distance
    - BLEU
    - METEOR
  dataset:  # Dataset configuration
    dataset_name: omnidocbench_single_module_dataset  # Dataset name, no need to modify if following the specified input format
    ground_truth:  # Ground truth dataset configuration
      data_path: ./demo_data/recognition/OmniDocBench_demo_text_ocr.json  # JSON file containing both ground truth and model prediction results
      data_key: text  # Field name storing Ground Truth, for OmniDocBench, text recognition results are stored in the text field, all block level annotations containing text field will participate in evaluation
    prediction:  # Model prediction configuration
      data_key: pred  # Field name storing model prediction results, this is user-defined
    category_type: text  # category_type is mainly used for selecting data preprocessing strategy, options: formula/text
For the dataset section, the input ground_truth data_path follows the same data format as OmniDocBench, with just a new custom field added under samples containing the text field to store the model's prediction results. The field storing prediction information is specified through the data_key under the prediction field in dataset, for example pred. The input format of the dataset can be referenced in OmniDocBench_demo_text_ocr, and the meanings of various fields can be found in the examples provided in the Formula Recognition Evaluation section.
Here is a model inference script for reference:
import os
import json
from PIL import Image


def poly2bbox(poly):
    """Convert an 8-value polygon (x, y per corner) to [left, upper, right, lower]."""
    L = poly[0]
    U = poly[1]
    R = poly[2]
    D = poly[5]
    L, R = min(L, R), max(L, R)
    U, D = min(U, D), max(U, D)
    bbox = [L, U, R, D]
    return bbox


# NOTE: this prompt asks for LaTeX; for plain-text OCR, use a text recognition prompt suited to your model
question = "<image>\nPlease convert this cropped image directly into latex."

with open('./demo_data/omnidocbench_demo/OmniDocBench_demo.json', 'r', encoding='utf-8') as f:
    samples = json.load(f)

for sample in samples:
    img_name = os.path.basename(sample['page_info']['image_path'])
    img_path = os.path.join('./Docparse/images', img_name)
    # Check that the image exists before trying to open it
    if not os.path.exists(img_path):
        print('No exist: ', img_name)
        continue
    img = Image.open(img_path)
    for i, anno in enumerate(sample['layout_dets']):
        if not anno.get('text'):  # Keep only annotations that contain a non-empty text field
            continue
        bbox = poly2bbox(anno['poly'])
        im = img.crop(bbox).convert('RGB')
        # `model` is a placeholder for your own loaded model; modify how the image is passed in according to the model
        response = model.chat(im, question)
        anno['pred'] = response  # Add a new field to store the model's inference result under the corresponding annotation

with open('./demo_data/recognition/OmniDocBench_demo_text_ocr.json', 'w', encoding='utf-8') as f:
    json.dump(samples, f, ensure_ascii=False)
OmniDocBench contains bounding box information for tables on each PDF page along with corresponding table recognition annotations, making it suitable as a benchmark for table recognition evaluation. The table annotations are available in both HTML and LaTeX formats, with this repository currently providing examples for HTML format evaluation.
Model Type | Model | Language: EN | Language: ZH | Language: Mixed | Table Frame Type: Full | Table Frame Type: Omission | Table Frame Type: Three | Table Frame Type: Zero | Special Situation: Merge Cell(+/-) | Special Situation: Formula(+/-) | Special Situation: Colorful(+/-) | Special Situation: Rotate(+/-) | Overall |
---|---|---|---|---|---|---|---|---|---|---|---|---|---|
OCR-based Models | PaddleOCR | 76.8 | 71.8 | 80.1 | 67.9 | 74.3 | 81.1 | 74.5 | 70.6/75.2 | 71.3/74.1 | 72.7/74.0 | 23.3/74.6 | 73.6 |
OCR-based Models | RapidTable | 80.0 | 83.2 | 91.2 | 83.0 | 79.7 | 83.4 | 78.4 | 77.1/85.4 | 76.7/83.9 | 77.6/84.9 | 25.2/83.7 | 82.5 |
Expert VLMs | StructEqTable | 72.0 | 72.6 | 81.7 | 68.8 | 64.3 | 80.7 | 85.0 | 65.1/76.8 | 69.4/73.5 | 66.8/75.7 | 44.1/73.3 | 72.7 |
Expert VLMs | GOT-OCR | 72.2 | 75.5 | 85.4 | 73.1 | 72.7 | 78.2 | 75.7 | 65.0/80.2 | 64.3/77.3 | 70.8/76.9 | 8.5/76.3 | 74.9 |
General VLMs | Qwen2-VL-7B | 70.2 | 70.7 | 82.4 | 70.2 | 62.8 | 74.5 | 80.3 | 60.8/76.5 | 63.8/72.6 | 71.4/70.8 | 20.0/72.1 | 71.0 |
General VLMs | InternVL2-8B | 70.9 | 71.5 | 77.4 | 69.5 | 69.2 | 74.8 | 75.8 | 58.7/78.4 | 62.4/73.6 | 68.2/73.1 | 20.4/72.6 | 71.5 |
Component-level Table Recognition evaluation on OmniDocBench table subset. (+/-) means with/without special situation.
Table recognition evaluation can be configured according to table_recognition.
For tables predicted in LaTeX format, the evaluation code automatically converts LaTeX to HTML with the latexml tool before evaluation, so users need to install latexml beforehand.
Field explanations for table_recognition.yaml
The configuration of table_recognition.yaml is as follows:
recogition_eval:  # Specify task name, common for all recognition-related tasks
  metrics:  # Configure metrics to use
    - TEDS  # Tree Edit Distance based Similarity
    - Edit_dist  # Normalized Edit Distance
  dataset:  # Dataset configuration
    dataset_name: omnidocbench_single_module_dataset  # Dataset name, no need to modify if following specified input format
    ground_truth:  # Configuration for ground truth dataset
      data_path: ./demo_data/recognition/OmniDocBench_demo_table.json  # JSON file containing both ground truth and model prediction results
      data_key: html  # Field name storing Ground Truth, for OmniDocBench, table recognition results are stored in html and latex fields, change to latex when evaluating latex format tables
      category_filter: table  # Category for evaluation, in table recognition, the category_name is table
    prediction:  # Configuration for model prediction results
      data_key: pred  # Field name storing model prediction results, this is user-defined
    category_type: table  # category_type is mainly used for data preprocessing strategy selection
For the dataset section, the data format in the ground_truth's data_path remains consistent with OmniDocBench, with only a custom field added under the corresponding table sample to store the model's prediction result. The field storing prediction information is specified through data_key under the prediction field in dataset, such as pred. For more details about OmniDocBench's file structure, please refer to the "Dataset Introduction" section. The input format for model results can be found in OmniDocBench_demo_table, which follows this format:
[{
"layout_dets": [ // List of page elements
{
"category_type": "table", // OmniDocBench category name
"poly": [ // OmniDocBench position info: x,y coordinates for top-left, top-right, bottom-right, bottom-left corners
136.0,
781.0,
340.0,
781.0,
340.0,
806.0,
136.0,
806.0
],
... // Other OmniDocBench fields
"latex": "$xxx$", // Table LaTeX annotation goes here
"html": "$xxx$", // Table HTML annotation goes here
"pred": "$xxx$", // !! Model prediction result stored here, user-defined new field at same level as ground truth
...
},
...
],
"page_info": {...}, // OmniDocBench page information
"extra": {...} // OmniDocBench annotation relationship information
},
...
]
Here is a model inference script for reference:
import os
import json
from PIL import Image


def poly2bbox(poly):
    """Convert an 8-value polygon (x, y per corner) to [left, upper, right, lower]."""
    L = poly[0]
    U = poly[1]
    R = poly[2]
    D = poly[5]
    L, R = min(L, R), max(L, R)
    U, D = min(U, D), max(U, D)
    bbox = [L, U, R, D]
    return bbox


question = "<image>\nPlease convert this cropped image directly into html format of table."

with open('./demo_data/omnidocbench_demo/OmniDocBench_demo.json', 'r', encoding='utf-8') as f:
    samples = json.load(f)

for sample in samples:
    img_name = os.path.basename(sample['page_info']['image_path'])
    img_path = os.path.join('./demo_data/omnidocbench_demo/images', img_name)
    # Check that the image exists before trying to open it
    if not os.path.exists(img_path):
        print('No exist: ', img_name)
        continue
    img = Image.open(img_path)
    for i, anno in enumerate(sample['layout_dets']):
        if anno['category_type'] != 'table':  # Keep only the table category for evaluation
            continue
        bbox = poly2bbox(anno['poly'])
        im = img.crop(bbox).convert('RGB')
        # `model` is a placeholder for your own loaded model; modify how the image is passed in depending on the model
        response = model.chat(im, question)
        anno['pred'] = response  # Add a new field to store the model's inference result at the same level as the ground truth

with open('./demo_data/recognition/OmniDocBench_demo_table.json', 'w', encoding='utf-8') as f:
    json.dump(samples, f, ensure_ascii=False)
OmniDocBench contains bounding box information for all document components on each PDF page, making it suitable as a benchmark for layout detection task evaluation.
Model | Book | Slides | Research Report | Textbook | Exam Paper | Magazine | Academic Literature | Notes | Newspaper | Average mAP |
---|---|---|---|---|---|---|---|---|---|---|
DiT-L | 43.44 | 13.72 | 45.85 | 15.45 | 3.40 | 29.23 | 66.13 | 0.21 | 23.65 | 26.90 |
LayoutLMv3 | 42.12 | 13.63 | 43.22 | 21.00 | 5.48 | 31.81 | 64.66 | 0.80 | 30.84 | 28.84 |
DOCX-Chain | 30.86 | 11.71 | 39.62 | 19.23 | 10.67 | 23.00 | 41.60 | 1.80 | 16.96 | 21.27 |
DocLayout-YOLO | 43.71 | 48.71 | 72.83 | 42.67 | 35.40 | 51.44 | 66.84 | 9.54 | 57.54 | 48.71 |
For the layout detection config file, refer to layout_detection; for the prediction data format, refer to detection_prediction.
The field explanation of layout_detection.yaml
Here is the configuration file for layout_detection.yaml
:
detection_eval:   # Specify task name, common for all detection-related tasks
  metrics:
    - COCODet     # Detection task related metrics, mainly mAP, mAR etc.
  dataset:
    dataset_name: detection_dataset_simple_format       # Dataset name, no need to modify if following specified input format
    ground_truth:
      data_path: ./demo_data/omnidocbench_demo/OmniDocBench_demo.json               # Path to OmniDocBench JSON file
    prediction:
      data_path: ./demo_data/detection/detection_prediction.json                    # Path to model prediction result JSON file
    filter:                                             # Page level filtering
      data_source: exam_paper                           # Page attributes and corresponding tags to be evaluated
  categories:
    eval_cat:                # Categories participating in final evaluation
      block_level:           # Block level categories, see OmniDocBench evaluation set introduction for details
        - title              # Title
        - text               # Text
        - abandon            # Includes headers, footers, page numbers, and page annotations
        - figure             # Image
        - figure_caption     # Image caption
        - table              # Table
        - table_caption      # Table caption
        - table_footnote     # Table footnote
        - isolate_formula    # Display formula (this is a layout display formula, lower priority than 14)
        - formula_caption    # Display formula label
    gt_cat_mapping:          # Mapping table from ground truth to final evaluation categories, key is ground truth category, value is final evaluation category name
      figure_footnote: figure_footnote
      figure_caption: figure_caption
      page_number: abandon
      header: abandon
      page_footnote: abandon
      table_footnote: table_footnote
      code_txt: figure
      equation_caption: formula_caption
      equation_isolated: isolate_formula
      table: table
      refernece: text
      table_caption: table_caption
      figure: figure
      title: title
      text_block: text
      footer: abandon
    pred_cat_mapping:        # Mapping table from prediction to final evaluation categories, key is prediction category, value is final evaluation category name
      title: title
      plain text: text
      abandon: abandon
      figure: figure
      figure_caption: figure_caption
      table: table
      table_caption: table_caption
      table_footnote: table_footnote
      isolate_formula: isolate_formula
      formula_caption: formula_caption
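To make the role of the two mapping tables concrete, here is a minimal Python sketch of how a raw label could be translated into an evaluation category. It is an illustration only, not the benchmark's evaluation code: the names `map_category` and `remap` are hypothetical, only a subset of the mapping is reproduced, and dropping labels that are unmapped or that map to a category outside eval_cat (such as figure_footnote here) is an assumption about the intended behavior.

```python
from typing import Dict, List, Optional, Set

# Subset of gt_cat_mapping from the config, for illustration.
GT_CAT_MAPPING: Dict[str, str] = {
    "text_block": "text",
    "header": "abandon",
    "footer": "abandon",
    "page_number": "abandon",
    "code_txt": "figure",
    "table": "table",
    "figure_footnote": "figure_footnote",
}
# Subset of eval_cat.block_level from the config.
EVAL_CAT: Set[str] = {"title", "text", "abandon", "figure", "table"}


def map_category(raw: str, mapping: Dict[str, str], eval_cat: Set[str]) -> Optional[str]:
    """Return the evaluation category for a raw label, or None if it is not evaluated."""
    mapped = mapping.get(raw)
    return mapped if mapped in eval_cat else None


def remap(annotations: List[dict], mapping: Dict[str, str], eval_cat: Set[str]) -> List[dict]:
    """Rewrite category_type to the evaluation category.

    Assumption: annotations without an evaluated category are dropped.
    """
    remapped = []
    for anno in annotations:
        cat = map_category(anno["category_type"], mapping, eval_cat)
        if cat is not None:
            remapped.append({**anno, "category_type": cat})
    return remapped
```

Applying the same function with pred_cat_mapping puts predictions and ground truth into one shared label space, which is what lets models with different label sets be compared.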
The filter field can be used to filter the dataset. For example, setting the filter field under dataset to data_source: exam_paper will filter for pages with data type exam_paper. For more page attributes, please refer to the "Evaluation Set Introduction" section. If you want to evaluate the full dataset, comment out the filter-related fields.
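The effect of page-level filtering can be pictured with the short Python sketch below. It is not taken from the evaluation code: `filter_pages` is a hypothetical helper, and it assumes that page attributes are stored under page_info -> page_attribute in the OmniDocBench JSON, which should be checked against the "Evaluation Set Introduction" section.

```python
from typing import Dict, List


def filter_pages(samples: List[dict], page_filter: Dict[str, str]) -> List[dict]:
    """Keep pages whose page attributes match every key/value pair in page_filter."""
    kept = []
    for sample in samples:
        # Assumption: page attributes live under page_info -> page_attribute.
        attrs = sample.get("page_info", {}).get("page_attribute", {})
        if all(attrs.get(key) == value for key, value in page_filter.items()):
            kept.append(sample)
    return kept
```

With an empty filter every page is kept, which mirrors commenting out the filter fields to evaluate the full dataset.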
The data_path under the prediction section in dataset takes the model's prediction as input, with the following data format:
{
    "results": [
        {
            "image_name": "docstructbench_llm-raw-scihub-o.O-adsc.201190003.pdf_6",   // image name
            "bbox": [53.892921447753906, 909.8675537109375, 808.5555419921875, 1006.2714233398438],   // bounding box coordinates, representing x,y coordinates of top-left and bottom-right corners
            "category_id": 1,   // category ID number
            "score": 0.9446213841438293   // confidence score
        },
        ...   // all bounding boxes are flattened in a single list
    ],
    "categories": {"0": "title", "1": "plain text", "2": "abandon", ...}   // mapping between category IDs and category names
}
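As a rough illustration of producing a file in this format, the Python sketch below flattens a list of detections into the results / categories structure. The helper name `build_prediction` and the input tuple layout are hypothetical, and assigning category IDs in order of first appearance is an assumption; any ID scheme should work as long as the categories mapping in the same file matches.

```python
import json
from typing import Dict, List, Sequence, Tuple

# Hypothetical input layout: (image_name, [x1, y1, x2, y2], label, score)
Detection = Tuple[str, Sequence[float], str, float]


def build_prediction(detections: List[Detection]) -> dict:
    """Flatten detections into the prediction format described in this section."""
    label_to_id: Dict[str, int] = {}
    results = []
    for image_name, bbox, label, score in detections:
        x1, y1, x2, y2 = (float(v) for v in bbox)
        if label not in label_to_id:
            # Assumption: IDs are assigned in order of first appearance.
            label_to_id[label] = len(label_to_id)
        results.append({
            "image_name": image_name,
            # Normalize so the box is [left, top, right, bottom].
            "bbox": [min(x1, x2), min(y1, y2), max(x1, x2), max(y1, y2)],
            "category_id": label_to_id[label],
            "score": float(score),
        })
    categories = {str(idx): label for label, idx in label_to_id.items()}
    return {"results": results, "categories": categories}


if __name__ == "__main__":
    demo = build_prediction([("page_1", [10, 20, 110, 60], "title", 0.98)])
    print(json.dumps(demo, ensure_ascii=False, indent=2))
```

The returned dictionary can be written with json.dump to the path configured as prediction.data_path.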
OmniDocBench contains bounding box information for each formula on each PDF page, making it suitable as a benchmark for formula detection task evaluation.
The format for formula detection is essentially the same as layout detection. Formulas include both inline and display formulas. In this section, we provide a config example that can evaluate detection results for both display formulas and inline formulas simultaneously. Formula detection can be configured according to formula_detection.
The fields of formula_detection.yaml are explained in the comments of the configuration file below:
detection_eval:   # Specify task name, common for all detection-related tasks
  metrics:
    - COCODet     # Detection task related metrics, mainly mAP, mAR etc.
  dataset:
    dataset_name: detection_dataset_simple_format       # Dataset name, no need to modify if following specified input format
    ground_truth:
      data_path: ./demo_data/omnidocbench_demo/OmniDocBench_demo.json               # Path to OmniDocBench JSON file
    prediction:
      data_path: ./demo_data/detection/detection_prediction.json                    # Path to model prediction JSON file
    filter:                                             # Page-level filtering
      data_source: exam_paper                           # Page attributes and corresponding tags to evaluate
  categories:
    eval_cat:                # Categories participating in final evaluation
      block_level:           # Block level categories, see OmniDocBench dataset intro for details
        - isolate_formula    # Display formula
      span_level:            # Span level categories, see OmniDocBench dataset intro for details
        - inline_formula     # Inline formula
    gt_cat_mapping:          # Mapping table from ground truth to final evaluation categories, key is ground truth category, value is final evaluation category name
      equation_isolated: isolate_formula
      equation_inline: inline_formula
    pred_cat_mapping:        # Mapping table from prediction to final evaluation categories, key is prediction category, value is final evaluation category name
      interline_formula: isolate_formula
      inline_formula: inline_formula
Please refer to the Layout Detection section for parameter explanations and dataset format. The main difference between formula detection and layout detection is that under eval_cat, which lists the categories participating in the final evaluation, a span_level category inline_formula has been added. Both span_level and block_level categories participate together in the evaluation.
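The following Python sketch shows one way block-level and span-level formula boxes could be gathered from a ground-truth page into a single list. It is a sketch under assumptions rather than the evaluation code: `collect_formula_boxes` and `poly_to_bbox` are hypothetical helpers, span-level annotations are assumed to sit in a line_with_spans list inside each layout_dets entry, and poly is assumed to be a flat list of x,y corner coordinates.

```python
from typing import Dict, List, Sequence

# Ground-truth categories mapped to evaluation categories, as in formula_detection.yaml.
FORMULA_MAPPING: Dict[str, str] = {
    "equation_isolated": "isolate_formula",
    "equation_inline": "inline_formula",
}


def poly_to_bbox(poly: Sequence[float]) -> List[float]:
    """Convert a flat [x1, y1, x2, y2, ...] polygon to [left, top, right, bottom]."""
    xs, ys = poly[0::2], poly[1::2]
    return [min(xs), min(ys), max(xs), max(ys)]


def collect_formula_boxes(sample: dict, mapping: Dict[str, str] = FORMULA_MAPPING) -> List[dict]:
    """Gather display (block-level) and inline (span-level) formula boxes for one page."""
    boxes = []
    for block in sample.get("layout_dets", []):
        # Assumption: spans are nested under each block in "line_with_spans".
        candidates = [block] + list(block.get("line_with_spans", []))
        for item in candidates:
            category = mapping.get(item.get("category_type"))
            if category is not None and "poly" in item:
                boxes.append({"category": category, "bbox": poly_to_bbox(item["poly"])})
    return boxes
```

Both kinds of boxes end up in one flat list, so a single COCO-style detection evaluation can score display and inline formulas together.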
We provide several tools in the tools directory:
- json2md for converting OmniDocBench from JSON format to Markdown format;
- visualization for visualizing OmniDocBench JSON files;
- The model_infer folder provides some model inference scripts for reference, including:
  - mathpix_img2md.py for calling the Mathpix API to convert images to Markdown format;
  - internvl2_test_img2md.py for using the InternVL2 model to convert images to Markdown format, please use after configuring the InternVL2 model environment;
  - GOT_img2md.py for using the GOT-OCR model to convert images to Markdown format, please use after configuring the GOT-OCR model environment;
  - Qwen2VL_img2md.py for using the QwenVL model to convert images to Markdown format, please use after configuring the QwenVL model environment.
- Integration of match_full algorithm
- Optimization of matching post-processing for model-specific output formats
- Addition of Unicode mapping table for special characters
- Some models occasionally produce non-standard output formats (e.g., recognizing multi-column text as tables, or formulas as Unicode text), leading to matching failures. This can be optimized through post-processing of model output formats
- Due to varying symbol recognition capabilities across different models, some symbols are recognized inconsistently (e.g., list identifiers). Currently, only Chinese and English text are included in text evaluation. A Unicode mapping table will be added later for optimization
We welcome everyone to use the OmniDocBench dataset and provide valuable feedback and suggestions to help us continuously improve the dataset quality and evaluation tools. For any comments or suggestions, please feel free to open an issue and we will respond promptly. If you have evaluation scheme optimizations, you can submit a PR and we will review and update in a timely manner.
- Thanks to Abaka AI for supporting the dataset annotation.
- PubTabNet for TEDS metric calculation
- latexml for LaTeX to HTML conversion
- Tester for Markdown table to HTML conversion
The PDFs are collected from public online channels and community user contributions. Content that is not allowed for distribution has been removed. The dataset is for research purposes only and not for commercial use. If there are any copyright concerns, please contact [email protected].
@misc{ouyang2024omnidocbenchbenchmarkingdiversepdf,
title={OmniDocBench: Benchmarking Diverse PDF Document Parsing with Comprehensive Annotations},
author={Linke Ouyang and Yuan Qu and Hongbin Zhou and Jiawei Zhu and Rui Zhang and Qunshu Lin and Bin Wang and Zhiyuan Zhao and Man Jiang and Xiaomeng Zhao and Jin Shi and Fan Wu and Pei Chu and Minghao Liu and Zhenxiang Li and Chao Xu and Bo Zhang and Botian Shi and Zhongying Tu and Conghui He},
year={2024},
eprint={2412.07626},
archivePrefix={arXiv},
primaryClass={cs.CV},
url={https://arxiv.org/abs/2412.07626},
}