Docling Parser

Intro

The Docling Parser is an advanced PDF document parser based on IBM's docling document processing pipeline. As of 3.0.0-alpha1, it is the primary parser for PDF documents in OpenContracts.

Perhaps its coolest feature, besides its support for multiple cutting-edge OCR engines and numerous formats, is its ability to cluster related document features into groups. We've found this particularly useful for contract layouts, so we set up our Docling integration to import these groups as OpenContracts "relationships", which, if you're not familiar, map N source annotations to N target annotations. In the case of the Docling parser, these look like this (AWESOME):

[Figure: Docling Group Relationships]

Overview

sequenceDiagram
    participant U as User
    participant D as DoclingParser
    participant DC as DocumentConverter
    participant OCR as Tesseract OCR
    participant DB as Database

    U->>D: parse_document(user_id, doc_id)
    D->>DB: Load document
    D->>DC: Convert PDF

    alt PDF needs OCR
        DC->>OCR: Process PDF
        OCR-->>DC: OCR results
    end

    DC-->>D: DoclingDocument
    D->>D: Process structure
    D->>D: Generate PAWLS tokens
    D->>D: Build relationships
    D-->>U: OpenContractDocExport

Features

  • Intelligent OCR: Automatically detects when OCR is needed
  • Hierarchical Structure: Extracts document structure (headings, paragraphs, lists)
  • Token-based Annotations: Creates precise token-level annotations
  • Relationship Detection: Builds relationships between document elements
  • PAWLS Integration: Generates PAWLS-compatible token data

Configuration

The Docling Parser requires model files to be present in the path specified by DOCLING_MODELS_PATH in your settings:

DOCLING_MODELS_PATH = env.str("DOCLING_MODELS_PATH", default="/models/docling")

NOTE: This should be moved to settings.PARSER_KWARGS in the near future.
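Because the parser depends on these files, it can help to check the directory before the first parse. The following is a minimal sketch, assuming only that the setting holds a directory path; `check_models_path` is a hypothetical helper for illustration and not part of OpenContracts.

```python
from pathlib import Path


def check_models_path(models_path: str) -> Path:
    """Return the models directory as a Path, or raise if it is unusable.

    Hypothetical helper for illustration; not part of OpenContracts.
    """
    path = Path(models_path)
    if not path.is_dir():
        raise FileNotFoundError(f"Docling models path does not exist: {path}")
    # An existing but empty directory is almost certainly a misconfiguration.
    if not any(path.iterdir()):
        raise FileNotFoundError(f"Docling models path is empty: {path}")
    return path
```

Calling it with the configured value, for example `check_models_path(settings.DOCLING_MODELS_PATH)`, would surface a bad path at startup rather than in the middle of a parse.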

Usage

Basic usage:

from opencontractserver.pipeline.parsers.docling_parser import DoclingParser

parser = DoclingParser()
result = parser.parse_document(user_id=1, doc_id=123)

With options:

result = parser.parse_document(
    user_id=1,
    doc_id=123,
    force_ocr=True,  # Force OCR processing
    roll_up_groups=True,  # Combine related items into groups
)

Input

The parser expects:

  • A PDF document stored in Django's storage system
  • A valid user ID and document ID
  • Optional configuration parameters

Output

The parser returns an OpenContractDocExport dictionary containing:

{
    "title": str,  # Extracted document title
    "description": str,  # Generated description
    "content": str,  # Full text content
    "page_count": int,  # Number of pages
    "pawls_file_content": List[dict],  # PAWLS token data
    "labelled_text": List[dict],  # Structural annotations
    "relationships": List[dict],  # Relationships between annotations
}
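When consuming this dictionary downstream, a quick shape check can catch surprises early. The sketch below assumes only the keys and types listed above; `EXPECTED_FIELDS` and `find_export_problems` are hypothetical names introduced here, not part of the OpenContracts API.

```python
from typing import Any, Dict, List

# Keys and Python types taken from the output description above.
EXPECTED_FIELDS = {
    "title": str,
    "description": str,
    "content": str,
    "page_count": int,
    "pawls_file_content": list,
    "labelled_text": list,
    "relationships": list,
}


def find_export_problems(export: Dict[str, Any]) -> List[str]:
    """Return human-readable problems; an empty list means the shape looks right."""
    problems = []
    for key, expected_type in EXPECTED_FIELDS.items():
        if key not in export:
            problems.append(f"missing key: {key}")
        elif not isinstance(export[key], expected_type):
            problems.append(f"{key} should be {expected_type.__name__}")
    return problems
```

This checks only top-level keys and types; it says nothing about whether the tokens or annotations inside are correct.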

Processing Steps

  1. Document Loading

    • Loads PDF from storage
    • Creates DocumentStream for processing
  2. Conversion

    • Converts PDF using Docling's DocumentConverter
    • Applies OCR if needed
    • Extracts document structure
  3. Token Generation

    • Creates PAWLS-compatible tokens
    • Builds spatial indices for token lookup
    • Transforms coordinates to screen space
  4. Annotation Creation

    • Converts Docling items to annotations
    • Assigns hierarchical relationships
    • Creates group relationships
  5. Metadata Extraction

    • Extracts document title
    • Generates description
    • Counts pages
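The coordinate transformation in step 3 can be illustrated with a small sketch. It assumes the source boxes use a bottom-left origin (as PDF coordinates commonly do) and that the target is a top-left-origin box expressed as x, y, width and height. The function name and dictionary keys are illustrative, not the parser's actual code.

```python
from typing import Dict


def to_screen_space(
    left: float, bottom: float, right: float, top: float, page_height: float
) -> Dict[str, float]:
    """Convert a bottom-left-origin box to a top-left-origin x/y/width/height box.

    Illustrative only: the origin conventions here are assumptions.
    """
    return {
        "x": left,
        # Flip the vertical axis: distance from the top of the page.
        "y": page_height - top,
        "width": right - left,
        "height": top - bottom,
    }
```

For a box spanning 10 to 60 horizontally and 700 to 720 vertically on a 792-unit-tall page, `to_screen_space(10, 700, 60, 720, 792)` gives `{"x": 10, "y": 72, "width": 50, "height": 20}`.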

Advanced Features

OCR Processing

The parser can use Tesseract OCR when needed:

# Force OCR processing
result = parser.parse_document(user_id=1, doc_id=123, force_ocr=True)

Group Relationships

Enable group relationship detection:

# Enable group rollup
result = parser.parse_document(user_id=1, doc_id=123, roll_up_groups=True)

Spatial Processing

The parser uses Shapely for spatial operations:

  • Creates STRtrees for efficient spatial queries
  • Handles coordinate transformations
  • Manages token-annotation mapping
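The token-annotation mapping above comes down to asking which token boxes overlap an annotation's box. The sketch below answers that with a plain linear scan in pure Python. It is a deliberately simple stand-in, not the parser's Shapely STRtree code, and the `Box` layout and function names are assumptions made for illustration.

```python
from typing import List, Tuple

# Assumed layout for this sketch: (min_x, min_y, max_x, max_y).
Box = Tuple[float, float, float, float]


def boxes_intersect(a: Box, b: Box) -> bool:
    """True when two axis-aligned boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def tokens_in_annotation(token_boxes: List[Box], annotation_box: Box) -> List[int]:
    """Return the indices of tokens whose boxes intersect the annotation box."""
    return [
        index
        for index, box in enumerate(token_boxes)
        if boxes_intersect(box, annotation_box)
    ]
```

A spatial index such as an STRtree exists to avoid testing every token against every annotation, which matters on pages with thousands of tokens.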

Error Handling

The parser guards against several failure modes:

  • Validates model file presence
  • Checks conversion status
  • Handles OCR failures
  • Manages coordinate transformation errors
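On the calling side, one defensive pattern is to wrap the parse call so that a single bad document does not abort a larger job. The sketch below assumes failures surface either as an exception or as an empty return value; that is an assumption, not documented behaviour. `parse_or_none` is a hypothetical wrapper, not part of OpenContracts.

```python
import logging
from typing import Any, Callable, Optional

logger = logging.getLogger(__name__)


def parse_or_none(parse_fn: Callable[..., Any], **kwargs: Any) -> Optional[Any]:
    """Call parse_fn(**kwargs); log and return None on an exception or empty result."""
    try:
        result = parse_fn(**kwargs)
    except Exception:
        # logger.exception records the traceback for later diagnosis.
        logger.exception("Parsing failed for %s", kwargs)
        return None
    if not result:
        logger.warning("Parser returned no result for %s", kwargs)
        return None
    return result
```

With the parser from the usage section this would be called as `parse_or_none(parser.parse_document, user_id=1, doc_id=123)`.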

Dependencies

Required Python packages:

  • docling: Core document processing
  • pytesseract: OCR support
  • pdf2image: PDF rendering
  • shapely: Spatial operations
  • numpy: Numerical operations

Performance Considerations

  • OCR processing can be time-intensive
  • Large documents may require significant memory
  • Spatial indexing improves token lookup performance
  • Setting roll_up_groups=True adds group relationship processing, which may slow parsing

Best Practices

  1. OCR Usage

    • Let the parser auto-detect OCR needs
    • Only use force_ocr=True when necessary
  2. Group Relationships

    • Start with roll_up_groups=False
    • Enable if hierarchical grouping is needed
  3. Error Handling

    • Always check return values
    • Monitor logs for conversion issues
  4. Memory Management

    • Process large documents in batches
    • Monitor memory usage with large PDFs
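For the batching advice above, a small generator is often enough. This is a generic sketch and not OpenContracts code; `batched` is a hypothetical helper and the batch size is something to tune for your own documents.

```python
from typing import Iterator, List, Sequence


def batched(doc_ids: Sequence[int], batch_size: int) -> Iterator[List[int]]:
    """Yield consecutive slices of doc_ids, each at most batch_size long."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    for start in range(0, len(doc_ids), batch_size):
        yield list(doc_ids[start:start + batch_size])
```

Each batch can then be parsed and its results persisted before the next one starts, which keeps fewer large exports in memory at once.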

Troubleshooting

Common issues and solutions:

  1. Missing Models

    FileNotFoundError: Docling models path does not exist
    
    • Verify DOCLING_MODELS_PATH setting
    • Check model file permissions
  2. OCR Failures

    Error: Tesseract not found
    
    • Install Tesseract OCR
    • Verify system PATH
  3. Memory Issues

    MemoryError during processing
    
    • Reduce concurrent processing
    • Increase system memory
    • Process smaller batches
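For the "Tesseract not found" case, the PATH check can be scripted with the standard library. This sketch assumes the executable is named `tesseract`; the name can differ by platform or install method, and `find_executable` is a hypothetical helper.

```python
import shutil
from typing import Optional


def find_executable(name: str = "tesseract") -> Optional[str]:
    """Return the full path of `name` if it is on the system PATH, else None."""
    return shutil.which(name)
```

A `None` result means the binary is not visible to the current process, which usually points to a missing install or a PATH that differs between your shell and the worker environment.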