The Docling Parser is an advanced PDF document parser based on IBM's docling document processing pipeline. As of 3.0.0-alpha1, it is the primary parser for PDF documents in OpenContracts.
Perhaps its coolest feature, besides its support for multiple OCR engines and numerous formats, is its ability to organize document features into groups. We've found this particularly useful for contract layouts, so we set up our Docling integration to import these groups as OpenContracts "relationships", which, if you're not familiar, map N source annotations to N target annotations. In the case of the Docling parser, these look like this:
sequenceDiagram
participant U as User
participant D as DoclingParser
participant DC as DocumentConverter
participant OCR as Tesseract OCR
participant DB as Database
U->>D: parse_document(user_id, doc_id)
D->>DB: Load document
D->>DC: Convert PDF
alt PDF needs OCR
DC->>OCR: Process PDF
OCR-->>DC: OCR results
end
DC-->>D: DoclingDocument
D->>D: Process structure
D->>D: Generate PAWLS tokens
D->>D: Build relationships
D-->>U: OpenContractDocExport
- Intelligent OCR: Automatically detects when OCR is needed
- Hierarchical Structure: Extracts document structure (headings, paragraphs, lists)
- Token-based Annotations: Creates precise token-level annotations
- Relationship Detection: Builds relationships between document elements
- PAWLS Integration: Generates PAWLS-compatible token data
The Docling Parser requires model files to be present in the path specified by DOCLING_MODELS_PATH in your settings:
DOCLING_MODELS_PATH = env.str("DOCLING_MODELS_PATH", default="/models/docling")
Note: this should be moved to settings.PARSER_KWARGS in the near future.
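A minimal sketch of a startup check for this setting, assuming only that the path must point at an existing, non-empty directory. The helper name `check_docling_models_path` is hypothetical and not part of OpenContracts; the first error message mirrors the one listed under the common issues.

```python
from pathlib import Path


def check_docling_models_path(models_path: str) -> Path:
    """Return the models directory as a Path, or raise if it is unusable.

    Hypothetical helper for illustration; not part of OpenContracts.
    """
    path = Path(models_path)
    if not path.is_dir():
        raise FileNotFoundError(f"Docling models path does not exist: {path}")
    # Assumption: an empty directory likely means the models were never downloaded.
    if not any(path.iterdir()):
        raise FileNotFoundError(f"Docling models path is empty: {path}")
    return path
```

Calling it once at startup, for example with `settings.DOCLING_MODELS_PATH`, surfaces a misconfigured path before the first document is parsed.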
Basic usage:
from opencontractserver.pipeline.parsers.docling_parser import DoclingParser
parser = DoclingParser()
result = parser.parse_document(user_id=1, doc_id=123)
With options:
result = parser.parse_document(
    user_id=1,
    doc_id=123,
    force_ocr=True,  # Force OCR processing
    roll_up_groups=True,  # Combine related items into groups
)
The parser expects:
- A PDF document stored in Django's storage system
- A valid user ID and document ID
- Optional configuration parameters
The parser returns an OpenContractDocExport dictionary containing:
{
    "title": str,                      # Extracted document title
    "description": str,                # Generated description
    "content": str,                    # Full text content
    "page_count": int,                 # Number of pages
    "pawls_file_content": List[dict],  # PAWLS token data
    "labelled_text": List[dict],       # Structural annotations
    "relationships": List[dict],       # Relationships between annotations
}
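The shape above can be written down as a `TypedDict` for static checking. This is a sketch that mirrors only the keys listed here; the real `OpenContractDocExport` type in OpenContracts may define more fields or stricter element types, and both `DocExportSketch` and `missing_export_keys` are hypothetical names.

```python
from typing import List, TypedDict


class DocExportSketch(TypedDict):
    """Illustrative mirror of the documented keys; not the real export type."""

    title: str
    description: str
    content: str
    page_count: int
    pawls_file_content: List[dict]
    labelled_text: List[dict]
    relationships: List[dict]


def missing_export_keys(result: dict) -> List[str]:
    """Return the documented keys that are absent from a parser result."""
    return sorted(set(DocExportSketch.__annotations__) - set(result))
```

An empty list from `missing_export_keys(result)` means every documented key is present; it says nothing about whether the values are well-formed.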
- Document Loading
  - Loads PDF from storage
  - Creates DocumentStream for processing
- Conversion
  - Converts PDF using Docling's DocumentConverter
  - Applies OCR if needed
  - Extracts document structure
- Token Generation
  - Creates PAWLS-compatible tokens
  - Builds spatial indices for token lookup
  - Transforms coordinates to screen space
- Annotation Creation
  - Converts Docling items to annotations
  - Assigns hierarchical relationships
  - Creates group relationships
- Metadata Extraction
  - Extracts document title
  - Generates description
  - Counts pages
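The coordinate step in token generation can be illustrated in isolation. PDF user space places the origin at the bottom-left of the page, while screen-oriented token formats such as PAWLS typically measure from the top-left, so the y-axis has to be flipped. The sketch below assumes a bounding box given as `(left, bottom, right, top)` in PDF points and returns PAWLS-style `x`, `y`, `width`, `height` values. The function name and tuple layout are illustrative, not the parser's actual internals.

```python
from typing import Dict, Tuple


def pdf_bbox_to_screen(
    bbox: Tuple[float, float, float, float], page_height: float
) -> Dict[str, float]:
    """Convert a bottom-left-origin box (l, b, r, t) to a top-left-origin box."""
    left, bottom, right, top = bbox
    return {
        "x": left,
        # Distance from the top edge of the page to the top edge of the box.
        "y": page_height - top,
        "width": right - left,
        "height": top - bottom,
    }


# A 100 x 20 pt box on a page that is 792 pt tall (US Letter).
token = pdf_bbox_to_screen((72.0, 680.0, 172.0, 700.0), page_height=792.0)
```

Only the vertical coordinate changes meaning; `x`, `width`, and `height` are the same in both coordinate systems.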
The parser can use Tesseract OCR when needed:
# Force OCR processing
result = parser.parse_document(user_id=1, doc_id=123, force_ocr=True)
Enable group relationship detection:
# Enable group rollup
result = parser.parse_document(user_id=1, doc_id=123, roll_up_groups=True)
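How the parser decides which items belong to a group is not described here, so the following is only a sketch of the resulting data shape: one relationship per parent item, with the parent as the source and its children as the targets. The key names `source_annotation_ids` and `target_annotation_ids`, the `(annotation_id, parent_id)` input format, and the function itself are assumptions made for illustration.

```python
from collections import defaultdict
from typing import Dict, List, Optional, Tuple


def build_group_relationships(
    items: List[Tuple[str, Optional[str]]],
) -> List[Dict[str, List[str]]]:
    """Turn (annotation_id, parent_id) pairs into one relationship per parent."""
    children: Dict[str, List[str]] = defaultdict(list)
    for annotation_id, parent_id in items:
        if parent_id is not None:
            children[parent_id].append(annotation_id)
    return [
        {
            # Hypothetical key names, chosen for this sketch.
            "source_annotation_ids": [parent_id],
            "target_annotation_ids": child_ids,
        }
        for parent_id, child_ids in children.items()
    ]


rels = build_group_relationships(
    [("h1", None), ("p1", "h1"), ("p2", "h1"), ("h2", None), ("li1", "h2")]
)
```

Here two headings produce two relationships: `h1` points at `p1` and `p2`, and `h2` points at `li1`.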
The parser uses Shapely for spatial operations:
- Creates STRtrees for efficient spatial queries
- Handles coordinate transformations
- Manages token-annotation mapping
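To make the token-annotation mapping concrete, the sketch below finds every token whose box intersects an annotation's box. It deliberately uses a brute-force scan over plain tuples rather than Shapely, so it shows the logic only; an STRtree answers the same question without testing every token. All names here are illustrative.

```python
from typing import List, Tuple

Box = Tuple[float, float, float, float]  # (x_min, y_min, x_max, y_max)


def boxes_intersect(a: Box, b: Box) -> bool:
    """True when two axis-aligned boxes overlap or touch."""
    return a[0] <= b[2] and b[0] <= a[2] and a[1] <= b[3] and b[1] <= a[3]


def tokens_in_annotation(tokens: List[Box], annotation: Box) -> List[int]:
    """Return indices of tokens that intersect the annotation box."""
    return [i for i, tok in enumerate(tokens) if boxes_intersect(tok, annotation)]


# Two tokens on the first text line, one further down the page.
tokens = [(10, 10, 40, 20), (45, 10, 80, 20), (10, 60, 40, 70)]
hits = tokens_in_annotation(tokens, annotation=(0, 5, 100, 25))
```

The scan costs one comparison per token per annotation, which is the cost a spatial index is meant to avoid on pages with many tokens.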
The parser includes robust error handling:
- Validates model file presence
- Checks conversion status
- Handles OCR failures
- Manages coordinate transformation errors
Required Python packages:
- docling: Core document processing
- pytesseract: OCR support
- pdf2image: PDF rendering
- shapely: Spatial operations
- numpy: Numerical operations
- OCR processing can be time-intensive
- Large documents may require significant memory
- Spatial indexing improves token lookup performance
- Group relationship processing may impact performance with roll_up_groups=True
- OCR Usage
  - Let the parser auto-detect OCR needs
  - Only use force_ocr=True when necessary
- Group Relationships
  - Start with roll_up_groups=False
  - Enable if hierarchical grouping is needed
- Error Handling
  - Always check return values
  - Monitor logs for conversion issues
- Memory Management
  - Process large documents in batches
  - Monitor memory usage with large PDFs
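One way to combine the error-handling and memory advice is to parse documents in fixed-size batches, yield each batch's results so the caller can store and release them, and record failures instead of stopping at the first one. The sketch takes the parse function as an argument, so it assumes nothing about how the parser reports errors; `chunked` and `parse_in_batches` are hypothetical helpers.

```python
import logging
from typing import Callable, Dict, Iterable, Iterator, List, Optional

logger = logging.getLogger(__name__)


def chunked(ids: Iterable[int], size: int) -> Iterator[List[int]]:
    """Yield successive lists of at most `size` ids."""
    batch: List[int] = []
    for doc_id in ids:
        batch.append(doc_id)
        if len(batch) == size:
            yield batch
            batch = []
    if batch:
        yield batch


def parse_in_batches(
    parse: Callable[[int], Optional[dict]],
    doc_ids: Iterable[int],
    batch_size: int = 10,
) -> Iterator[Dict[int, Optional[dict]]]:
    """Yield one {doc_id: result} dict per batch; failed documents map to None."""
    for batch in chunked(doc_ids, batch_size):
        results: Dict[int, Optional[dict]] = {}
        for doc_id in batch:
            try:
                results[doc_id] = parse(doc_id)
            except Exception:
                # Log the traceback and keep going with the rest of the batch.
                logger.exception("Parsing failed for document %s", doc_id)
                results[doc_id] = None
        yield results
```

With a parser instance, `parse` could be `lambda doc_id: parser.parse_document(user_id=1, doc_id=doc_id)`. Because results are yielded per batch, only one batch of exports needs to be held in memory at a time.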
Common issues and solutions:
- Missing Models
  FileNotFoundError: Docling models path does not exist
  - Verify DOCLING_MODELS_PATH setting
  - Check model file permissions
- OCR Failures
  Error: Tesseract not found
  - Install Tesseract OCR
  - Verify system PATH
- Memory Issues
  MemoryError during processing
  - Reduce concurrent processing
  - Increase system memory
  - Process smaller batches
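The "Tesseract not found" case can be checked ahead of time with the standard library. The sketch below uses `shutil.which` to look for the `tesseract` executable on the system PATH; the function name is hypothetical, and the `which` parameter exists only so the lookup can be replaced in tests.

```python
import shutil
from typing import Callable, Optional


def find_tesseract(
    which: Callable[[str], Optional[str]] = shutil.which,
) -> Optional[str]:
    """Return the path of the tesseract executable, or None if it is not on PATH."""
    return which("tesseract")
```

A `None` result means OCR will not be available to the current process, even if Tesseract is installed somewhere outside PATH.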