Abstract: Document layout analysis is typically formulated as an object detection task. However, most existing approaches are adapted from general-purpose detection frameworks and overlook the fundamental structural differences between document images and natural images. Because documents are designed to be read by humans, document images are two-dimensional and free from occlusion. Based on this observation, we propose DEtection ENcoder (DEEN), which reformulates document layout analysis as a graph connectivity prediction task, thereby eliminating the need for both Non-Maximum Suppression (NMS) and confidence thresholding in post-processing. To model high-resolution feature maps efficiently, DEEN combines global sparse attention with local dense attention, yielding a unified representation of the overall layout and its fine-grained details. Since DEEN does not rely on confidence scores, we evaluate it under two settings: one that favors confidence-based models, and another that simulates real-world usage scenarios. DEEN achieves competitive performance on three structurally diverse datasets, demonstrating strong generalization.
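To illustrate why a connectivity formulation can avoid NMS and confidence thresholds, here is a minimal sketch. It is not the DEEN implementation: the abstract does not specify how the graph is built, so the grid-cell nodes, the `fg`, `right`, and `down` inputs, and the function name `boxes_from_connectivity` are all assumptions made for this example. Given binary predictions of which neighbouring cells belong to the same layout element, regions follow from connected components, and each component yields exactly one box.

```python
from typing import Dict, List, Sequence, Tuple

Box = Tuple[int, int, int, int]  # (x_min, y_min, x_max, y_max) in cell units


def boxes_from_connectivity(
    fg: Sequence[Sequence[int]],
    right: Sequence[Sequence[int]],
    down: Sequence[Sequence[int]],
) -> List[Box]:
    """Group grid cells into regions using predicted neighbour links.

    Hypothetical inputs (assumed for illustration only):
      fg[r][c]    -- 1 if cell (r, c) belongs to some layout element
      right[r][c] -- 1 if (r, c) is linked to (r, c + 1)
      down[r][c]  -- 1 if (r, c) is linked to (r + 1, c)
    """
    rows, cols = len(fg), len(fg[0])
    parent = list(range(rows * cols))

    def find(x: int) -> int:
        # Union-find lookup with path halving.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    def union(a: int, b: int) -> None:
        ra, rb = find(a), find(b)
        if ra != rb:
            parent[rb] = ra

    # Merge foreground cells that are linked to a right or lower neighbour.
    for r in range(rows):
        for c in range(cols):
            if not fg[r][c]:
                continue
            if c + 1 < cols and fg[r][c + 1] and right[r][c]:
                union(r * cols + c, r * cols + c + 1)
            if r + 1 < rows and fg[r + 1][c] and down[r][c]:
                union(r * cols + c, (r + 1) * cols + c)

    # One bounding box per connected component; no scores, no overlap pruning.
    boxes: Dict[int, Box] = {}
    for r in range(rows):
        for c in range(cols):
            if not fg[r][c]:
                continue
            root = find(r * cols + c)
            x0, y0, x1, y1 = boxes.get(root, (c, r, c, r))
            boxes[root] = (min(x0, c), min(y0, r), max(x1, c), max(y1, r))
    return sorted(boxes.values())


# Toy 3x4 grid: a 2x2 block on the left and a 1x2 column on the right.
fg = [[1, 1, 0, 1], [1, 1, 0, 1], [0, 0, 0, 0]]
right = [[1, 0, 0, 0], [1, 0, 0, 0], [0, 0, 0, 0]]
down = [[1, 1, 0, 1], [0, 0, 0, 0], [0, 0, 0, 0]]
result = boxes_from_connectivity(fg, right, down)
```

Because every foreground cell ends up in exactly one component, duplicate overlapping boxes cannot arise in this sketch, which is the property that makes score-based pruning unnecessary.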
Paper Type: Long
Research Area: Information Extraction
Research Area Keywords: Document Layout Analysis
Contribution Types: Publicly available software and/or pre-trained models
Languages Studied: The dataset used contains multiple languages, including English, Arabic, and Chinese.
Submission Number: 1700