Understand Layout and Translate Text: Unified Feature-Conductive End-to-End Document Image Translation

Published: 2025, Last Modified: 04 Nov 2025. IEEE Trans. Pattern Anal. Mach. Intell. 2025. License: CC BY-SA 4.0
Abstract: Document Image Translation (DIT) aims to translate the text in document images from one language to another. It is a multi-modal task that requires text and layout to work together. Current approaches either handle layout and translation as separate processes, risking error accumulation, or use vanilla end-to-end encoder-decoder models that capture layout only implicitly and often incorporate it inadequately. We argue that a favorable framework should explicitly engage layout-specific modules and properly organize them toward translation. To this end, we first revisit two key layouts: the geometric layout, which reflects words' spatial positions, and the logical layout, which describes words' logical order. We then define a novel pipeline (understand layout $\rightarrow$ translate text) that places layout understanding first, so that the preceding layout stages contribute to translation. Following this pipeline, we introduce Unified Document Image Translation (UniDIT), a comprehensive framework that unifies layout with translation in one network. It is designed to leverage each module's advantage and to provide an elaborate feature-conductive flow for global communication between modules. A novel bridging mechanism is also introduced to adapt layout features so that they are conducive to translation. We further contribute DITransv2, a large-scale, fine-grained benchmark that includes heterogeneous and complex document layouts. Extensive experiments on DITransv2 and additional established benchmarks demonstrate that UniDIT outperforms previous state-of-the-art methods in all aspects.