Reading When Translating: Multi-Modal Document Image Machine Translation With Reading Flow Prediction
Abstract: Document Image Translation (DIT) aims to translate documents in images from one language to another. It is a multi-modal task that involves the cooperation of text, visual layout, and logical reading order. However, existing text-based or vision-based methods rely solely on textual or visual features, and layout-based methods, while multi-modal, largely overlook the crucial logical reading order. To fully leverage the multi-modal information and employ explicit modules that learn a better logical reading order for DIT, this paper proposes the "reading-when-translating" guideline: the translation process is coupled with an auxiliary "reading" process so that logical reading order directly contributes to translation. Following this guideline, we propose the Document Reading and Translation Network (DocRTN), a novel unified framework that seamlessly integrates reading order into DIT, enabling the model to handle complex layouts by "reading" text in a human-like, coherent sequence. The framework comprises a reading flow decoder and a translation decoder, and a novel feature decorator harmonizes the reading and translation channels, ensuring that reading features are optimally adapted for translation. Extensive comparisons and analyses on 5 domains and 4 translation directions demonstrate that DocRTN outperforms previous state-of-the-art methods in all aspects.
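The abstract describes a pipeline in which a reading flow decoder predicts the order of text regions, a feature decorator re-sequences features along that order, and a translation decoder consumes the result. A minimal sketch of that data flow is below; every function name, the top-to-bottom/left-to-right ordering heuristic, and the toy dictionary "translator" are illustrative assumptions, not the paper's actual architecture.

```python
# Hypothetical sketch of the reading-then-translating data flow.
# All names and heuristics below are assumptions for illustration only.

def predict_reading_flow(blocks):
    """Reading flow decoder (stub): order text blocks top-to-bottom,
    then left-to-right, as a stand-in for a learned reading-order model."""
    return sorted(range(len(blocks)), key=lambda i: (blocks[i]["y"], blocks[i]["x"]))

def decorate_features(blocks, order):
    """Feature decorator (stub): re-sequence block features along the
    predicted reading order before they reach the translation decoder."""
    return [blocks[i]["text"] for i in order]

def translate(sequence):
    """Translation decoder (stub): a toy word-level lookup standing in
    for an autoregressive decoder."""
    glossary = {"Hallo": "Hello", "Welt": "world"}
    return [glossary.get(token, token) for token in sequence]

# Two text blocks stored out of reading order, with layout coordinates.
blocks = [
    {"text": "Welt", "x": 0, "y": 10},   # visually the second line
    {"text": "Hallo", "x": 0, "y": 0},   # visually the first line
]
order = predict_reading_flow(blocks)          # -> [1, 0]
result = translate(decorate_features(blocks, order))
print(result)                                  # -> ['Hello', 'world']
```

The point of the sketch is only the ordering of stages: reading-order prediction happens first, and translation operates on the re-sequenced features rather than on the raw block order.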
External IDs: doi:10.1109/taslpro.2025.3578754