Reading When Translating: Multi-Modal Document Image Machine Translation With Reading Flow Prediction

Zhiyang Zhang, Yaping Zhang, Yupu Liang, Cong Ma, Lu Xiang, Yang Zhao, Yu Zhou, Chengqing Zong

Published: 01 Jan 2025, Last Modified: 04 Nov 2025, IEEE Transactions on Audio, Speech and Language Processing, CC BY-SA 4.0
Abstract: Document Image Translation (DIT) aims to translate documents in images from one language to another. It is a multi-modal task that involves the cooperation of text, visual layout, and logical reading order. However, existing text-based or vision-based methods rely solely on textual or visual features. Layout-based methods are multi-modal but largely overlook the crucial logical reading order. To fully leverage the multi-modal information and exploit explicit modules to learn a better logical reading order for DIT, this paper proposes the “reading-when-translating” guideline. It couples the translation process with an auxiliary “reading” process so that the logical reading order directly contributes to translation. Following this guideline, we propose the Document Reading and Translation Network (DocRTN), a novel unified framework that seamlessly integrates reading order into DIT, enabling the model to handle complex layouts by “reading” text in a human-like, coherent sequence. The unified framework comprises a reading flow decoder and a translation decoder. A novel feature decorator is proposed to harmonize the reading and translation channels, ensuring that reading features are optimally adapted for translation. Extensive comparisons and analyses on 5 domains and 4 translation directions demonstrate that DocRTN outperforms previous state-of-the-art methods in all aspects.
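The abstract describes a pipeline in which a reading flow decoder predicts the order of layout regions, a feature decorator adapts the reordered features, and a translation decoder consumes them. The following is a minimal, purely illustrative sketch of that data flow; it is not the authors' implementation, and every function (`reading_flow_decoder`, `feature_decorator`, `translation_decoder`) and the toy scoring scheme are assumptions made for illustration only.

```python
import numpy as np

rng = np.random.default_rng(0)

def reading_flow_decoder(region_feats):
    # Hypothetical stand-in: score each layout region and sort by score,
    # yielding a predicted reading order (a permutation of region indices).
    scores = region_feats @ rng.normal(size=region_feats.shape[1])
    return np.argsort(scores)

def feature_decorator(region_feats, order):
    # Reorder region features by the predicted reading flow and add a
    # simple positional signal, so the translation decoder sees regions
    # in "reading" sequence rather than raw layout order.
    ordered = region_feats[order]
    pos = np.arange(len(order))[:, None] / max(len(order), 1)
    return ordered + pos  # broadcast positional offset onto features

def translation_decoder(adapted_feats):
    # Toy decoder: collapse each adapted region feature to one context
    # value; a real decoder would autoregressively generate target text.
    return adapted_feats.mean(axis=1)

region_feats = rng.normal(size=(5, 8))   # 5 layout regions, 8-dim features
order = reading_flow_decoder(region_feats)
context = translation_decoder(feature_decorator(region_feats, order))
print(order.shape, context.shape)
```

The key point the sketch captures is that the reading prediction happens before translation, so the translation step operates on features already arranged in the predicted reading order.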