Immersive Multimodal Translation: A Proxy Task for Cross-modal and Objective Evaluation of Unified Models
Keywords: Multimodal Machine Translation, Unified Multimodal Models
Abstract: Unified multimodal models that jointly perform image understanding and generation have made substantial progress, yet establishing rigorous evaluation protocols remains a critical challenge. Existing benchmarks typically assess generation and understanding independently and rely on multimodal large language models (MLLMs) for scoring. Such approaches introduce language-centric biases and lack objective ground truth, limiting the reliability and fairness of model assessment. To address this, we propose Immersive Multimodal Translation (IMT), a novel proxy task that requires models to translate the textual content within an image while preserving its visual context. IMT naturally captures the cross-modal synergy between understanding and generation, while enabling transparent, objective evaluation through established metrics from natural language processing and computer vision. To support systematic study, we construct IMTBench, a benchmark spanning three scenarios (document, webpage, and scene images), nine languages, and 2,000 carefully curated samples. IMTBench incorporates a three-dimensional evaluation framework that measures translation quality, background fidelity, and visual text rendering accuracy. Extensive experiments across diverse unified multimodal architectures, supported by a companion dataset, IMT-1M, reveal that current open-source models still fall significantly short of commercial expert systems. By providing objective, cross-modal evaluation protocols, we believe IMT and IMTBench can offer actionable guidance for future research in unified multimodal intelligence.
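To make the three-dimensional evaluation concrete, below is a minimal sketch of how such scores could be computed per sample, assuming illustrative stand-in metrics (sentence-level BLEU via sacrebleu for translation quality, SSIM for background fidelity, and a character-error-rate-based score for text rendering); the abstract does not specify the exact metrics, OCR of the generated image is assumed to be done externally, and the function name `evaluate_sample` is hypothetical.

```python
import numpy as np
import sacrebleu                                   # pip install sacrebleu
from skimage.metrics import structural_similarity  # pip install scikit-image


def edit_distance(a: str, b: str) -> int:
    """Plain Levenshtein distance, used for a character error rate (CER)."""
    dp = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        prev, dp[0] = dp[0], i
        for j, cb in enumerate(b, 1):
            prev, dp[j] = dp[j], min(dp[j] + 1, dp[j - 1] + 1, prev + (ca != cb))
    return dp[-1]


def evaluate_sample(ref_translation: str,
                    ocr_translation: str,
                    ref_background: np.ndarray,
                    gen_background: np.ndarray) -> dict:
    """Score one IMT output along the three assumed evaluation axes."""
    # 1) Translation quality: BLEU between the text read back from the
    #    generated image and the reference translation.
    bleu = sacrebleu.corpus_bleu([ocr_translation], [[ref_translation]]).score

    # 2) Background fidelity: SSIM between reference and generated images
    #    (uint8 grayscale arrays of identical shape assumed here).
    ssim = structural_similarity(ref_background, gen_background)

    # 3) Visual text rendering accuracy: 1 - character error rate of the
    #    OCR'd text against the reference translation (illustrative proxy).
    cer = edit_distance(ocr_translation, ref_translation) / max(len(ref_translation), 1)
    render_acc = max(0.0, 1.0 - cer)

    return {"bleu": bleu, "ssim": ssim, "render_acc": render_acc}
```

In this sketch all three scores are reference-based and computed without an MLLM judge, which is the property the task is designed to provide; the particular metric choices above are placeholders, not the benchmark's prescribed ones.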
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 277