Virtual Visual-Guided Domain-Shadow Fusion via Modal Exchanging for Domain-Specific Multi-Modal Neural Machine Translation
Abstract: Incorporating domain-specific visual information into text is one of the critical challenges for domain-specific multi-modal neural machine translation (DMNMT). Most existing DMNMT methods borrow multi-modal fusion frameworks from general-domain multi-modal neural machine translation (MNMT), overlooking the domain gap between general and specific domains. Visual-to-textual interaction in a specific domain frequently exhibits multi-focus characteristics, making it difficult for traditional multi-modal fusion frameworks to consistently attend to domain-specific multi-visual details, which in turn degrades translation performance on domain-specific terms. To tackle this problem, this paper presents a virtual visual scene-guided domain-shadow multi-modal fusion mechanism that simultaneously integrates multi-grained domain visual details and text under the guidance of a modality-agnostic virtual visual scene, thereby enhancing DMNMT performance, especially on domain terms. Specifically, we first adopt a modality-mixing selection-voting strategy to generate modality-mixed domain-shadow representations through layer-by-layer intra-modality selection and inter-modality exchanging. Then, we gradually aggregate the modality-mixed domain representations and text across modality boundaries under the guidance of the modality-agnostic virtual visual scene to strengthen the collaboration between domain characteristics and textual semantics. Experimental results on three benchmark datasets demonstrate that our approach outperforms state-of-the-art (SOTA) methods on all machine translation tasks, and in-depth analysis further highlights its robustness and generalizability across various scenarios. Our code is available at https://github.com/HZY2023/VVDF.
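To make the two-stage mechanism described above more concrete, the following is a minimal, illustrative PyTorch sketch of "layer-by-layer intra-modality selection and inter-modality exchanging" followed by virtual-scene-guided aggregation. It is not the authors' released implementation: all module and parameter names (SelectExchangeLayer, VirtualSceneGuidedFusion, gate_t, gate_v, exchange_ratio, scene_query) are hypothetical, and the quantile-based exchange rule and equal text/visual sequence lengths are simplifying assumptions made for illustration only.

```python
# Illustrative sketch only; names and the exchange rule are assumptions, not the paper's code.
import torch
import torch.nn as nn


class SelectExchangeLayer(nn.Module):
    """One layer of intra-modality selection followed by inter-modality exchanging."""

    def __init__(self, d_model: int, exchange_ratio: float = 0.25):
        super().__init__()
        # Per-modality scalar gates score how informative each token/region is.
        self.gate_t = nn.Linear(d_model, 1)   # text gate
        self.gate_v = nn.Linear(d_model, 1)   # visual gate
        self.exchange_ratio = exchange_ratio

    def forward(self, text: torch.Tensor, vision: torch.Tensor):
        # text: (B, L, d), vision: (B, L, d); equal lengths assumed for simplicity.
        s_t = torch.sigmoid(self.gate_t(text))     # (B, L, 1) selection scores
        s_v = torch.sigmoid(self.gate_v(vision))   # (B, L, 1)

        # Intra-modality selection: weight each position by its own score.
        text_sel = s_t * text
        vis_sel = s_v * vision

        # Inter-modality exchanging: positions whose score falls in the lowest
        # `exchange_ratio` quantile take the other modality's feature instead.
        thr_t = torch.quantile(s_t, self.exchange_ratio, dim=1, keepdim=True)
        thr_v = torch.quantile(s_v, self.exchange_ratio, dim=1, keepdim=True)
        text_out = torch.where(s_t < thr_t, vis_sel, text_sel)
        vis_out = torch.where(s_v < thr_v, text_sel, vis_sel)
        return text_out, vis_out


class VirtualSceneGuidedFusion(nn.Module):
    """Aggregate modality-mixed features under a modality-agnostic scene query."""

    def __init__(self, d_model: int, n_layers: int = 3, n_heads: int = 8):
        super().__init__()
        self.layers = nn.ModuleList([SelectExchangeLayer(d_model) for _ in range(n_layers)])
        self.scene_query = nn.Parameter(torch.randn(1, 1, d_model))  # virtual visual scene
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, text: torch.Tensor, vision: torch.Tensor) -> torch.Tensor:
        # Layer-by-layer selection and exchanging produce modality-mixed features.
        for layer in self.layers:
            text, vision = layer(text, vision)
        mixed = torch.cat([text, vision], dim=1)              # (B, 2L, d)
        scene = self.scene_query.expand(text.size(0), -1, -1)
        # The virtual scene acts as a modality-agnostic query over the mixed features.
        fused, _ = self.attn(scene, mixed, mixed)              # (B, 1, d)
        return fused


if __name__ == "__main__":
    B, L, D = 2, 10, 512
    fusion = VirtualSceneGuidedFusion(D)
    out = fusion(torch.randn(B, L, D), torch.randn(B, L, D))
    print(out.shape)  # torch.Size([2, 1, 512])
```

In this sketch, the fused representation would be consumed by the translation decoder; the actual model presumably conditions the decoder on both the fused vector and the original text states, which is omitted here.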
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Content] Multimodal Fusion
Relevance To Conference: This work addresses a key challenge in multimedia and multimodal processing: integrating domain-specific visual information with text for domain-specific multi-modal neural machine translation (DMNMT). It introduces a novel cross-modal fusion strategy, inter-modality feature exchanging, which offers new insights into multimodal fusion. The proposed virtual visual scene-guided domain-shadow fusion mechanism substantially improves the interaction between visual details and textual semantics, which is essential for multimedia applications that require accurate, contextually relevant translations of domain-specific terms. Together with a modality-mixing selection-voting approach and the guidance of a modality-agnostic virtual visual scene, this strategy advances our multimodal fusion framework and yields significant improvements on domain-specific and ambiguous translation tasks. Moreover, the robustness and wide applicability of our method are confirmed by its superior performance in various settings, including general-domain image-text pairings, noisy visual environments, and text-only scenarios. This contribution not only advances DMNMT but also adds a new dimension to future multimedia/multimodal research through its fresh take on cross-modal fusion.
Supplementary Material: zip
Submission Number: 4493