Keywords: Speech Translation, Simultaneous Speech Translation, Multimodal Translation
Abstract: Open-source text-only translation Large Language Models (LLMs) are rapidly improving in multilingual coverage and translation quality, but their unimodal nature limits their use in multimodal translation. In speech translation (ST), they must operate in cascaded pipelines in which automatic speech recognition is followed by translation, which introduces additional latency, especially for simultaneous ST (SimulST). In both speech and caption translation, they cannot exploit visual context for disambiguation.
In contrast, pretrained multimodal foundation models (MMFMs) offer strong cross-modal perception and reasoning but lack the multilingual depth and translation performance of specialized translation LLMs. This mismatch motivates combining the strengths of both model types.
We propose OmniFusion, an end-to-end framework that fuses MMFMs with translation LLMs via a novel multi-layer hidden-state fusion strategy, enabling joint training while remaining data and resource efficient. Built with Omni 2.5–7B as the MMFM and SeedX PPO–7B as the translation LLM, OmniFusion handles speech-to-text, speech-and-image-to-text, and text-and-image-to-text translation within a single architecture. Experiments show that OmniFusion effectively leverages audio and visual cues, reduces SimulST latency by one second compared to cascaded pipelines, and improves overall translation quality.
Paper Type: Long
Research Area: Machine Translation
Research Area Keywords: Cross-Modal machine translation, Speech Translation, Efficient MT training
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: English, German, Chinese, Italian, Arabic, French, Russian, Czech
Submission Number: 3880