OmniVIVO: Towards Unified Multimodal Generative Modeling for Simultaneous Language-Guided Speech and Image Synthesis

ICLR 2026 Conference Submission15354 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Multimodal Generation, High-Fidelity Image Generation, Text-to-Speech
TL;DR: OmniVIVO is a unified large language model that simultaneously generates high-fidelity images and natural speech from a single text input.
Abstract: Recent large language models (LLMs) based on autoregressive (AR) next-token prediction have achieved remarkable success in natural language generation and are rapidly expanding to image and speech synthesis. Yet most current approaches still treat these modalities in isolation, either training independent models or loosely coupling multiple generators. Even recent omni-models such as UGen and Qwen2.5-Omni mainly address understanding tasks or text–image generation and do not provide a single AR backbone capable of simultaneously producing high-fidelity images and natural speech. Inspired by the human brain’s ability to imagine and speak at the same time, we propose OmniVIVO, a unified AR approach that models the visual and voice modalities together and generates high-fidelity images and natural speech in parallel from a single text input. OmniVIVO integrates a state-of-the-art AR image generator with a novel lightweight speech decoder, yielding the first unified approach to the concurrent generation of natural speech and high-fidelity images. By sharing representations across modalities within a single transformer backbone, the model learns a rich multimodal space that enables tighter semantic alignment and more efficient joint generation than existing multi-model pipelines. With this unified backbone, OmniVIVO produces speech with high perceptual quality and naturalness, surpassing comparably sized text-to-speech (TTS) systems and performing on par with state-of-the-art systems such as CosyVoice2 and VITS, while maintaining high-fidelity image generation. To quantify contextual understanding across modalities, we propose a multimodal ranking metric spanning text, speech, and images, and use it to show that OmniVIVO’s bi-modal outputs are effective for information acquisition.
We also construct VIVOGen, a high-quality tri-modal text–image–speech dataset built from OmniVIVO’s multimodal outputs, which we will publicly release as a resource for research in multimodal learning and for applications in education and language acquisition.
Supplementary Material: zip
Primary Area: generative models
Submission Number: 15354