Idea Visual: Intent-Driven View Synthesis for Smart Mobile Devices via Retrieval-Augmented Diffusion Models

Published: 2025 · Last Modified: 07 Jan 2026 · IEEE Trans. Consumer Electron. 2025 · License: CC BY-SA 4.0
Abstract: Although generative novel view synthesis frameworks can already produce target views from specified viewpoints, they still rely on orientation conditions supplied either directly or indirectly as input. This reliance limits a model's ability to understand and reason about users' subjective intent when deployed on smart mobile devices. To address this issue, we propose a new task called intent-driven view synthesis (IDVS) and introduce Idea Visual, a fine-tuned multimodal retrieval-augmented 2D generative model tailored specifically for IDVS. The core component of our model is the multimodal intent-driven diffusion model (MIDDM), which inherits the generative capabilities of viewpoint-conditioned diffusion models while also integrating natural language understanding. This enables the model to retrieve the intended viewpoint from user instructions and generate the corresponding multi-view images with high accuracy. To train and evaluate Idea Visual, we construct a benchmark comprising over 1.6k object categories and 13k view-instruction pairs. With this dataset, the model not only learns to retrieve user intent from natural language instructions but also unlocks powerful view synthesis capabilities in complex scenes. Both quantitative and qualitative experiments demonstrate that the proposed model achieves the objectives of the IDVS task, with the quality of generated images improving significantly over current state-of-the-art baselines. Further experiments indicate that, when provided with a set of omnidirectional descriptive instructions for a single object, the method generates images with high 3D consistency, significantly enhancing user experience and the intelligence of consumer electronics products.
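To make the retrieve-then-generate idea in the abstract concrete, the following is a minimal sketch of such a pipeline: a text encoder embeds the user instruction, the intended viewpoint is retrieved by cosine similarity against a bank of view descriptions, and the retrieved viewpoint then conditions a denoising step. All module and variable names (TextEncoder, retrieve_view, ViewConditionedDenoiser) are hypothetical stand-ins for illustration, not the paper's actual MIDDM implementation, and the toy encoders are randomly initialized rather than pretrained.

    # Hypothetical sketch of an intent-driven retrieval-augmented pipeline.
    # None of these names come from the paper; they illustrate the idea only.
    import torch
    import torch.nn as nn
    import torch.nn.functional as F

    class TextEncoder(nn.Module):
        """Toy instruction encoder (stand-in for a CLIP-style text tower)."""
        def __init__(self, vocab=1000, dim=64):
            super().__init__()
            self.emb = nn.EmbeddingBag(vocab, dim)  # mean-pools token embeddings

        def forward(self, token_ids):
            return F.normalize(self.emb(token_ids), dim=-1)

    def retrieve_view(instruction_emb, view_bank_embs, view_params):
        """Pick the candidate viewpoint whose description best matches the intent."""
        sims = instruction_emb @ view_bank_embs.T   # cosine similarity (embeddings are normalized)
        best = sims.argmax(dim=-1)
        return view_params[best]                    # e.g. (azimuth, elevation) in degrees

    class ViewConditionedDenoiser(nn.Module):
        """Toy noise predictor conditioned on the retrieved viewpoint."""
        def __init__(self, dim=64):
            super().__init__()
            self.cond = nn.Linear(2, dim)           # project viewpoint into latent space
            self.net = nn.Sequential(nn.Linear(dim, dim), nn.ReLU(), nn.Linear(dim, dim))

        def forward(self, x_t, view):
            return self.net(x_t + self.cond(view))

    # --- usage ---
    torch.manual_seed(0)
    enc = TextEncoder()
    instruction = torch.randint(0, 1000, (1, 8))    # tokenized instruction, e.g. "show the chair from behind"
    bank_texts = torch.randint(0, 1000, (4, 8))     # tokenized descriptions of candidate views
    view_params = torch.tensor([[0., 0.], [90., 0.], [180., 0.], [270., 0.]])

    q = enc(instruction)
    bank = enc(bank_texts)
    view = retrieve_view(q, bank, view_params)      # natural-language intent -> explicit viewpoint

    denoiser = ViewConditionedDenoiser()
    x_t = torch.randn(1, 64)                        # latent at one denoising step
    eps = denoiser(x_t, view)                       # viewpoint-conditioned noise prediction
    print(view, eps.shape)

The key design point the sketch tries to capture is that the viewpoint is never supplied by the user directly: it is recovered from free-form language via retrieval and only then injected as the conditioning signal a conventional viewpoint-based diffusion model expects.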