VMDiff: Visual Mixing Diffusion for Limitless Cross-Object Synthesis

Published: 26 Jan 2026, Last Modified: 27 Feb 2026
Venue: ICLR 2026 Poster
License: CC BY 4.0
Keywords: Image generation, Image concept fusion
Abstract: Creating novel images by fusing visual cues from multiple sources is a fundamental yet underexplored problem in image-to-image generation, with broad applications in artistic creation, virtual reality, and visual media. Existing methods often face two key challenges: coexistent generation, where multiple objects are simply juxtaposed without true integration, and bias generation, where one object dominates the output due to semantic imbalance. To address these issues, we propose **Visual Mixing Diffusion (VMDiff)**, a simple yet effective diffusion-based framework that synthesizes a single, coherent object by integrating two input images at both the noise and latent levels. Our approach comprises two components: (1) a **hybrid sampling process** that combines guided denoising, inversion, and spherical interpolation with adjustable parameters to achieve structure-aware fusion, mitigating coexistent generation; and (2) an **efficient adaptive adjustment module**, which introduces a novel similarity-based score to automatically and adaptively search for optimal parameters, countering semantic bias. Experiments on a curated benchmark of 780 concept pairs demonstrate that our method outperforms strong baselines in visual quality, semantic consistency, and human-rated creativity.
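
To make the fusion step concrete, below is a minimal sketch of spherical interpolation (slerp) between two noise latents, the basic operation the hybrid sampling process builds on. The function, tensor shapes, and the weight `alpha` are illustrative assumptions, not code from the paper; the actual method additionally interleaves guided denoising and inversion steps that this sketch omits.

```python
import torch

def slerp(z0: torch.Tensor, z1: torch.Tensor, alpha: float) -> torch.Tensor:
    """Spherical linear interpolation between two latent tensors.

    Treats the flattened latents as points on a hypersphere and moves
    along the great circle between them, so the mixed latent keeps a
    norm consistent with Gaussian noise (unlike plain linear mixing,
    which shrinks it).
    """
    z0_flat, z1_flat = z0.flatten(), z1.flatten()
    # Angle between the two latents.
    cos_theta = torch.dot(z0_flat, z1_flat) / (z0_flat.norm() * z1_flat.norm())
    theta = torch.acos(cos_theta.clamp(-1.0, 1.0))
    if theta.abs() < 1e-4:
        # Nearly parallel latents: fall back to linear interpolation.
        return (1 - alpha) * z0 + alpha * z1
    sin_theta = torch.sin(theta)
    return (torch.sin((1 - alpha) * theta) / sin_theta) * z0 + (
        torch.sin(alpha * theta) / sin_theta
    ) * z1

# Hypothetical usage: z_a, z_b stand in for inverted latents of the two inputs.
z_a = torch.randn(1, 4, 64, 64)
z_b = torch.randn(1, 4, 64, 64)
z_mix = slerp(z_a, z_b, alpha=0.5)  # alpha balances the two concepts
```

Slerp is the natural choice here because high-dimensional Gaussian noise concentrates near a sphere of fixed radius; linear interpolation would pull the mixed latent off that shell and degrade sampling quality. In the paper's framing, `alpha` is one of the adjustable parameters the adaptive adjustment module searches over using its similarity-based score.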
Supplementary Material: pdf
Primary Area: generative models
Submission Number: 6942