MCD-RRG: Time-Varying Multimodal Fusion and Residual Retrieval Guidance for Conditional Diffusion
Keywords: Diffusion models, Multimodal generation, Dynamic modality fusion, Retrieval-based guidance, Image synthesis
Abstract: Multimodal conditional diffusion combines text with structural signals such as sketches and depth, but many practical pipelines still rely on fixed fusion rules and retrieval procedures that do not adapt across denoising timesteps. We study whether a learned timestep-dependent fusion policy and lightweight residual retrieval can improve the semantic-structural trade-off under a fixed backbone. We propose MCD-RRG, which couples a sample-aware $\alpha$-gating network for time-varying multimodal fusion with a residual retrieval guidance (RRG) module that applies small latent corrections retrieved from a train-split FAISS-HNSW bank. Under a unified COCO-100 protocol, Sigmoid+renorm gating improves on static equal-weight fusion, lowering FID from 31.2 to 26.1 and LPIPS from 0.204 to 0.176 and raising the structural score from 0.681 to 0.735. Relative to disabling retrieval, step-wise RRG lowers FID from 29.8 to 26.1 and raises the structural score from 0.671 to 0.735 at the cost of a 4.5% latency increase. On LAION-Depth-5k, a strict zero-shot protocol using a COCO-trained bank lowers FID from 25.1 to 23.7 and raises the structural score from 0.703 to 0.724 while keeping the near-duplicate rate at 0.7%. These results indicate that learned time-varying fusion and lightweight step-wise retrieval are complementary tools for stabilizing multimodal diffusion.
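The two mechanisms named in the abstract can be illustrated with a minimal sketch. It is not the authors' implementation: the function names, the `scale` parameter, and the array shapes are hypothetical. The gating logits are passed in directly instead of being produced by a timestep- and sample-conditioned network, and a brute-force L2 search stands in for the FAISS-HNSW bank lookup.

```python
import numpy as np


def sigmoid_renorm_gate(logits):
    """Turn per-modality logits into fusion weights that sum to 1.

    Assumption: "Sigmoid+renorm" means an independent sigmoid per modality
    followed by division by the sum of the gates.
    """
    gates = 1.0 / (1.0 + np.exp(-np.asarray(logits, dtype=float)))
    return gates / gates.sum(axis=-1, keepdims=True)


def fuse(cond_feats, logits):
    """Weighted sum of modality features.

    cond_feats: (M, D) array, one row per modality (e.g. text, sketch, depth).
    logits:     (M,) array; in the paper's setting these would vary with the
                denoising timestep and the sample (hypothetical input here).
    """
    alpha = sigmoid_renorm_gate(logits)
    return alpha @ np.asarray(cond_feats, dtype=float)


def residual_retrieval_step(z, bank, scale=0.05):
    """Apply a small latent correction toward the nearest bank entry.

    z:     (D,) current latent.
    bank:  (N, D) retrieval bank; brute-force search replaces FAISS-HNSW.
    scale: hypothetical step size that keeps the correction small.
    """
    z = np.asarray(z, dtype=float)
    bank = np.asarray(bank, dtype=float)
    dists = np.linalg.norm(bank - z, axis=1)
    nearest = bank[np.argmin(dists)]
    return z + scale * (nearest - z)
```

With equal logits the gate reduces to static equal-weight fusion, which is the baseline the abstract compares against; a learned policy would move the weights away from that point as denoising proceeds.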
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 37