LABridge: Text–Image Latent Alignment Framework via Mean-Conditioned OU Process

Huiyang Shao; Xin Xia; Yuxi Ren; XING WANG; Xuefeng Xiao

Published: 18 Sept 2025; Last Modified: 29 Oct 2025; Venue: NeurIPS 2025 spotlight; License: CC BY-NC-ND 4.0

Keywords: diffusion, alignment, OU Process

Abstract: Diffusion models have emerged as the state of the art in image synthesis. However, they often suffer from semantic instability and slow iterative denoising. We introduce LABridge, a novel Text–Image Latent Alignment Framework built on a Mean-Conditioned Ornstein–Uhlenbeck (OU) process, which explicitly preserves and aligns textual and visual semantics in a shared latent space. LABridge employs a Text-Image Alignment Encoder (TIAE) to encode text prompts into structured priors that are directly aligned with image latents. Instead of starting from a homogeneous Gaussian prior, the Mean-Conditioned OU process smoothly interpolates between these text‑conditioned priors and image latents, improving stability and reducing the number of denoising steps. Extensive experiments on standard text-to-image benchmarks show that LABridge achieves better text–image alignment scores and competitive FID scores compared to leading diffusion baselines. By unifying text and image representations through principled latent alignment, LABridge paves the way for more efficient, semantically consistent, and high‑fidelity text-to-image generation.
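
To make the idea of a mean-conditioned OU process concrete, here is a minimal sketch of a generic mean-reverting OU process, dx = theta * (mu - x) dt + sigma dW, sampled in closed form. It is not the paper's formulation: the abstract gives no equations, so the function name `ou_forward_sample`, the parameters `theta` and `sigma`, and the use of random vectors as stand-ins for the image latent `x0` and the text-conditioned prior mean `mu` are all assumptions made for illustration.

```python
import numpy as np

def ou_forward_sample(x0, mu, t, theta=1.0, sigma=1.0, rng=None):
    """Sample x_t of a standard OU process given x_0 (hypothetical helper).

    SDE (assumed, textbook form): dx = theta * (mu - x) dt + sigma dW.
    Closed form: x_t ~ N(mu + (x0 - mu) * exp(-theta t),
                         sigma^2 * (1 - exp(-2 theta t)) / (2 theta)).
    """
    rng = np.random.default_rng() if rng is None else rng
    decay = np.exp(-theta * t)
    # The mean moves from x0 (t = 0) toward mu (t -> infinity).
    mean = mu + (x0 - mu) * decay
    # The standard deviation grows from 0 to sigma / sqrt(2 * theta).
    std = sigma * np.sqrt((1.0 - np.exp(-2.0 * theta * t)) / (2.0 * theta))
    sample = mean + std * rng.standard_normal(np.shape(x0))
    return sample, mean, std

# Stand-ins (assumptions): x0 plays the image latent, mu the text prior mean.
rng = np.random.default_rng(0)
x0 = rng.standard_normal(8)
mu = rng.standard_normal(8)
x_mid, mean_mid, std_mid = ou_forward_sample(x0, mu, t=0.5, rng=rng)
```

In this sketch the terminal distribution is centered on `mu` rather than on zero, which is the sense in which the endpoint differs from a homogeneous Gaussian; how LABridge parameterizes and learns this process is described only in the paper itself.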

Supplementary Material: zip

Primary Area: Applications (e.g., vision, language, speech and audio, Creative AI)

Submission Number: 10911
