Keywords: probabilistic image generation, rectified flow, large language models, structured priors, dual-stream encoder, single-stream decoder, semantic alignment, text-to-image synthesis, generative modeling, CLIP similarity, FID, multimodal alignment
TL;DR: Using LLMs as structured priors with dual-stream encoding and single-stream decoding improves training efficiency and image quality in probabilistic image generation.
Abstract: Prior works have investigated the integration of large language models (LLMs) with rectified flow for image synthesis, but systematic studies remain scarce. In this study, we examine how controlling the interaction between stochastic and semantic inputs during encoding, while integrating them during decoding, influences the alignment between noised latents and LLM hidden states. Our investigation shows that architectural refinements, such as dual-stream encoding and single-stream decoding, can accelerate training and improve image quality relative to LLM-adapted rectified flow baselines. We evaluate our approach on standard image benchmarks and observe gains in both training speed and output detail preservation, indicating that structural choices in the integration of LLM features matter for probabilistic inference in generative modeling. Beyond empirical improvements, our findings contribute to understanding how foundation models trained on text can be adapted as structured probabilistic priors in visual domains. These results highlight a promising direction at the intersection of LLMs, rectified flow, and probabilistic image synthesis and motivate further exploration.
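Since the abstract describes the architecture only at a high level, the following is a minimal, hypothetical PyTorch sketch of what dual-stream encoding followed by single-stream decoding could look like for rectified flow with LLM hidden states as conditioning. All module names (`StreamBlock`, `DualStreamEncoderSingleStreamDecoder`), dimensions, block counts, and the training step are illustrative assumptions, not the authors' implementation.

```python
# Minimal sketch: separate encoder streams for noised latents and LLM hidden
# states, merged into a single stream for decoding; the decoder predicts a
# rectified-flow velocity for the latent tokens. Assumptions throughout.
import torch
import torch.nn as nn


class StreamBlock(nn.Module):
    """One pre-norm self-attention + MLP block over a token stream (assumed design)."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim))

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))


class DualStreamEncoderSingleStreamDecoder(nn.Module):
    """Encode latent tokens and LLM tokens in separate streams (no cross-stream
    mixing), then concatenate both streams for joint single-stream decoding."""

    def __init__(self, latent_dim: int, llm_dim: int, dim: int = 512,
                 enc_depth: int = 2, dec_depth: int = 2):
        super().__init__()
        self.latent_in = nn.Linear(latent_dim, dim)
        self.llm_in = nn.Linear(llm_dim, dim)
        # Dual-stream encoder: independent stacks per modality.
        self.latent_enc = nn.ModuleList([StreamBlock(dim) for _ in range(enc_depth)])
        self.llm_enc = nn.ModuleList([StreamBlock(dim) for _ in range(enc_depth)])
        # Single-stream decoder: both token sets attend to each other jointly.
        self.decoder = nn.ModuleList([StreamBlock(dim) for _ in range(dec_depth)])
        self.velocity_out = nn.Linear(dim, latent_dim)

    def forward(self, noised_latents: torch.Tensor, llm_hidden: torch.Tensor) -> torch.Tensor:
        z = self.latent_in(noised_latents)   # (B, N_img, dim)
        c = self.llm_in(llm_hidden)          # (B, N_txt, dim)
        for blk in self.latent_enc:
            z = blk(z)
        for blk in self.llm_enc:
            c = blk(c)
        x = torch.cat([z, c], dim=1)         # merge streams for joint decoding
        for blk in self.decoder:
            x = blk(x)
        # Only the latent positions carry the rectified-flow velocity prediction.
        return self.velocity_out(x[:, : z.shape[1]])


# Toy rectified-flow training step on random tensors (stand-ins for real data):
model = DualStreamEncoderSingleStreamDecoder(latent_dim=16, llm_dim=4096)
x1 = torch.randn(2, 64, 16)                   # clean image latents
x0 = torch.randn_like(x1)                     # noise sample
t = torch.rand(2, 1, 1)                       # per-sample time
xt = (1 - t) * x0 + t * x1                    # linear interpolation path
v_pred = model(xt, torch.randn(2, 32, 4096))  # random stand-in for LLM hidden states
loss = ((v_pred - (x1 - x0)) ** 2).mean()     # regress the constant velocity x1 - x0
loss.backward()
```

The sketch only illustrates the structural separation the abstract emphasizes: streams interact during decoding but not during encoding; how interaction is actually controlled in the paper is not specified here.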
Submission Number: 71