Supplementary Material: zip
Track: Proceedings Track
Keywords: cross-modal alignment, representational alignment, large language models, rectified flow, text-to-image generation, multimodal learning, generative modeling, image synthesis, semantic alignment, unified representation learning
TL;DR: This paper uses LLM priors with a dual-stream encoder and single-stream decoder to boost cross-modal alignment and image quality in generation.
Abstract: Prior works have investigated the integration of large language models (LLMs) with rectified flow for image synthesis, but systematic studies of this integration remain scarce. In this study, we examine how controlling the interaction between stochastic and semantic inputs during encoding, while integrating them during decoding, influences the alignment between noised latents and LLM hidden states. Our investigation shows that architectural refinements, such as dual-stream encoding and single-stream decoding, can accelerate training and improve image quality relative to LLM-adapted rectified flow baselines by enhancing representational similarity between text and visual domains. We evaluate our approach on standard image benchmarks and observe gains in both training speed and output detail preservation, indicating that structural choices in the integration of LLM features matter for cross-modal representational alignment in generative modeling. Beyond empirical improvements, our findings contribute to understanding how foundation models trained on text can develop representations that align with visual domains, revealing insights into the emergence of similar representational structures across distinct modalities. These results highlight a promising direction at the intersection of LLMs, rectified flow, and cross-modal representational analysis, and motivate further exploration of unified representation learning.
Submission Number: 150
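Illustrative sketch: the abstract describes keeping the stochastic (noised latent) and semantic (LLM hidden state) streams separate during encoding and merging them into a single stream for decoding. The PyTorch code below is a minimal, hedged sketch of that pattern; the module names, dimensions, block counts, and the rectified-flow velocity head are illustrative assumptions, not the authors' implementation.

```python
# Hedged sketch of dual-stream encoding + single-stream decoding for
# rectified-flow image generation conditioned on LLM hidden states.
# All names and hyperparameters are assumptions for illustration only.
import torch
import torch.nn as nn


class StreamBlock(nn.Module):
    """A plain pre-norm Transformer block operating on one token stream."""

    def __init__(self, dim: int, heads: int = 8):
        super().__init__()
        self.norm1 = nn.LayerNorm(dim)
        self.attn = nn.MultiheadAttention(dim, heads, batch_first=True)
        self.norm2 = nn.LayerNorm(dim)
        self.mlp = nn.Sequential(
            nn.Linear(dim, 4 * dim), nn.GELU(), nn.Linear(4 * dim, dim)
        )

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        h = self.norm1(x)
        x = x + self.attn(h, h, h, need_weights=False)[0]
        return x + self.mlp(self.norm2(x))


class DualStreamEncoderSingleStreamDecoder(nn.Module):
    """Encode latent and LLM tokens in separate streams, then decode jointly."""

    def __init__(self, latent_dim: int, llm_dim: int, dim: int = 512,
                 enc_depth: int = 2, dec_depth: int = 2):
        super().__init__()
        self.latent_proj = nn.Linear(latent_dim, dim)
        self.llm_proj = nn.Linear(llm_dim, dim)
        # Dual-stream encoding: each modality has its own blocks (no mixing).
        self.latent_encoder = nn.ModuleList(StreamBlock(dim) for _ in range(enc_depth))
        self.llm_encoder = nn.ModuleList(StreamBlock(dim) for _ in range(enc_depth))
        # Single-stream decoding: one shared stack over the concatenated tokens.
        self.decoder = nn.ModuleList(StreamBlock(dim) for _ in range(dec_depth))
        # Assumed rectified-flow parameterization: predict a velocity per latent token.
        self.velocity_head = nn.Linear(dim, latent_dim)

    def forward(self, noised_latents: torch.Tensor, llm_hidden: torch.Tensor) -> torch.Tensor:
        x = self.latent_proj(noised_latents)   # (B, N_img, dim)
        c = self.llm_proj(llm_hidden)          # (B, N_txt, dim)
        for blk in self.latent_encoder:
            x = blk(x)
        for blk in self.llm_encoder:
            c = blk(c)
        joint = torch.cat([x, c], dim=1)       # integrate streams for decoding
        for blk in self.decoder:
            joint = blk(joint)
        # Read out the velocity only on the image-token positions.
        return self.velocity_head(joint[:, : x.shape[1]])


if __name__ == "__main__":
    model = DualStreamEncoderSingleStreamDecoder(latent_dim=16, llm_dim=4096)
    latents = torch.randn(2, 256, 16)      # noised image latents
    llm_states = torch.randn(2, 77, 4096)  # LLM hidden states for the prompt
    print(model(latents, llm_states).shape)  # torch.Size([2, 256, 16])
```

The design choice being illustrated is that cross-modal mixing is deferred to the decoder: the encoder stacks never attend across modalities, while the decoder operates on the concatenated sequence so image tokens can attend to LLM-derived semantic tokens.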