Unifying Understanding and Generation in Vision-Language Models: Advances, Challenges, and Opportunities

TMLR Paper7328 Authors

04 Feb 2026 (modified: 15 Feb 2026) · Under review for TMLR · CC BY 4.0
Abstract: Significant advancements in vision-language models have predominantly followed two divergent trajectories: autoregressive architectures optimized for visual understanding and diffusion-based frameworks designed for high-fidelity generation. This separation, however, hinders the development of truly versatile multimodal agents. Unifying these capabilities is a critical step toward Artificial General Intelligence, as recent findings suggest that effective understanding and generation can reinforce each other. This survey provides a comprehensive overview of the emerging field of unified vision-language models and proposes a systematic taxonomy based on the core visual representation mechanism: \textit{continuous} versus \textit{discrete} visual tokens. For continuous visual tokens, we analyze how models bridge the semantic-visual gap by categorizing integration strategies into Serial Coupling, where LLMs act as planners, and Parallel Coupling, which enables bidirectional interaction. For discrete visual tokens, we contrast Autoregressive approaches, which treat images as a foreign language, with emerging Discrete Diffusion paradigms known for their global consistency and parallel decoding. Beyond architectural analysis, we provide a curated compilation of datasets and benchmarks essential for training and evaluation. Finally, we critically discuss open challenges such as tokenization trade-offs, training stability, and scalability, and outline future directions for building seamless, omni-capable multimodal systems.
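To make the "images as a foreign language" framing concrete, the sketch below (not from the paper; all names such as `tokenize_image` and the toy codebook are illustrative assumptions) shows the two-step pattern that discrete-token autoregressive unified models build on: a VQ-style tokenizer maps image patches to codebook indices, and those indices are then modeled in the same next-token sequence as text tokens.

```python
# Minimal sketch of discrete visual tokenization + a unified token sequence.
# A real system learns the codebook (e.g., a VQ-VAE) and replaces the stub
# predictor with a transformer; this only illustrates the interface.
import numpy as np

rng = np.random.default_rng(0)

CODEBOOK_SIZE = 8   # toy visual vocabulary size (assumed for illustration)
PATCH_DIM = 4       # flattened patch dimensionality (assumed)
codebook = rng.normal(size=(CODEBOOK_SIZE, PATCH_DIM))  # stands in for a learned codebook

def tokenize_image(patches: np.ndarray) -> np.ndarray:
    """Vector quantization: map each patch to its nearest codebook index."""
    dists = ((patches[:, None, :] - codebook[None, :, :]) ** 2).sum(axis=-1)
    return dists.argmin(axis=1)

# A fake image as 6 flattened patches -> 6 discrete "visual words".
image_patches = rng.normal(size=(6, PATCH_DIM))
visual_tokens = tokenize_image(image_patches)

# Unified sequence: text ids are offset so text and visual ranges don't collide,
# letting one autoregressive model predict both modalities token by token.
text_tokens = np.array([3, 1, 4])  # toy text ids
sequence = np.concatenate([text_tokens + CODEBOOK_SIZE, visual_tokens])

VOCAB = CODEBOOK_SIZE + 10  # joint visual + text vocabulary

def next_token_logits(prefix: np.ndarray) -> np.ndarray:
    """Stub for p(x_t | x_<t); a trained transformer goes here."""
    return np.zeros(VOCAB)

print("visual tokens:", visual_tokens)
print("next token (greedy):", next_token_logits(sequence).argmax())
```

Under this factorization, image generation and image understanding reduce to the same next-token objective over one vocabulary, which is the property the survey's Autoregressive category exploits; the Discrete Diffusion alternative instead denoises all positions of the discrete sequence in parallel.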
Submission Type: Long submission (more than 12 pages of main content)
Assigned Action Editor: ~Yu-Xiong_Wang1
Submission Number: 7328