TL;DR: This work systematically explores the use of language models for image generation, analyzing their optimization behavior and investigating tokenization, sampling strategies, and model scalability to achieve strong performance.
Abstract: The success of large language models (LLMs) in text generation has inspired their application to image generation. However, existing methods either rely on specialized designs with inductive biases or adopt LLMs without fully exploring their potential in vision tasks. In this work, we systematically investigate the design space of LLMs for image generation and demonstrate that LLMs can achieve near state-of-the-art performance without domain-specific designs, simply by making proper choices in tokenization methods, modeling approaches, scan patterns, vocabulary design, and sampling strategies. We further analyze autoregressive models' learning and scaling behavior, revealing how larger models capture more useful information than smaller ones. Additionally, we explore the inherent differences between the text and image modalities, highlighting the potential of LLMs across domains. This exploration provides valuable insights to inspire more effective designs when applying LLMs to other domains. With extensive experiments, our proposed model, **ELM**, achieves an FID of 1.54 on 256$\times$256 ImageNet and an FID of 3.29 on 512$\times$512 ImageNet, demonstrating the powerful generative potential of LLMs in vision tasks.
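To make the recipe described in the abstract concrete, here is a minimal sketch of the general pipeline: images are encoded into discrete tokens, flattened by a scan pattern, and modeled by a plain decoder-only transformer trained with next-token prediction. All names, sizes, and the stand-in tokenizer are illustrative assumptions, not the paper's actual implementation (see the linked repository for that).

```python
# Minimal sketch (PyTorch): images as discrete token sequences modeled by a
# vanilla causal transformer. Names and hyperparameters are illustrative
# stand-ins, not the paper's code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class ToyARImageModel(nn.Module):
    """Plain causal transformer over a vocabulary of VQ image codes."""
    def __init__(self, vocab_size=1024, max_len=256, dim=256):
        super().__init__()
        self.tok_emb = nn.Embedding(vocab_size, dim)
        self.pos_emb = nn.Embedding(max_len, dim)
        layer = nn.TransformerEncoderLayer(dim, nhead=8, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, num_layers=4)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens):                      # tokens: [B, T] int ids
        T = tokens.size(1)
        pos = torch.arange(T, device=tokens.device)
        x = self.tok_emb(tokens) + self.pos_emb(pos)
        causal = torch.triu(                        # forbid attending ahead
            torch.full((T, T), float("-inf"), device=tokens.device), diagonal=1)
        return self.head(self.blocks(x, mask=causal))  # [B, T, vocab] logits

# Training is ordinary next-token prediction on flattened (e.g. raster-scan)
# image tokens; random ids here stand in for a VQ tokenizer's output.
tokens = torch.randint(0, 1024, (2, 256))
model = ToyARImageModel()
logits = model(tokens[:, :-1])
loss = F.cross_entropy(logits.reshape(-1, 1024), tokens[:, 1:].reshape(-1))
```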
Lay Summary: Large language models (LLMs) have achieved remarkable success in text generation, motivating researchers to explore their potential for image generation. However, most existing approaches either rely on custom model designs with vision-specific biases or apply LLMs directly without fully exploring their potential in vision tasks.
In this work, we systematically examine how best to repurpose LLMs for image generation by investigating fundamental design choices, including tokenization, modeling strategies, scan patterns, vocabulary construction, and sampling techniques. Through comprehensive analysis and experiments, we show that LLMs, without any domain-specific architectural changes, can achieve near state-of-the-art image generation quality when these components are carefully selected. A short sketch of scan patterns follows.
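On scan patterns specifically, the sketch below shows two common ways to flatten a 2D grid of image tokens into the 1D sequence an LLM consumes. Raster and zigzag orders are standard examples; the paper's exact set of patterns is not reproduced here.

```python
# Illustrative sketch: two scan patterns for serializing a 2D token grid.
import numpy as np

def raster_scan(grid):
    """Row-major, left-to-right ordering."""
    return grid.reshape(-1)

def zigzag_scan(grid):
    """Alternate left-to-right / right-to-left per row (boustrophedon)."""
    rows = [row if i % 2 == 0 else row[::-1] for i, row in enumerate(grid)]
    return np.concatenate(rows)

grid = np.arange(16).reshape(4, 4)   # stand-in for a 4x4 grid of VQ token ids
print(raster_scan(grid))             # [ 0  1  2  3  4  5 ... 15]
print(zigzag_scan(grid))             # [ 0  1  2  3  7  6  5  4  8 ... ]
```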
We also study how model size affects learning in this setting, revealing that larger LLMs capture more useful visual patterns and require less randomness during sampling. Additionally, we examine the intrinsic differences between the language and image modalities, providing practical insights for adapting autoregressive language models to other non-text domains.
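The sampling claim above can be illustrated with the standard temperature / top-k decoding procedure: temperature scales how much randomness is injected (the summary's observation is that larger models tolerate lower temperatures). This sketch is generic; the specific values are assumptions, not the paper's tuned settings.

```python
# Hedged sketch: temperature + top-k sampling over next-token logits.
import torch

def sample_next_token(logits, temperature=1.0, top_k=300):
    """Sample one image-token id from a [vocab_size] logit vector."""
    logits = logits / max(temperature, 1e-5)      # scale injected randomness
    if top_k is not None:
        v, _ = torch.topk(logits, top_k)
        logits[logits < v[-1]] = float("-inf")    # keep only top-k candidates
    probs = torch.softmax(logits, dim=-1)
    return torch.multinomial(probs, num_samples=1)

logits = torch.randn(1024)                        # stand-in for model output
tok = sample_next_token(logits, temperature=0.9, top_k=300)
```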
Our work demonstrates that general-purpose LLMs, with thoughtful design, can serve as powerful image generators, bridging modality boundaries and informing future multi-domain generative model research.
Link To Code: https://github.com/Pepper-lll/LMforImageGeneration
Primary Area: Applications->Computer Vision
Keywords: Image generation, Large language model, Generative model
Submission Number: 15791