Lumina-mGPT: Illuminate Flexible Photorealistic Text-to-Image Generation with Multimodal Generative Pretraining
We present Lumina-mGPT, a family of multimodal autoregressive models capable of various vision and language tasks, particularly excelling in generating flexible photorealistic images from text descriptions. By initializing from multimodal Generative PreTraining (mGPT), Lumina-mGPT demonstrates that decoder-only Autoregressive (AR) models can achieve image generation performance comparable to modern diffusion models with high efficiency through Flexible Progressive Supervised Finetuning (FP-SFT). Equipped with our proposed Unambiguous image Representation (Uni-Rep), Lumina-mGPT can flexibly generate high-quality images of varying aspect ratios. Building on these strong image generation capabilities, we further explore Omnipotent Supervised Finetuning (Omni-SFT), an initial attempt to elevate Lumina-mGPT into a unified multimodal generalist. The resulting model demonstrates versatile multimodal capabilities, including visual generation tasks such as text-to-image/multiview generation and controllable generation, visual recognition tasks such as segmentation and depth estimation, and vision-language tasks such as multi-turn visual question answering, highlighting the promising potential of this direction. We release all code and checkpoints, hoping to facilitate progress toward building artificial general intelligence.