[Proposal-ML] Enhancing Large Multi-Modal Auto-Regressive Models with Condition Contrastive Alignment
Keywords: image generation, video generation, alignment
TL;DR: Fine-tuning Emu3 with Condition Contrastive Alignment (CCA) to reduce its reliance on classifier-free guidance in image and video generation
Abstract: Auto-regressive (AR) models have advanced rapidly in multi-modal generation, enabling coherent text, image, and video generation within a single framework. In practice, however, they still depend on classifier-free guidance (CFG) to reach high output quality, especially in image generation. CFG is effective, but it requires an additional unconditional forward pass at every sampling step, which adds substantial computational overhead and departs from the simplicity of end-to-end auto-regressive generation. In this proposal, we explore whether Condition Contrastive Alignment (CCA), a recently proposed method that aligns AR models with target distributions through contrastive learning, can remove this reliance on CFG in Emu3, a state-of-the-art multi-modal AR model. We hypothesize that, after CCA fine-tuning, Emu3 can achieve comparable or superior output quality without CFG, reducing computational cost and improving generation efficiency. Our approach is to fine-tune Emu3 with CCA on multi-modal data and to evaluate it comprehensively on image and video generation benchmarks. This research will test CCA’s applicability to large AR models and could advance the field toward more efficient, unified multi-modal generation frameworks.
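Illustration (not part of the proposal): a minimal sketch of where the CFG overhead mentioned in the abstract comes from. It assumes one common convention for combining logits, `uncond + scale * (cond - uncond)`; other papers parameterize the scale differently. The model `toy_logits` is a hypothetical stand-in that only counts forward passes; it is not Emu3 and does not implement CCA.

```python
def toy_logits(prefix, condition):
    """Hypothetical stand-in for ONE forward pass of an AR model.

    Returns fake next-token logits over a 4-token vocabulary and
    increments a counter so the number of passes can be compared.
    """
    toy_logits.calls += 1
    vocab = 4
    logits = [0.1 * ((len(prefix) + t) % vocab) for t in range(vocab)]
    if condition is not None:
        # The condition nudges one token, mimicking a conditional model.
        logits[condition % vocab] += 1.0
    return logits


toy_logits.calls = 0


def cfg_logits(prefix, condition, scale):
    """Combine conditional and unconditional logits (assumed convention)."""
    cond = toy_logits(prefix, condition)   # pass 1: with the condition
    uncond = toy_logits(prefix, None)      # pass 2: condition dropped
    return [u + scale * (c - u) for c, u in zip(cond, uncond)]


def greedy_generate(condition, steps, scale=None):
    """Greedy decoding; returns the tokens and the number of forward passes."""
    toy_logits.calls = 0
    prefix = []
    for _ in range(steps):
        if scale is None:
            logits = toy_logits(prefix, condition)          # guidance-free
        else:
            logits = cfg_logits(prefix, condition, scale)   # CFG sampling
        prefix.append(max(range(len(logits)), key=logits.__getitem__))
    return prefix, toy_logits.calls


if __name__ == "__main__":
    _, plain_passes = greedy_generate(condition=2, steps=8)
    _, cfg_passes = greedy_generate(condition=2, steps=8, scale=3.0)
    print("forward passes without CFG:", plain_passes)
    print("forward passes with CFG:", cfg_passes)
```

Real implementations usually batch the two passes together, so the cost shows up as a doubled batch rather than a doubled step count. Either way, a model that reaches the same quality in the guidance-free branch would avoid that extra computation, which is the saving the proposal targets.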
Submission Number: 48