Abstract: Generative pre-training has significantly advanced natural language understanding. Building on this success, recent research has begun to explore Large Vision Models (LVMs) that leverage large-scale pre-training on visual sequences, where jointly modeling image token sequences within single images and across sets of images is of key importance. This paper shows that sequential modeling within single images and across multiple images can be efficiently and effectively decoupled. We introduce a two-stage learning pipeline that starts with single-image pre-training and then fine-tunes on long image/video sequences. We term this method Large Vision Model Lite (LVM-Lite). Extensive experiments showcase the impressive performance of LVM-Lite across various generative and discriminative benchmarks, comparable to that of specifically trained models, without the need for task-specific training. Importantly, LVM-Lite accelerates training by up to $2.7\times$ and demonstrates strong scalability.
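To make the decoupling concrete, below is a minimal, hypothetical sketch of such a two-stage pipeline in PyTorch. Everything here (the toy `TinyVisualTokenModel`, the synthetic token batches, the `train_stage` helper, and all hyperparameters) is an illustrative assumption, not LVM-Lite's actual implementation: stage one pre-trains on short token sequences from single images, and stage two fine-tunes the same weights on longer sequences spanning multiple images or video frames.

```python
# Hypothetical sketch of the decoupled two-stage pipeline described in the
# abstract. All names and hyperparameters are illustrative assumptions,
# not the authors' implementation.
import torch
import torch.nn as nn


class TinyVisualTokenModel(nn.Module):
    """Toy causal transformer over discrete visual tokens
    (positional encodings omitted for brevity)."""

    def __init__(self, vocab_size=8192, dim=256, n_heads=4, n_layers=2):
        super().__init__()
        self.embed = nn.Embedding(vocab_size, dim)
        layer = nn.TransformerEncoderLayer(dim, n_heads, batch_first=True)
        self.blocks = nn.TransformerEncoder(layer, n_layers)
        self.head = nn.Linear(dim, vocab_size)

    def forward(self, tokens):
        causal = nn.Transformer.generate_square_subsequent_mask(tokens.size(1))
        x = self.blocks(self.embed(tokens), mask=causal, is_causal=True)
        return self.head(x)


def train_stage(model, batches, steps, lr):
    """One training stage: next-token prediction over visual token sequences."""
    opt = torch.optim.AdamW(model.parameters(), lr=lr)
    loss_fn = nn.CrossEntropyLoss()
    for _, tokens in zip(range(steps), batches):
        logits = model(tokens[:, :-1])
        loss = loss_fn(logits.reshape(-1, logits.size(-1)),
                       tokens[:, 1:].reshape(-1))
        opt.zero_grad()
        loss.backward()
        opt.step()


model = TinyVisualTokenModel()

# Stage 1: pre-train on short token sequences from single images
# (random tokens stand in for a real visual tokenizer's output).
single_image_batches = (torch.randint(0, 8192, (4, 256)) for _ in iter(int, 1))
train_stage(model, single_image_batches, steps=10, lr=3e-4)

# Stage 2: fine-tune the same weights on long sequences that span
# multiple images or video frames.
multi_image_batches = (torch.randint(0, 8192, (1, 1024)) for _ in iter(int, 1))
train_stage(model, multi_image_batches, steps=10, lr=1e-4)
```

The point of the sketch is only that the two stages share one model and one next-token objective; the training speedup claimed in the abstract plausibly comes from spending most of the compute on the cheap, short single-image sequences before ever touching long multi-image ones.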
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Gabriel_Loaiza-Ganem1
Submission Number: 3359