Abstract: As large language models increasingly shape modern machine intelligence, their extension into the visual domain has accelerated the development of Large Vision-Language Models (LVLMs). These systems have become central to multimodal applications. However, their growing heterogeneity underscores the need for a systematic, design-centric review to support next-generation models. Existing surveys often categorize LVLMs by a single criterion or offer coarse-grained architectural overviews. While this gives a quick snapshot of current advancements, it frequently overlooks unconventional yet promising designs. Moreover, these overviews underrepresent empirical findings across modules and offer limited guidance for selecting architectures aligned with specific requirements. To complement these perspectives, we organize a broad spectrum of studies to provide a comprehensive overview of the LVLM design space. We summarize design variants at a fine-grained modular level, consolidate ablation results and empirical evidence, and discuss the advantages and trade-offs associated with key design choices. Concretely, we conceptualize LVLMs as a three-stage pipeline—Representation, Modeling, and Generation—where each stage comprises a set of modules corresponding to distinct design dimensions. These dimensions govern how LVLMs perceive visual and linguistic inputs, integrate multimodal information, and generate coherent and controllable outputs. Overall, this survey organizes the expanding LVLM landscape into a coherent and navigable form that balances breadth and depth. By consolidating architectural choices with empirical insights, it supports principled system design and encourages future advances.
DOI: 10.36227/techrxiv.176620829.92520878/v1