Advancing Vision-Language Models with Generative AI

Published: 25 Jan 2025, Last Modified: 08 Feb 2025 · OpenReview Archive Direct Upload · CC BY 4.0
Abstract: Generative AI within large vision-language models (LVLMs) has transformed multimodal learning, enabling machines to understand visual content and to generate it from textual descriptions with high fidelity. This paper surveys state-of-the-art advancements in LVLMs, focusing on prominent models: CLIP for cross-modal retrieval, Flamingo for few-shot video understanding, BLIP for self-supervised vision-language pretraining, CoCa for integrating contrastive and generative learning, and X-CLIP for enhancing video-text retrieval. Together, these models demonstrate the flexibility and scalability of LVLMs across a variety of applications. Through an evaluation based on metrics such as image generation quality, perceptual loss, and CLIP score, we provide insights into their capabilities, limitations, and opportunities for future improvement. As generative AI continues to evolve, this analysis underscores the importance of developing scalable, efficient multimodal models capable of addressing real-world challenges with minimal fine-tuning.
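
For context on one of the evaluation metrics named above: CLIP score gauges how well a generated image matches its conditioning text by embedding both with CLIP and taking the cosine similarity of the projected features. Below is a minimal sketch of such an evaluation, assuming the Hugging Face transformers implementation of CLIP; the checkpoint name, the [0, 100] scaling convention, and the file path are illustrative assumptions, not details from the paper.

```python
# Minimal CLIP-score-style evaluation sketch (illustrative, not the paper's code).
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

model = CLIPModel.from_pretrained("openai/clip-vit-base-patch32")
processor = CLIPProcessor.from_pretrained("openai/clip-vit-base-patch32")
model.eval()

def clip_score(image: Image.Image, caption: str) -> float:
    """Cosine similarity between CLIP image/text embeddings, scaled to [0, 100]."""
    inputs = processor(text=[caption], images=image, return_tensors="pt", padding=True)
    with torch.no_grad():
        out = model(**inputs)
    # `image_embeds` and `text_embeds` are L2-normalized by CLIPModel,
    # so their dot product is the cosine similarity.
    cos = (out.image_embeds @ out.text_embeds.T).item()
    # Common convention (e.g. torchmetrics' CLIPScore): clamp at 0, scale by 100.
    return max(0.0, 100.0 * cos)

# Hypothetical usage on a generated image and its conditioning prompt:
# print(clip_score(Image.open("generated.png"), "a photo of a red bicycle"))
```

Higher scores indicate closer image-text agreement; because the metric relies on CLIP's own embedding space, it complements rather than replaces perceptual-loss and image-quality measures.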