Enhancing Vision-Language Models for Global Cultural Understanding through Semantic Expansion and Diversity Reranking

Published: 06 May 2025, Last Modified: 29 May 2025, Venue: VLMs4All 2025 Poster, License: CC BY 4.0
Keywords: Vision-Language Models, Semantic Expansion, Diversity Optimization, Cultural Visual Grounding, External Knowledge Bases
Abstract: Current Vision-Language Models (VLMs) often lack sufficient understanding and representation of cultural diversity in globalized scenarios. To bridge this gap, we propose Semantic Expansion and Diversity Optimization (SEDO), a method that leverages external knowledge bases for semantic enrichment, employs diversity-aware reranking, and uses the Segment Anything Model (SAM) for precise localization refinement. On the GlobalRG benchmark, SEDO significantly improves retrieval relevance (88%) and cultural diversity (79.75%), and achieves an Intersection over Union (IoU) of 0.7012 in visual cultural grounding tasks. Comprehensive experiments confirm the effectiveness of each proposed component, underscoring robust performance and generalization across diverse cultural contexts. Our work provides valuable guidance toward more inclusive and culturally aware vision-language models.
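For readers unfamiliar with the grounding metric quoted in the abstract, the following is a minimal sketch of how Intersection over Union is commonly computed. It assumes axis-aligned bounding boxes in (x_min, y_min, x_max, y_max) form; the abstract does not say whether the reported IoU is computed over boxes or masks, and the function name `box_iou` is hypothetical, not taken from the paper.

```python
def box_iou(a, b):
    """IoU of two axis-aligned boxes, each given as (x_min, y_min, x_max, y_max).

    Illustrative only: the box format is an assumption, not the paper's spec.
    """
    # Corners of the overlapping rectangle (if any).
    ix_min, iy_min = max(a[0], b[0]), max(a[1], b[1])
    ix_max, iy_max = min(a[2], b[2]), min(a[3], b[3])

    # Clamp to zero so disjoint boxes yield no intersection area.
    inter = max(0.0, ix_max - ix_min) * max(0.0, iy_max - iy_min)

    area_a = (a[2] - a[0]) * (a[3] - a[1])
    area_b = (b[2] - b[0]) * (b[3] - b[1])
    union = area_a + area_b - inter

    # Guard against degenerate (zero-area) inputs.
    return inter / union if union > 0 else 0.0
```

For example, two 2x2 boxes offset by one unit in each direction overlap in a 1x1 square, so their IoU is 1 / (4 + 4 - 1) = 1/7. A benchmark-level score such as the one reported would typically be an average of per-example values like this.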
Submission Number: 5