Keywords: Visual-Language Models, Cultural Awareness, Cultural Domains, Prompt Engineering, Few-Shot Prompting, Multimodal Prompting
Abstract: Visual-Language Models (VLMs) have demonstrated significant capabilities in multimodal understanding, yet their awareness of diverse cultural contexts remains a critical area for evaluation. Standard benchmarks often fall short in assessing culturally specific knowledge. This work investigates the impact of prompt engineering strategies on VLM performance, focusing on techniques for evaluating nuanced understanding, in particular cultural awareness. We compare zero-shot performance with few-shot prompting using both text-only and multimodal (image-text) examples on the CulturalVQA benchmark. Our findings indicate that few-shot prompting leads to a notable improvement over the zero-shot baseline: text-based few-shot prompts show a clear increase in performance, while multimodal few-shot prompts that incorporate both text and images achieve the best results. These outcomes underscore the value of few-shot prompting, especially with multimodal examples, in enhancing VLM performance on tasks requiring specific contextual understanding, and suggest that prompt engineering is a valuable tool for probing and improving the capabilities of VLMs in specialized domains, including the cultural contexts addressed by benchmarks like CulturalVQA.
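The zero-shot versus few-shot comparison described in the abstract can be sketched as prompt construction. This is a minimal illustration only, not the authors' implementation: the message schema follows common chat-completion-style VLM APIs, and the helper names, example Q/A pairs, and URLs are all hypothetical.

```python
# Sketch of zero-shot vs. few-shot (text-only or multimodal) VQA prompts.
# Message layout mimics chat-style VLM APIs; all names/URLs are illustrative.

def zero_shot_prompt(question: str, image_url: str) -> list:
    """A single user turn containing the query image and question."""
    return [{
        "role": "user",
        "content": [
            {"type": "image_url", "image_url": {"url": image_url}},
            {"type": "text", "text": question},
        ],
    }]

def few_shot_prompt(examples: list, question: str, image_url: str,
                    multimodal: bool = True) -> list:
    """Prepend worked examples; include their images only when `multimodal`
    is True (the text-only condition drops example images but keeps Q/A)."""
    messages = []
    for ex in examples:
        content = []
        if multimodal and ex.get("image_url"):
            content.append({"type": "image_url",
                            "image_url": {"url": ex["image_url"]}})
        content.append({"type": "text", "text": ex["question"]})
        messages.append({"role": "user", "content": content})
        messages.append({"role": "assistant", "content": ex["answer"]})
    # The actual query is always posed with its image attached.
    messages += zero_shot_prompt(question, image_url)
    return messages

examples = [{"question": "What festival is shown?", "answer": "Diwali",
             "image_url": "https://example.com/ex1.jpg"}]
msgs = few_shot_prompt(examples, "What dish is being prepared?",
                       "https://example.com/query.jpg")
```

In this framing, the three conditions compared in the paper differ only in the prompt: `zero_shot_prompt(...)`, `few_shot_prompt(..., multimodal=False)`, and `few_shot_prompt(..., multimodal=True)`, with the model and query held fixed.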
Submission Number: 8