Keywords: Vision-Language Models, Explainability, Multimodal Reasoning, Prompt Engineering, Symbolic Reasoning, Lateral Thinking, Evaluation Framework
Abstract: Vision-Language Models (VLMs) demonstrate impressive capabilities on multimodal tasks, yet their reasoning processes remain opaque, particularly for complex lateral thinking challenges. While recent work has shown that VLMs struggle significantly with rebus puzzle solving, the underlying reasoning processes and failure modes remain largely unexplored. We address this gap through a comprehensive explainability analysis that moves beyond performance metrics to understand how VLMs approach these puzzles and why they fail. Our study contributes a systematically annotated dataset of 221 rebus puzzles spanning six cognitive categories, paired with an evaluation framework that separates reasoning quality from answer correctness. We investigate three distinct prompting strategies designed to elicit different types of explanatory reasoning, revealing critical insights into VLM cognitive processes. Our findings demonstrate that explanation quality varies dramatically across puzzle categories: models show systematic reasoning strengths in visual composition while exhibiting fundamental limitations in absence reasoning and cultural symbolism. We also find that prompting strategy substantially influences both reasoning transparency and problem-solving effectiveness, establishing explainability as an integral component of model performance rather than a post-hoc consideration.
Paper Published: No
Paper Category: Short Paper
Demography: Women in ML/NLP, Others
Demography Other: First-generation student and independent researcher
Academic: Others
Academic Other: Graduated with a Master's last semester
Submission Number: 36