Vision Language Models See What You Want but not What You See

Published: 06 Mar 2025, Last Modified: 05 May 2025
Venue: ICLR 2025 Bi-Align Workshop Poster
License: CC BY 4.0
Keywords: vision language models; perspective-taking; intentionality understanding; theory-of-mind; cognitive AI
TL;DR: Vision Language Models See What You Want but not What You See
Abstract: Knowing others' intentions and taking others' perspectives are two core components of human intelligence, commonly regarded as instantiations of theory-of-mind. Endowing machines with these abilities is an important step toward building human-level artificial intelligence. Here, to investigate intentionality understanding and level-2 perspective-taking in Vision Language Models (VLMs), we constructed IntentBench and PerspectBench, which together contain over 300 cognitive experiments grounded in real-world scenarios and classic cognitive tasks. We found that VLMs achieve high performance on intentionality understanding but low performance on level-2 perspective-taking. This suggests a potential dissociation between simulation-based and theory-based theory-of-mind abilities in VLMs, raising the concern that they are not capable of using model-based reasoning to infer others' mental states.
Submission Type: Long Paper (9 Pages)
Archival Option: This is an archival submission
Presentation Venue Preference: ICLR 2025
Submission Number: 73