Abstract: Humans can infer the affordances of objects by extracting the relevant contextual preconditions for each scenario. For instance, when presented with an image of a shattered cup, we can deduce that this condition makes it unsuitable for drinking. The use of commonsense preconditions for reasoning has been studied extensively in NLP, where models explicitly acquire contextual preconditions in textual form. Nonetheless, it remains unclear whether state-of-the-art visual language models (VLMs) can effectively extract such preconditions and use them to infer object affordances. In this work, dubbed PRISM, we introduce two tasks: preconditioned visual language inference (PVLI) and rationalization (PVLR). To address these tasks, we propose three strategies for acquiring weak supervision signals and create a human-validated evaluation resource through crowd-sourcing. Our findings expose the limitations of current state-of-the-art VLMs on these tasks, and we chart a roadmap for addressing the challenges that lie ahead.
Paper Type: long
Research Area: Resources and Evaluation
Contribution Types: Model analysis & interpretability, NLP engineering experiment, Data resources
Languages Studied: English