Cognitive Chain-of-Thought: Structured Multimodal Reasoning about Social Situations

ACL ARR 2025 July Submission 129 Authors

23 Jul 2025 (modified: 02 Sept 2025) | ACL ARR 2025 July Submission | CC BY 4.0

Abstract: Chain-of-Thought (CoT) prompting helps models think step by step. But what happens when they must see, understand, and judge all at once? In visual tasks grounded in social context, where bridging perception with norm-grounded judgments is essential, flat CoT often breaks down. We introduce Cognitive Chain-of-Thought (CoCoT), a prompting strategy that scaffolds vision-language model (VLM) reasoning through three cognitively inspired stages: perception, situation, and norm. Our experiments show that, across multiple multimodal benchmarks, including intent disambiguation, commonsense reasoning, and safety, CoCoT consistently outperforms CoT and direct prompting (+8% on average). Our findings demonstrate that cognitively grounded reasoning stages enhance interpretability and social awareness in VLMs, paving the way for safer and more reliable multimodal systems.
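
To make the three-stage structure concrete, the following is a minimal sketch of how a staged prompt could be assembled. Only the stage names (perception, situation, norm) come from the abstract; the function name `build_cocot_prompt`, the per-stage instruction wording, and the answer-option layout are hypothetical illustrations and should not be read as the authors' actual prompt.

```python
from typing import Sequence

# Stage names come from the abstract; the instruction wording for each stage
# is a hypothetical illustration, not the authors' prompt.
STAGES = (
    ("Perception", "Describe what is directly observable in the image."),
    ("Situation", "Interpret the social context linking the observed elements."),
    ("Norm", "Apply relevant social norms to choose the most appropriate answer."),
)


def build_cocot_prompt(question: str, options: Sequence[str] = ()) -> str:
    """Assemble a three-stage text prompt to send alongside an image."""
    lines = [f"Question: {question}"]
    # Label answer options A, B, C, ... (assumed multiple-choice layout).
    for label, option in zip("ABCDEFGHIJKLMNOPQRSTUVWXYZ", options):
        lines.append(f"{label}. {option}")
    lines.append("Reason through the following stages in order:")
    for index, (name, instruction) in enumerate(STAGES, start=1):
        lines.append(f"{index}. {name}: {instruction}")
    lines.append("Final answer:")
    return "\n".join(lines)


if __name__ == "__main__":
    print(build_cocot_prompt(
        "What does the person intend?",
        ["Asking for help", "Waving goodbye"],
    ))
```

The resulting string would be passed to a VLM together with the image. A flat CoT baseline would, under the same assumptions, replace the three numbered stages with a single generic step-by-step instruction.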

Paper Type: Short

Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond

Research Area Keywords: vision language navigation, cross modal content generation, multimodality

Contribution Types: Model analysis & interpretability, NLP engineering experiment

Languages Studied: English

Submission Number: 129
