Abstract: For AI systems to be safely and reliably grounded in the real world, they should be capable of physical commonsense reasoning, i.e., they should understand the physical properties, affordances, and maneuverability of objects in everyday life. Physical commonsense reasoning is essentially a multisensory task, as the physical properties of objects are manifested through multiple perception modalities, including both the visual and the auditory. In this study, we construct two new benchmarks, PACS-Reason and PACS-Reason+, for explainable physical audiovisual commonsense reasoning (EPACS), in which each datapoint is accompanied by a gold-standard detailed rationale (an intermediate reasoning path) that explains the answer selection. Moreover, we present PAVC-Reasoner, a multimodal large language model (LLM) designed to reason about physical commonsense attributes. The model aligns the other modalities with the language modality by integrating three different perceivers for cross-modal pretraining and instruction finetuning at multiple granularities. It uses an LLM as a cognitive engine to process multimodal inputs and to output convincing intermediate reasoning paths that justify the inferred answers. Extensive experiments demonstrate the effectiveness and superiority of PAVC-Reasoner as a baseline model for studying EPACS. Most notably, PAVC-Reasoner is capable of producing strong, interpretable, explicit reasoning paths, marking a significant stride towards real-world physical commonsense reasoning.
Primary Subject Area: [Content] Media Interpretation
Secondary Subject Area: [Engagement] Summarization, Analytics, and Storytelling
Relevance To Conference: This work advances multimodal processing by introducing PAVC-Reasoner, a pioneering multimodal foundation model for audiovisual physical commonsense reasoning, and demonstrating its efficacy. It addresses a critical gap in current AI systems' ability to understand the physical world by integrating and reasoning across visual and auditory modalities. Through the creation of PACS-Reason and PACS-Reason+, two benchmarks with detailed rationales for every video-question pair, this study provides a comprehensive framework for evaluating AI systems' physical commonsense reasoning capabilities. The PAVC-Reasoner model, whose architecture combines three perception branches with an LLM for handling multimodal inputs, exemplifies the potential of multimodal AI systems. By aligning the visual, auditory, and language modalities through cross-modal pretraining and multimodal instruction tuning, PAVC-Reasoner represents a significant step toward enabling AI systems to perceive and interpret complex physical environments in a manner akin to human sensory integration. Such advances in multimodal processing are essential for developing AI applications that can operate safely and effectively in the physical world, and they mark a critical step towards robust and intelligent multimodal AI systems.
Supplementary Material: zip
Submission Number: 2407