Keywords: Neuro-symbolic concepts, logic and formal reasoning, theory-of-mind
Abstract: Theory of Mind (ToM) enables individuals to understand and predict the thoughts, emotions, and intentions of others. To replicate this cognitive ability in machines, especially in complex multimodal environments, recent advances combine Bayesian state inference with deep learning models to estimate mental states, where the Bayesian model handles state transitions and a language model (LM) estimates the likelihood of intermediate states. However, while post-training an LM to specialise in ToM tasks improves performance, the computational cost increases as the LM scales, which in practice limits model size to 7 billion parameters. Despite this post-training process, smaller LMs still struggle with the physical and mental modelling demands of ToM due to their limited world knowledge and reasoning capacity. To address this, we propose a scalable solution that leverages the strengths of larger LMs (up to 70 and even 405 billion parameters), including their vast world knowledge and atomic-level reasoning capabilities, without increasing post-training resource requirements. Our method transfers ToM-specific behaviours from a post-trained small LM to guide the latent reasoning of a larger LM at test time. This weak-to-strong control mechanism enables the larger LM to improve Bayesian likelihood estimation at each inference step, harnessing its reasoning power in ToM scenarios while reducing the need for additional training resources. Extensive experiments demonstrate the effectiveness of our scaled approach: it infers human mental states more accurately in complex, interactive environments, outperforming the state-of-the-art solution by $\sim 4.6\%$ across multiple tasks on the multimodal ToM benchmark, including unseen scenarios. Our code and datasets are available at https://anonymous.4open.science/r/scale-bayesian-tom-248B
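The sketch below illustrates one plausible realisation of the two components the abstract describes: a Bayesian filtering step whose per-state likelihoods are scored by a large LM steered weak-to-strong by a post-trained small LM. This is a minimal sketch under assumptions, not the paper's exact method; in particular, the proxy-tuning-style logit offset and the `.logits(context)` interface, along with all model handles, are hypothetical placeholders.

```python
import math
import torch

# Minimal sketch, assuming a proxy-tuning-style logit offset for the
# weak-to-strong control described in the abstract; the abstract does not
# commit to this exact mechanism.

def guided_log_likelihood(large_lm, small_tuned, small_base, tokens, alpha=1.0):
    """Score one candidate mental state (tokenised as `tokens`) with the
    large LM, shifted by the ToM-specific behavioural delta of the
    post-trained small LM (small_tuned minus its base, small_base)."""
    log_p = 0.0
    for t in range(1, len(tokens)):
        ctx = tokens[:t]
        z_large = large_lm.logits(ctx)                 # vocab-sized logit tensor
        z_delta = small_tuned.logits(ctx) - small_base.logits(ctx)
        log_probs = torch.log_softmax(z_large + alpha * z_delta, dim=-1)
        log_p += log_probs[tokens[t]].item()
    return log_p

def bayesian_step(prior, hypotheses, transition, log_likelihood):
    """One Bayesian filtering step over candidate mental states:
    posterior(s') ∝ p(obs | s') * sum_s p(s' | s) * prior(s),
    where p(obs | s') comes from the guided LM scorer above."""
    scores = {}
    for s_next in hypotheses:
        pred = sum(transition(s_next, s) * p for s, p in prior.items())
        scores[s_next] = log_likelihood(s_next) + math.log(pred + 1e-12)
    m = max(scores.values())                           # stabilise before exp
    post = {s: math.exp(v - m) for s, v in scores.items()}
    z = sum(post.values())
    return {s: p / z for s, p in post.items()}
```

In this reading, the Bayesian model supplies the transition structure while the guided likelihood injects the large LM's world knowledge at each inference step, so no additional post-training of the large model is required.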
Primary Area: neurosymbolic & hybrid AI systems (physics-informed, logic & formal reasoning, etc.)
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 3338