FOCOT: A Unified Framework of FOT for Multimodal VOCOT

ACL ARR 2026 March Submission21 Authors

02 Mar 2026 (modified: 07 Jun 2026), ACL ARR 2026 March Submission, CC BY 4.0
Keywords: Multimodal Reasoning, Chain-of-Thought (CoT), Forest-of-Thought (FOT), Visual Grounding, Contrastive Decoding
Abstract: Multimodal Chain-of-Thought (CoT) improves interpretability and spatial grounding by fusing vision and language. However, it still reasons along a single trajectory and ignores complementary information from other perspectives. Inspired by the success of Forest-of-Thought (FOT) in long-text reasoning, we reframe object-aware CoT as a multi-branch, tree-structured exploration that lets the model traverse diverse reasoning paths in complex vision-language scenarios and converge toward an optimal reasoning trajectory. Specifically, we expand textual and visual cues into distinct reasoning branches that capture complementary perspectives, and we introduce node-level collaboration to enable cross-branch evidence exchange and soft aggregation of partial signals. To further curb hallucinations, we introduce node-level contrastive decoding, which extends discriminative decoding to operate over the entire reasoning forest. Extensive experiments show that our framework, Forest-of-Chain-of-Thought (FOCOT), significantly outperforms VOCOT, standard FOT, and other strong CoT baselines across multiple benchmarks. Ablation studies confirm the effectiveness of each component. The code is available in the supplementary material.
Paper Type: Long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: multimodality, vision question answering, chain-of-thought, contrastive learning, logical reasoning, representation learning
Contribution Types: Model analysis & interpretability
Languages Studied: English
Submission Number: 21
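
Illustrative sketch (not the authors' implementation): the abstract names two mechanisms, contrastive decoding applied per reasoning branch and soft aggregation of partial signals across branches. The toy Python below shows one way these could fit together, assuming a common contrastive-decoding form, (1 + alpha) * grounded - alpha * distorted, and a weighted average as the soft aggregation. Every name here (`contrastive_logits`, `soft_aggregate`, `forest_decode`, the `weight` field, `alpha`) is hypothetical and chosen for illustration only; the paper's actual node-level formulation may differ.

```python
import math


def softmax(logits):
    """Numerically stable softmax over a list of floats."""
    m = max(logits)
    exps = [math.exp(x - m) for x in logits]
    total = sum(exps)
    return [e / total for e in exps]


def contrastive_logits(grounded, distorted, alpha=1.0):
    """Assumed contrastive form: boost what the grounded input supports and
    penalize what a distorted (e.g. visually degraded) input also predicts."""
    return [(1 + alpha) * g - alpha * d for g, d in zip(grounded, distorted)]


def soft_aggregate(branch_probs, branch_weights):
    """Weighted average of per-branch distributions (assumed aggregation rule)."""
    total_w = sum(branch_weights)
    n = len(branch_probs[0])
    return [
        sum(w * p[i] for w, p in zip(branch_weights, branch_probs)) / total_w
        for i in range(n)
    ]


def forest_decode(branches, alpha=1.0):
    """Apply contrastive decoding in each branch, then aggregate softly.

    Each branch is a dict with hypothetical keys:
      'grounded'  - logits over candidate answers given the real inputs
      'distorted' - logits over the same candidates given distorted inputs
      'weight'    - non-negative confidence assigned to the branch
    Returns (index of the best candidate, aggregated distribution).
    """
    probs = [
        softmax(contrastive_logits(b["grounded"], b["distorted"], alpha))
        for b in branches
    ]
    weights = [b["weight"] for b in branches]
    agg = soft_aggregate(probs, weights)
    best = max(range(len(agg)), key=agg.__getitem__)
    return best, agg


# Toy usage: two branches scoring three candidate answers.
example_branches = [
    {"grounded": [2.0, 1.0, 0.0], "distorted": [2.0, 0.0, 0.0], "weight": 1.0},
    {"grounded": [0.0, 3.0, 0.0], "distorted": [0.0, 0.0, 0.0], "weight": 1.0},
]
best_index, aggregated = forest_decode(example_branches, alpha=1.0)
```

In this toy setup, the first branch initially favors candidate 0, but the distorted input predicts candidate 0 just as strongly, so the contrastive step leaves candidates 0 and 1 tied. The second branch then tips the aggregate toward candidate 1.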