Keywords: MLLMs, Visual Grounding, TableQA, Visual Reasoning
Abstract: This paper investigates how explicit visual grounding within Chain-of-Thought (CoT) sequences impacts the reasoning proficiency of multimodal large language models (MLLMs). Using visual TableQA as a testbed, we examine this interplay through supervised fine-tuning (SFT) and reinforcement learning (RL) with a hierarchical grounding reward. Our analysis reveals an observable performance trade-off, where the requirement for rigid spatial syntax appears to interfere with the model’s internal reasoning heuristics. These insights suggest that aligning precise spatial anchoring with logical inference poses substantial challenges for current training regimes, highlighting the need for more sophisticated data synthesis and optimization strategies in complex multimodal tasks.
Paper Type: Short
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: MLLMs, Visual Grounding, TableQA, Visual Reasoning
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 3163