Keywords: MLLMs, Visual Grounding, TableQA, Visual Reasoning
Abstract: This paper investigates how explicit visual grounding within Chain-of-Thought (CoT) sequences impacts the reasoning proficiency of multimodal large language models (MLLMs). Using visual TableQA as a testbed, we examine this interplay through supervised fine-tuning (SFT) and reinforcement learning (RL) with a hierarchical grounding reward. Our analysis reveals an observable performance trade-off, where the requirement for rigid spatial syntax appears to interfere with the model’s internal reasoning heuristics. These insights suggest that aligning precise spatial anchoring with logical inference poses substantial challenges for current training regimes, highlighting the need for more sophisticated data synthesis and optimization strategies in complex multimodal tasks.
Paper Type: Short
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Research Area Keywords: MLLMs, Visual Grounding, TableQA, Visual Reasoning
Contribution Types: NLP engineering experiment, Data resources
Languages Studied: English
Submission Number: 3163