TACO: Think-Answer Consistency for Optimized Long-Chain Reasoning and Efficient Data Learning via Reinforcement Learning in LVLMs
Keywords: Reinforcement Learning, Visual Reasoning, Referring Expression Comprehension, Visual Question Answering
Abstract: The training paradigm for Large Vision-Language Models (LVLMs) is evolving toward autonomous problem-solving, a shift that exposes critical instabilities in complex visual reasoning. We identify three failure modes: exploration collapse, inefficient learning, and, most notably, ineffective reasoning, marked by logical inconsistencies between reasoning traces and final answers.
To mitigate these failure modes, we introduce TACO, a reinforcement learning framework that enforces multi-level consistency. TACO comprises three integrated components: a Think-Answer Consistency (TAC) reward that jointly aligns the reasoning trace, the answer, and the ground truth to preserve semantic integrity; a Memory-Guided KL Stabilization (MKS) mechanism that dynamically defers high-risk updates to prevent optimization collapse; and an Adaptive Difficulty Sampling (ADS) module that curates training data for efficient learning.
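A minimal sketch of how a TAC-style reward could couple answer correctness with reasoning-answer consistency is given below. The function names, the substring-based consistency check, and the specific reward values are illustrative assumptions, not the paper's actual formulation.

```python
# Illustrative sketch only: the helper names, the matching rules, and the reward
# values below are assumptions, not the paper's exact TAC reward definition.

def answer_matches(pred: str, gold: str) -> bool:
    """Loose exact-match check (placeholder for a proper answer verifier)."""
    return pred.strip().lower() == gold.strip().lower()

def trace_supports(trace: str, answer: str) -> bool:
    """Placeholder consistency check: the reasoning trace must actually contain
    the produced answer (a real system might use a judge model instead)."""
    return answer.strip().lower() in trace.strip().lower()

def tac_reward(trace: str, answer: str, gold: str) -> float:
    """Think-Answer Consistency style reward: full credit only when the answer is
    correct AND the reasoning trace is consistent with that answer."""
    correct = answer_matches(answer, gold)
    consistent = trace_supports(trace, answer)
    if correct and consistent:
        return 1.0   # trace, answer, and ground truth all agree
    if correct:
        return 0.25  # right answer, but the reasoning does not support it
    return 0.0       # wrong answer

# Example: tac_reward("The animal in the red box is a cat, so the answer is cat.",
#                     "cat", "cat") -> 1.0
```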
Extensive experiments validate TACO's effectiveness: it achieves top performance on 15 benchmarks spanning Referring Expression Comprehension (REC), Visual Question Answering (VQA), and long-horizon Video VQA. TACO exhibits stronger generalization, sustained efficiency, and stability in long-chain reasoning, outperforming conventional RL approaches.
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 15578