TACO: Think-Answer Consistency for Optimized Long-Chain Reasoning and Efficient Data Learning via Reinforcement Learning in LVLMs
Keywords: Reinforcement Learning, Visual Reasoning, Referring Expression Comprehension, Visual Question Answering
Abstract: The training paradigm for Large Vision-Language Models (LVLMs) is evolving toward autonomous problem-solving, a shift that exposes critical instabilities in complex visual reasoning. We identify three failure modes: exploration collapse, inefficient learning, and, most notably, ineffective reasoning, marked by logical inconsistencies between reasoning traces and final answers.
To mitigate these failure modes, we introduce TACO, a reinforcement learning framework that enforces multi-level consistency. TACO comprises three integrated components: a Think-Answer Consistency (TAC) reward that jointly aligns the reasoning trace, the answer, and the ground truth to preserve semantic integrity; a Memory-Guided KL Stabilization (MKS) mechanism that dynamically defers high-risk updates to prevent optimization collapse; and an Adaptive Difficulty Sampling (ADS) module that curates training data for efficient learning.
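A minimal sketch of how a TAC-style reward could couple answer correctness with reasoning-answer consistency is given below. The function names, the substring-based consistency check, and the specific reward values are illustrative assumptions, not the paper's actual formulation.

```python
# Illustrative sketch only: the helper names, the matching rules, and the reward
# values below are assumptions, not the paper's exact TAC reward definition.

def answer_matches(pred: str, gold: str) -> bool:
    """Loose exact-match check (placeholder for a proper answer verifier)."""
    return pred.strip().lower() == gold.strip().lower()

def trace_supports(trace: str, answer: str) -> bool:
    """Placeholder consistency check: the reasoning trace must actually contain
    the produced answer (a real system might use a judge model instead)."""
    return answer.strip().lower() in trace.strip().lower()

def tac_reward(trace: str, answer: str, gold: str) -> float:
    """Think-Answer Consistency style reward: full credit only when the answer is
    correct AND the reasoning trace is consistent with that answer."""
    correct = answer_matches(answer, gold)
    consistent = trace_supports(trace, answer)
    if correct and consistent:
        return 1.0   # trace, answer, and ground truth all agree
    if correct:
        return 0.25  # right answer, but the reasoning does not support it
    return 0.0       # wrong answer

# Example: tac_reward("The animal in the red box is a cat, so the answer is cat.",
#                     "cat", "cat") -> 1.0
```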
Extensive experiments validate TACO's effectiveness: it achieves top performance on 15 benchmarks spanning Referring Expression Comprehension (REC), Visual Question Answering (VQA), and long-horizon Video VQA. TACO exhibits stronger generalization, sustained efficiency, and stability in long-chain reasoning, outperforming conventional RL approaches.
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 15578