Keywords: Vision-Language Models (VLMs), Multimodal Reasoning, Chain-of-Thought, Reward Modeling, Human Inspired
Abstract: Recent advances in vision–language models (VLMs) have markedly improved image–text alignment, yet these models still fall short of human-like visual reasoning. A key limitation is that many VLMs rely on surface correlations rather than building logically coherent, structured representations. As a result, they often miss higher-level semantic structure and form a non-causal understanding of relations, which hinders compositional and verifiable reasoning. To address these limitations by bringing human-inspired cognitive models into the reasoning process, we propose CoTZero, an annotation-free paradigm with two components: \textbf{(i) a dual-stage data synthesis approach and (ii) a cognition-aligned training method}. The first component draws inspiration from neurocognitive accounts of \textit{compositional productivity} and \textit{global-to-local analysis}. In its bottom-up stage, CoTZero extracts atomic visual primitives and incrementally composes them into diverse, structured question–reasoning forms. In its top-down stage, it enforces hierarchical reasoning by using coarse global structure to guide the interpretation of local details and causal relations. The second component, cognition-aligned training, builds on the synthesized CoT data: we introduce \textbf{Cognitively Coherent Verifiable Rewards} (CCVR) into Reinforcement Fine-Tuning (RFT). CCVR provides stepwise feedback on reasoning coherence and factual correctness, further strengthening VLMs' hierarchical reasoning and generalization. Experiments show that CoTZero achieves an F1 score of 83.33\% on our multi-level semantic inconsistency benchmark with lexical-perturbation negatives, across both in-domain and out-of-domain settings. Ablations confirm that each component contributes to more interpretable and human-aligned visual reasoning.
Paper Type: New Full Paper
Submission Number: 39