Your Reasoning Model is Secretly a Reward Model - Training-Free Verification from Experience

ACL ARR 2026 January Submission 2409 Authors

02 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: LLM, Reward Model, Training-Free
Abstract: Assessing the quality of Large Language Model (LLM) outputs becomes especially challenging in high-branching settings, where a single prompt yields many plausible candidates. Existing verifiers typically operate on the surface text (e.g., reward models, LLM judges, majority voting) or on confidence proxies derived from token probabilities, both of which can be brittle: the former can be influenced by stylistic artifacts, while the latter is often miscalibrated. In this paper, we study a third source of information---the model's hidden states---for \emph{binary correctness verification} in tasks with a reliable success/failure signal (e.g., deterministic checkers or reference-grounded answers). We find that correct and incorrect solutions exhibit measurable geometric differences in their hidden-state trajectories. To isolate this signal with minimal modeling assumptions, we introduce \textbf{\textsc{Clue} (Clustering and Experience-based Verification)}, a training-free, non-parametric verifier. \textsc{Clue} summarizes each reasoning trace by an \emph{activation delta}---the difference between hidden states at the start and end of the explicit reasoning span---and predicts correctness by comparing this delta to two class centroids computed from labeled experience. Across math (AIME 24/25), scientific QA (GPQA), and a multi-domain benchmark (WebInstruct-verified), \textsc{Clue} improves selection and reranking, with particularly strong gains on smaller or less-calibrated models. For example, on AIME 24 with a 1.5B model, \textsc{Clue} raises accuracy from 56.7\% (majority@64) to 70.0\% (top-maj@16).
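The verification procedure the abstract describes, summarizing each trace by an activation delta and classifying by nearest class centroid, can be sketched in a few lines. This is a minimal illustration of the stated idea, not the authors' implementation: the function names, the use of cosine similarity as the distance measure, and the shape conventions are all assumptions for the sake of the example.

```python
import numpy as np

def activation_delta(hidden_states):
    """Summarize a reasoning trace by the difference between the hidden
    state at the end and at the start of the explicit reasoning span.
    `hidden_states`: array of shape (seq_len, d_model)."""
    return hidden_states[-1] - hidden_states[0]

def fit_centroids(deltas, labels):
    """Compute one centroid per class (1 = correct, 0 = incorrect) from a
    bank of labeled 'experience' deltas. Training-free: just two means."""
    deltas, labels = np.asarray(deltas), np.asarray(labels)
    return {c: deltas[labels == c].mean(axis=0) for c in (0, 1)}

def verify(delta, centroids):
    """Predict correctness by nearest centroid. Cosine similarity is an
    assumed choice of metric, not taken from the paper."""
    def cos(a, b):
        return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b) + 1e-12))
    return int(cos(delta, centroids[1]) >= cos(delta, centroids[0]))
```

In a selection setting, one would score each of the many sampled candidates with `verify` (or with the raw similarity margin) and rerank or filter before majority voting, which is consistent with the top-maj style selection reported in the abstract.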
Paper Type: Long
Research Area: AI/LLM Agents
Research Area Keywords: Reward model, Verifier
Contribution Types: NLP engineering experiment, Approaches low compute settings-efficiency, Data analysis
Languages Studied: English
Submission Number: 2409