Track: long paper (up to 10 pages)
Keywords: reasoning, test-time scaling, verifiers
Abstract: We present interwhen, a test-time verification framework that ensures that the output of a reasoning
model is valid with respect to a given set of verifiers. Verified reasoning is an important goal in high-stakes
scenarios such as deploying agents in the physical world or in domains such as law and finance.
However, current techniques either rely on the generate-test paradigm that verifies only after the
final answer is produced, or verify partial output through a step-extraction paradigm where the task
execution is externally broken down into structured steps. The former is inefficient while the latter
artificially restricts a model’s problem solving strategies. Instead, we propose to verify a model’s
reasoning trace as-is, taking full advantage of a model’s reasoning capabilities while verifying and
steering the model’s output only when needed. The key idea is meta-prompting: identifying the
verifiable properties that any partial solution should satisfy and then prompting the model to follow a
custom format in its trace so that partial outputs can be easily parsed and checked. We consider
both self-verification and external verification and find that interwhen provides a useful abstraction
to provide feedback and steer reasoning models in each case. Using self-verification, interwhen
obtains state-of-the-art results on early stopping for reasoning models, without any loss in accuracy.
Using external verifiers, interwhen obtains a reasonable improvement in accuracy over test-time scaling
methods, while ensuring 100% soundness with respect to the full verifier and being 4x more efficient.
The arXiv version of the paper is available at https://arxiv.org/pdf/2602.11202
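The meta-prompting idea in the abstract (ask the model to emit partial results in a parseable format, then check them while the trace is still being generated) can be illustrated with a minimal sketch. This is not the paper's implementation: the `[[step: a op b = c]]` marker format, the arithmetic verifier, and the names `STEP_PATTERN`, `check_arithmetic`, and `monitor_trace` are all hypothetical choices made for this example.

```python
import re
from typing import Iterable, Optional, Tuple

# Hypothetical marker format that a meta-prompt might ask the model to emit
# inside its free-form reasoning trace, e.g. "[[step: 2 + 3 = 5]]".
STEP_PATTERN = re.compile(
    r"\[\[step:\s*(-?\d+)\s*([+\-*])\s*(-?\d+)\s*=\s*(-?\d+)\s*\]\]"
)


def check_arithmetic(a: int, op: str, b: int, claimed: int) -> bool:
    """Toy external verifier: is the claimed result of `a op b` correct?"""
    actual = {"+": a + b, "-": a - b, "*": a * b}[op]
    return actual == claimed


def monitor_trace(chunks: Iterable[str]) -> Optional[Tuple[int, str]]:
    """Scan a streamed trace and stop at the first step that fails verification.

    Returns (character offset of the failing step, feedback message),
    or None if every parsed step passes.
    """
    buffer = ""
    scanned = 0  # end offset of the last step that was already verified
    for chunk in chunks:
        buffer += chunk
        # Only look at text after the last verified step; a marker split
        # across two chunks is matched once its closing "]]" has arrived.
        for match in STEP_PATTERN.finditer(buffer, scanned):
            scanned = match.end()
            a, op, b, claimed = (
                int(match[1]), match[2], int(match[3]), int(match[4])
            )
            if not check_arithmetic(a, op, b, claimed):
                feedback = f"Step '{match[0]}' is invalid; please recheck it."
                return match.start(), feedback
    return None


# Usage: the second step is wrong, so monitoring stops there with feedback.
streamed = ["Let me think. [[step: 2 + 3", " = 5]] Then [[step: 5 * 4 = 21]] so..."]
violation = monitor_trace(streamed)
```

In a real system, the returned offset would mark where generation could be interrupted and the feedback appended to steer the model; text outside the markers is left unconstrained, so the model keeps its own problem-solving strategy.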
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Funding: Yes, the presenting author of this submission falls under ICLR’s funding aims, and funding would significantly impact their ability to attend the workshop in person.
Submission Number: 76