Track: long paper (up to 10 pages)
Keywords: Multi-agent LLMs, logical reasoning, LLM-as-judge, confidence estimation, token-level probabilities, reasoning evaluation, uncertainty quantification
TL;DR: Early-generation confidence signals provide a strong and lightweight predictor of logical reasoning quality in multi-agent LLM systems.
Abstract: Assessing the logical reasoning quality of large language models (LLMs) remains a central challenge, particularly in open-ended settings where reference answers are unavailable. In this work, we investigate whether intrinsic confidence signals, specifically token-level log-probabilities produced during decoding, can serve as reliable indicators of reasoning quality in multi-agent LLM systems. We study this question in a debate-based framework, where agents generate competing arguments and are evaluated by an LLM-as-judge along structured reasoning dimensions. We find that early-generation confidence signals, especially dispersion within the first few tokens, consistently outperform full-sequence statistics in predicting judged reasoning quality. Trajectory analysis reveals that the opening phase of generation is the most heterogeneous, making it the most informative region for distinguishing high- and low-quality reasoning. We further identify a systematic asymmetry between reasoning roles: confidence aligns more strongly with supportive reasoning than with adversarial critique, reflecting differences in their underlying failure modes. These findings suggest that early decoding dynamics provide a lightweight and scalable signal for estimating reasoning reliability, offering a bridge between intrinsic model uncertainty and external evaluation of logical reasoning in LLMs.
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Submission Number: 197
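Illustrative sketch (not from the submission): the abstract contrasts dispersion of token-level log-probabilities over the first few tokens with full-sequence statistics. One plausible way to compute both is shown below. The function name `confidence_features`, the window size `k`, and the use of population standard deviation as the dispersion measure are all assumptions made for illustration; the abstract does not specify the exact statistic or window used.

```python
from statistics import mean, pstdev


def confidence_features(token_logprobs, k=5):
    """Summarize per-token log-probabilities for one generated response.

    token_logprobs: list of floats, one log-probability per generated token.
    k: hypothetical size of the "early" window (an assumption, not a value
       taken from the submission).
    """
    if not token_logprobs:
        raise ValueError("token_logprobs must contain at least one value")

    head = token_logprobs[:k]  # opening phase of generation
    return {
        # Early-generation statistics (first k tokens only).
        "early_mean": mean(head),
        "early_std": pstdev(head),  # dispersion, assumed here to be std. dev.
        # Full-sequence statistics for comparison.
        "full_mean": mean(token_logprobs),
        "full_std": pstdev(token_logprobs),
    }


# Made-up log-probabilities, used only to show the call.
example = [-0.1, -2.0, -0.3, -1.5, -0.2, -0.05, -0.05, -0.05]
features = confidence_features(example, k=4)
```

Features like these could then be correlated with LLM-as-judge scores to compare how well the early window and the full sequence predict judged reasoning quality.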