Test-Time Scaling in Reasoning Models Is Not Effective for Knowledge-Intensive Tasks Yet

Published: 02 Mar 2026, Last Modified: 30 Mar 2026
Venue: Agentic AI in the Wild: From Hallucinations to Reliable Autonomy (Poster)
License: CC BY 4.0
Keywords: test-time scaling, reasoning models, factuality hallucination, information-theoretic analysis
TL;DR: Increasing test-time computation does not consistently improve factual accuracy or reduce hallucinations in reasoning models.
Abstract: Test-time scaling increases inference-time computation by enabling longer reasoning chains and has shown strong performance gains across many domains. However, frontier models still suffer from hallucinations on knowledge-intensive tasks, raising the question of whether increasing test-time computation is effective in this setting. In this work, we evaluate 14 reasoning models under different test-time scaling strategies on parametric knowledge benchmarks. Our results challenge the effectiveness of test-time scaling in this setting: increasing test-time computation does not consistently improve accuracy and often leads to more hallucinations. We find that changes in hallucination rates are largely driven by the model's willingness to answer, as longer reasoning encourages more attempts, many of which are incorrect. Extended reasoning can also induce confirmation bias, where models reinforce early incorrect beliefs with fabricated details, resulting in overconfident hallucinations. Finally, we provide an information-theoretic perspective showing that compute-only test-time scaling, as a post-processing step on a fixed model, cannot increase information about the ground-truth answer. Overall, our findings highlight fundamental limitations of current test-time scaling methods for knowledge-intensive tasks.
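The information-theoretic claim in the abstract can be sketched via the data-processing inequality: if the scaled output is produced by post-processing a fixed model's output alone, it cannot carry more information about the ground truth. The following is a minimal statement of that argument; the notation ($Y$, $X$, $Z$, $g$) is assumed for illustration and is not taken from the paper.

```latex
% Sketch of the data-processing-inequality argument (notation assumed):
%   Y = ground-truth answer
%   X = the fixed model's output (or output distribution) for a query
%   Z = g(X), the result of compute-only test-time scaling,
%       viewed as (possibly stochastic) post-processing of X
% Because Y -> X -> Z forms a Markov chain,
\[
  Y \to X \to Z = g(X)
  \quad\Longrightarrow\quad
  I(Y; Z) \le I(Y; X),
\]
% i.e., post-processing a fixed model cannot increase the mutual
% information with the ground-truth answer Y.
```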
Submission Number: 20