Keywords: language models, multi-turn reasoning, agentic reasoning, internal world model
TL;DR: Our method quantifies how language models update internal beliefs across multi-turn interactions, tested on "Twenty Questions". It shows that base Qwen3 models struggle while fine-tuning enables coherent belief updates, and it exposes reward hacking in RL-trained models.
Abstract: To effectively perform open-ended tasks, language models must identify gaps in their knowledge, take actions to acquire new information, and update their internal world models accordingly. This raises a key question: how can we assess whether their reasoning chains and multi-turn actions contribute to improving beliefs in their internal world model? In this paper, we demonstrate a simple, scalable method for measuring belief updates by sequentially assessing the log-probabilities that a language model assigns to the true belief across multi-turn actions. We assess model belief updates on a multi-turn RL reasoning benchmark, "Twenty Questions". Our findings show that recent Qwen3 models struggle to update their beliefs, even when the quality of generated questions is controlled for. Through counterfactual experiments, we validate that finetuning teaches student models to perform coherent belief updates, which they could not do before. Intriguingly, we find that measuring model beliefs also enables the detection of reward hacking in RL-trained models. Overall, we offer a novel perspective on measuring and understanding the intermediate beliefs of language models.
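Below is a minimal sketch of the kind of measurement the abstract describes: scoring the log-probability a causal LM assigns to the true answer after each turn of a "Twenty Questions" transcript. The model name, probe phrasing, and helper function are illustrative assumptions, not the paper's exact setup.

```python
# Sketch: track the log-probability of the true belief across multi-turn interactions.
# Assumes a Hugging Face causal LM; model name and probe wording are placeholders.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

model_name = "Qwen/Qwen3-4B"  # assumed model; any causal LM works
tok = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)
model.eval()

def true_belief_logprob(transcript: str, true_answer: str) -> float:
    """Sum of log-probabilities the model assigns to the tokens of
    `true_answer`, conditioned on the multi-turn transcript so far."""
    prompt = transcript + "\nThe hidden object is:"  # assumed probe phrasing
    prompt_ids = tok(prompt, return_tensors="pt").input_ids
    answer_ids = tok(" " + true_answer, add_special_tokens=False,
                     return_tensors="pt").input_ids
    input_ids = torch.cat([prompt_ids, answer_ids], dim=1)
    with torch.no_grad():
        logits = model(input_ids).logits
    # Log-probs of each answer token, predicted from the preceding position.
    logprobs = torch.log_softmax(logits[0, prompt_ids.shape[1] - 1 : -1], dim=-1)
    return logprobs.gather(1, answer_ids[0].unsqueeze(1)).sum().item()

# Example: belief in the true answer should rise as informative turns accumulate.
turns = [
    "Q: Is it an animal? A: No.",
    "Q: Is it edible? A: Yes.",
    "Q: Is it a fruit? A: Yes.",
]
transcript = "We are playing Twenty Questions."
for turn in turns:
    transcript += "\n" + turn
    print(true_belief_logprob(transcript, "an apple"))
```

Plotting this quantity turn by turn gives a per-trajectory belief curve; a coherent updater's curve should trend upward as questions are answered, which is the behavior the paper probes.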
Submission Number: 45