Keywords: large language models, cognition, probabilistic reasoning, active sampling, policy learning, decision-making, in-context learning
Abstract: Can large language models (LLMs), when acting as agents, match human cognitive capabilities in sequential reasoning? To answer this question, we designed a novel active probabilistic reasoning task that can be played by both humans and LLMs. Our minimal task design allows us to disentangle two essential components of decision-making: sampling (gathering evidence) and inference (evaluating evidence). We evaluated a large set of LLMs and found a wide spectrum of performance. Several frontier models reach human-level performance but do not exceed that of skilled human players. Strong model performance consistently relies on extensive reasoning. While some LLMs outperform humans in inference, all models consistently lag behind in sampling. To probe the source of these differences, we developed a novel Bayesian modeling framework that tracks sampling-policy updates and maps humans and LLMs onto different classical observer models. We show that humans tend toward maximum-a-posteriori (MAP) sampling, whereas the best LLMs tend to minimize posterior entropy across options. We further tested whether LLMs can improve via in-context learning, and found that only a subset of top-performing models could learn to solve the task based solely on the outcomes of their choices.
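To make the distinction between the two observer models named in the abstract concrete, here is a minimal sketch, assuming a toy task with three hypotheses and noisy binary probes (the hit/miss rates and prior are illustrative assumptions, not the paper's actual task or Bayesian framework). A MAP-style sampler probes the option tied to the currently most probable hypothesis, while an entropy-minimizing sampler probes the option that most reduces expected posterior entropy; the assumed asymmetric probe reliabilities are chosen so the two policies disagree.

```python
# Minimal sketch (not the paper's code): MAP sampling vs. expected-posterior-entropy
# minimization on a toy belief-updating task. All numbers are illustrative assumptions.
import numpy as np

HIT = np.array([0.6, 0.9, 0.9])    # assumed P(signal=1 | option k probed, hypothesis k true)
MISS = np.array([0.4, 0.1, 0.1])   # assumed P(signal=1 | option k probed, hypothesis k false)
K = len(HIT)

def likelihood(option, signal):
    """P(signal | hypothesis h) for every h when 'option' is probed."""
    p_one = np.where(np.arange(K) == option, HIT[option], MISS[option])
    return p_one if signal == 1 else 1.0 - p_one

def update(posterior, option, signal):
    """Bayesian belief update after observing 'signal' from 'option'."""
    post = posterior * likelihood(option, signal)
    return post / post.sum()

def entropy(p):
    return float(-np.sum(p * np.log(p + 1e-12)))

def map_sampler(posterior):
    """MAP-style policy: probe the option tied to the most probable hypothesis."""
    return int(np.argmax(posterior))

def min_entropy_sampler(posterior):
    """Probe the option whose expected posterior entropy after the observation is lowest."""
    expected_H = []
    for option in range(K):
        p_one = float(np.sum(posterior * likelihood(option, 1)))  # predictive P(signal=1)
        expected_H.append(p_one * entropy(update(posterior, option, 1))
                          + (1 - p_one) * entropy(update(posterior, option, 0)))
    return int(np.argmin(expected_H))

if __name__ == "__main__":
    belief = np.array([0.5, 0.3, 0.2])  # assumed current posterior over hypotheses
    print("MAP sampler chooses option:", map_sampler(belief))                  # -> 0
    print("Min-entropy sampler chooses option:", min_entropy_sampler(belief))  # -> 1
```

Under these assumed reliabilities, the MAP policy probes the leading hypothesis (option 0), whereas the entropy-minimizing policy prefers the more reliable probe (option 1), illustrating how the two sampling policies can diverge even from the same belief state.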
Primary Area: applications to neuroscience & cognitive science
Submission Number: 21957