Reflection-Bench: Evaluating Epistemic Agency in Large Language Models

Published: 01 May 2025 · Last Modified: 18 Jun 2025 · ICML 2025 poster · CC BY 4.0
TL;DR: Reflection-Bench is a cognitive-psychology-inspired, parameterized, contamination-minimizing benchmark that evaluates LLMs' intrinsic epistemic agency: the capacity to flexibly construct, adapt, and monitor beliefs about dynamic environments.
Abstract: With large language models (LLMs) increasingly deployed as cognitive engines for AI agents, their reliability and effectiveness critically hinge on their intrinsic epistemic agency, which remains understudied. Epistemic agency, the ability to flexibly construct, adapt, and monitor beliefs about dynamic environments, represents a base-model-level capacity independent of specific tools, modules, or applications. We characterize the holistic process underlying epistemic agency, which unfolds in seven interrelated dimensions: prediction, decision-making, perception, memory, counterfactual thinking, belief updating, and meta-reflection. Correspondingly, we propose Reflection-Bench, a cognitive-psychology-inspired benchmark consisting of seven tasks designed for long-term relevance and minimal data leakage. Through a comprehensive evaluation of 16 models using three prompting strategies, we identify a clear three-tier performance hierarchy and significant limitations of current LLMs, particularly in meta-reflection capabilities. While state-of-the-art LLMs demonstrate rudimentary signs of epistemic agency, our findings suggest several promising research directions, including enhancing core cognitive functions, improving cross-functional coordination, and developing adaptive processing mechanisms. Our code and data are available at https://github.com/AI45Lab/ReflectionBench.
Lay Summary: As large language models (LLMs) become the "brains" of AI agents that interact with the real world, we need to understand how well LLMs can construct, adapt, and monitor their beliefs about changing environments, a capacity we call "epistemic agency." This is a base-model-level characteristic that determines whether LLMs can truly serve as reliable cores of AI agents. However, current evaluations mainly focus on specific applications or isolated abilities. We proposed Reflection-Bench, a comprehensive benchmark that evaluates epistemic agency by breaking it down into seven key cognitive capabilities: predicting what will happen next, making decisions, noticing surprises, remembering past events, considering "what if" scenarios, updating beliefs when wrong, and reflecting on overall patterns. To evaluate these capabilities, we adapted seven cognitive tests used to study human cognition. Importantly, we designed these tests with adjustable parameters to prevent LLMs from simply memorizing answers. We tested 16 major LLMs and discovered that while top models showed basic epistemic agency, all models struggled with recognizing global patterns across multiple experiences. These findings not only help us understand the capabilities and limitations of current LLMs, but also provide insights for enhancing LLMs' epistemic agency toward more reliable AI agents.
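To illustrate what "adjustable parameters to prevent memorization" can look like in practice, here is a minimal, hypothetical Python sketch of a parameterized cognitive task of the kind the benchmark draws on (a probabilistic reversal-learning trial generator). It is not the Reflection-Bench implementation; the class name, parameter ranges, and scoring baseline are all illustrative assumptions. The point is that reward probabilities and the reversal point are freshly sampled per run, so there is no fixed answer key that could leak into training data.

```python
import random
from dataclasses import dataclass, field


@dataclass
class ReversalLearningTask:
    """Hypothetical parameterized probabilistic reversal-learning task.

    Parameters are sampled at generation time, so each instance has a
    different answer key; a model must adapt its beliefs online rather
    than recall memorized answers.
    """
    n_trials: int = 40
    reward_p: float = 0.0           # payout probability of the "good" arm
    reversal_trial: int = 0         # trial index where the good arm flips
    good_arm: str = "A"
    rng: random.Random = field(default_factory=random.Random)

    def generate(self, seed=None):
        """Sample fresh task parameters for one evaluation run."""
        self.rng = random.Random(seed)
        self.reward_p = self.rng.uniform(0.7, 0.9)
        self.reversal_trial = self.rng.randint(self.n_trials // 3,
                                               2 * self.n_trials // 3)
        self.good_arm = self.rng.choice(["A", "B"])

    def feedback(self, trial, choice):
        """Return 1 (reward) or 0 for the model's choice on this trial."""
        good = self.good_arm
        if trial >= self.reversal_trial:      # contingencies silently reverse
            good = "B" if good == "A" else "A"
        p = self.reward_p if choice == good else 1.0 - self.reward_p
        return int(self.rng.random() < p)


if __name__ == "__main__":
    task = ReversalLearningTask()
    task.generate(seed=1234)
    # An LLM (or a simple win-stay/lose-shift baseline, as below) picks an
    # arm each trial, observes feedback, and is scored on how quickly it
    # adapts after the hidden reversal.
    choice = "A"
    for t in range(task.n_trials):
        if task.feedback(t, choice) == 0:
            choice = "B" if choice == "A" else "A"   # lose-shift baseline
```

In an evaluation harness, the trial-by-trial prompts and feedback would be serialized into the model's context, and performance would be scored against the sampled parameters rather than a static answer set.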
Link To Code: https://github.com/AI45Lab/ReflectionBench
Primary Area: General Machine Learning->Evaluation
Keywords: large language models, autonomous agent, cognitive psychology
Submission Number: 15131