Estimating the Empowerment of Language Model Agents

Published: 06 Oct 2025, Last Modified: 04 Nov 2025
MTI-LLM @ NeurIPS 2025 Poster
License: CC BY-ND 4.0
Keywords: Language models, Agents, Empowerment, Evaluation, Multi-turn interactions
TL;DR: EELMA is a scalable, task-agnostic framework that estimates empowerment from multi-turn interactions to evaluate LM agents; it correlates with task success, captures agentic factors, and highlights key states without human labels.
Abstract: As language model (LM) agents become more intelligent and gain broader access to real-world tools, there is a growing need for scalable evaluation frameworks of agentic capability. However, most evaluations of LM agents are goal-centric, measuring success on human-specified tasks or end states. Such evaluations are costly to design and fail to capture general capability. In this work, we propose an information-theoretic framework based on \emph{empowerment}, the mutual information between an agent’s actions and future states, as a principled metric for the evaluation of LM agents. We introduce \textbf{EELMA (Estimating Empowerment of Language Model Agents)}, an algorithm that estimates effective empowerment from multi-turn text interactions (code: \url{https://anonymous.4open.science/r/EELMA-E227}). Applying EELMA to language games (text-based Gridworld and Tower of Hanoi) and realistic web-browsing scenarios, we observe three key findings: (i) empowerment strongly correlates with average task performance; (ii) environmental complexity and agentic factors, including chain-of-thought prompting, model scale, and memory length, systematically affect empowerment and performance; and (iii) empowerment traces highlight influential states and actions without requiring human annotation. Together, these results demonstrate empowerment as a practical, general-purpose metric for evaluating LM agents and monitoring their behavior in complex, open-ended settings.
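For intuition: empowerment at a state $s_t$ is standardly defined as a channel capacity, $\max_{p(a)} I(A_t; S_{t+k} \mid s_t)$, while an effective empowerment estimate evaluates the mutual information $I(A_t; S_{t+k})$ under the agent's own policy, which is one natural reading of what logged multi-turn interactions make observable. The paper's actual estimator is in the linked repository; the sketch below is only a minimal plug-in estimate of $I(A_t; S_{t+k})$ from logged trajectories, assuming states and actions have already been discretized (e.g., by quantizing text embeddings into cluster ids). The function name and data layout are hypothetical, not EELMA's API.

```python
from collections import Counter
from math import log2

def estimate_effective_empowerment(trajectories, horizon=1):
    """Plug-in estimate (in bits) of I(A_t; S_{t+horizon}) from logged episodes.

    trajectories: list of episodes, each a list of (state, action) pairs,
    where states and actions are hashable (e.g., text snippets or cluster ids
    of quantized embeddings). Hypothetical helper, not the EELMA estimator.
    """
    joint, a_marg, s_marg = Counter(), Counter(), Counter()
    n = 0
    for episode in trajectories:
        for t in range(len(episode) - horizon):
            _, action = episode[t]                 # action taken at step t
            next_state, _ = episode[t + horizon]   # state reached `horizon` steps later
            joint[(action, next_state)] += 1
            a_marg[action] += 1
            s_marg[next_state] += 1
            n += 1
    # I(A; S') = sum over (a, s') of p(a, s') * log2( p(a, s') / (p(a) * p(s')) )
    return sum(
        (c / n) * log2((c / n) / ((a_marg[a] / n) * (s_marg[s] / n)))
        for (a, s), c in joint.items()
    )

# Toy usage on two text-Gridworld-style episodes:
episodes = [
    [("room A", "go east"), ("room B", "go east"), ("room C", "stop")],
    [("room A", "go south"), ("room D", "go east"), ("room E", "stop")],
]
print(estimate_effective_empowerment(episodes))  # ~0.81 bits
```

Higher values indicate that the agent's actions reliably steer which states it reaches; near-zero values indicate actions that barely influence the future. Plug-in counting only scales to small discrete alphabets, so a practical estimator over open-ended text would presumably rely on learned (e.g., contrastive) mutual-information bounds rather than raw frequencies.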
Submission Number: 82