Faithful Simulation of User–Agent–Environment Interactions for Scalable LLM Agent Evaluation

Published: 28 Sept 2025, Last Modified: 14 Oct 2025
Venue: SEA @ NeurIPS 2025 Poster
License: CC BY 4.0
Keywords: LLM Agents, User–Agent–Environment Simulation, Agent Evaluation Framework
Abstract: Large language models (LLMs) are transitioning from chatbots to interactive agents. In this shift, environments have become critical both for evaluating agent performance and for improving agent capabilities. Yet current evaluation options fall short: human-in-the-loop testing is prohibitively costly, and existing benchmarks and simulation frameworks oversimplify interactions, failing to capture real-world complexity. This paper presents a fully automated framework for simulating User–Agent–Environment interactions, providing scalable and faithful interaction data for agent evaluation. The framework (1) constructs multi-step tasks by sampling from a Tool–Relationship Graph, (2) simulates closed-loop conversations with configurable user and environment archetypes, and (3) evaluates outcomes along three axes: procedural alignment (Procedure Alignment Score), end-to-end success (Outcome Success), and simulation faithfulness (Configuration Similarity). We apply this framework to evaluate state-of-the-art open- and closed-source agents. Experiments across thousands of scenarios reveal three key findings: (i) environment reliability is the dominant factor in agent success, (ii) user archetypes strongly shape performance, and (iii) tool-calling trace fidelity correlates with, but does not fully determine, end-to-end goal achievement. By integrating User, Agent, and Environment in a unified loop, and by combining flexibility with explicit faithfulness control, our framework provides a principled basis for evaluating and improving agentic LLMs under diverse conditions.
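To make the abstract's pipeline concrete, here is a minimal Python sketch of steps (1) and (3): sampling a multi-step task as a walk over a toy tool-relationship graph, then scoring how closely an agent's tool-call trace follows the reference procedure. The tool names, graph structure, and the sequence-similarity scoring are illustrative assumptions, not the paper's actual graph construction or Procedure Alignment Score definition.

```python
import random
from difflib import SequenceMatcher

# Hypothetical tool-relationship graph: an edge means one tool's output
# can plausibly feed the next tool. Names are illustrative only.
TOOL_GRAPH = {
    "search_flights": ["compare_prices", "book_flight"],
    "compare_prices": ["book_flight"],
    "book_flight": ["send_confirmation"],
    "send_confirmation": [],
}

def sample_task(graph, start, max_steps=4, rng=random):
    """Sample a multi-step task as a random walk over the tool graph."""
    path = [start]
    while len(path) < max_steps:
        successors = graph.get(path[-1], [])
        if not successors:
            break
        path.append(rng.choice(successors))
    return path

def procedure_alignment(expected, observed):
    """Toy stand-in for a procedure-alignment metric: similarity between
    the reference tool-call sequence and the agent's observed trace."""
    return SequenceMatcher(None, expected, observed).ratio()

if __name__ == "__main__":
    reference = sample_task(TOOL_GRAPH, "search_flights")
    # Pretend the agent skipped the price-comparison step.
    trace = [tool for tool in reference if tool != "compare_prices"]
    print("reference:", reference)
    print("trace:    ", trace)
    print("alignment: %.2f" % procedure_alignment(reference, trace))
```

In this sketch a skipped step lowers the alignment score while the task may still end in a successful booking, mirroring the paper's finding that trace fidelity correlates with, but does not fully determine, end-to-end success.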
Archival Option: The authors of this submission do *not* want it to appear in the archival proceedings.
Submission Number: 90