Faithful Simulation of User–Agent–Environment Interactions for Scalable LLM Agent Evaluation

15 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: LLM Agents, User–Agent–Environment Simulation, Agent Evaluation Framework
Abstract: Large language models (LLMs) are transitioning from chatbots to interactive agents. In this shift, understanding environments and user behavior has become critical not only for measuring capability, but also for validating whether an end-to-end agent system remains robust as its tools and interfaces evolve. Yet current evaluation options fall short: human-in-the-loop testing is prohibitively costly, and available benchmarks and simulation frameworks oversimplify interactions, failing to capture real-world complexity. This paper presents FUSE, a fully automated framework for simulating User–Agent–Environment interactions that functions as a scalable integration-test generator for specific agent deployments. FUSE works by: (1) constructing multi-step tasks by sampling from a Tool–Relationship Graph; (2) simulating closed-loop conversations with configurable user and environment archetypes; and (3) evaluating outcomes with a Procedure Alignment Score (procedural fidelity), Outcome Success (end-to-end success), and Meta Evaluation (simulation faithfulness), including human alignment and sim-to-real transfer.
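The three-stage pipeline in the abstract can be illustrated with a minimal sketch. All names here (`TOOL_GRAPH`, `sample_task`, `procedure_alignment`) are hypothetical stand-ins, not the FUSE implementation: a toy Tool–Relationship Graph is sampled to build a multi-step task (step 1), a fixed agent trace stands in for the closed-loop conversation (step 2), and a simple prefix-match ratio stands in for the Procedure Alignment Score (step 3).

```python
import random

# Hypothetical Tool-Relationship Graph: an edge A -> B means tool B can
# plausibly follow tool A in a task. Illustrative only, not from the paper.
TOOL_GRAPH = {
    "search_flights": ["book_flight"],
    "book_flight": ["send_confirmation"],
    "send_confirmation": [],
}

def sample_task(graph, start, max_len=3, rng=None):
    """Step 1: sample a multi-step task as a walk over the tool graph."""
    rng = rng or random.Random(0)  # seeded for reproducible task generation
    path = [start]
    while len(path) < max_len and graph[path[-1]]:
        path.append(rng.choice(graph[path[-1]]))
    return path

def procedure_alignment(expected, observed):
    """Step 3 (toy): fraction of the expected tool sequence matched as a prefix."""
    matched = 0
    for e, o in zip(expected, observed):
        if e != o:
            break
        matched += 1
    return matched / len(expected)

task = sample_task(TOOL_GRAPH, "search_flights")
# Step 2 would produce this trace via a simulated user/environment dialogue;
# here we hard-code one for illustration.
trace = ["search_flights", "book_flight", "send_confirmation"]
print(task)                                # sampled expected tool sequence
print(procedure_alignment(task, trace))    # 1.0 when the trace matches fully
```

In a real harness, the hard-coded `trace` would come from replaying the agent against simulated user and environment archetypes, and Outcome Success would additionally check the final environment state rather than only the tool-call order.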
Primary Area: datasets and benchmarks
Submission Number: 5786