Efficient Agent Evaluation via Diversity-Guided User Simulation

Published: 18 Apr 2026, Last Modified: 23 Apr 2026
ACL 2026 Industry Track Poster
License: CC BY 4.0
Keywords: LLM Agents, Agent Evaluation, Evaluation, Multi-Turn, Robustness, User Simulation
TL;DR: DIVERT is a coverage-guided user simulation framework for efficient multi-turn LLM agent evaluation that improves failure discovery by reusing conversation prefixes and branching from critical mid-trajectory states with diverse user responses.
Abstract: Large language models (LLMs) are increasingly deployed as customer-facing agents, yet evaluating their reliability remains challenging due to stochastic, multi-turn interactions. Current evaluation protocols rely on linear Monte Carlo rollouts of full agent-user conversations to estimate success. This approach is computationally inefficient, reprocessing identical conversation prefixes across runs, and it often fails to uncover deep failure modes triggered by rare user behaviors. We introduce DIVERT (Diversity-Induced Evaluation via Branching of Trajectories), a snapshot-based, coverage-guided user simulation framework for efficient and systematic exploration of multi-turn agent behavior. DIVERT captures the full agent–environment state at critical junctions and resumes execution from these points, reusing shared prefixes to avoid redundant regeneration and reduce token cost. From each junction, it branches with targeted, diverse user responses, enabling directed exploration of alternative interaction paths while preserving task intent. By reallocating computation from redundant restarts to behaviorally salient mid-trajectory states, DIVERT steers evaluation toward under-explored semantic regions and rare interaction failures. Experiments on realistic multi-domain benchmarks show that our method consistently improves failure discovery efficiency and task-level coverage compared to standard linear rollout evaluation, without increasing overall cost.
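The core mechanism the abstract describes, snapshotting a mid-trajectory state and branching from it with diverse user responses instead of replaying the full conversation, can be sketched in a few lines. This is a simplified illustration, not the authors' implementation: `Snapshot`, `branch_from`, and `toy_agent` are hypothetical names, and a real deployment would call an LLM agent and a coverage-guided user simulator where the toy functions stand in.

```python
import copy
from dataclasses import dataclass


@dataclass
class Snapshot:
    """Frozen agent-environment state captured at a critical junction."""
    turns: list       # conversation prefix shared by all branches
    env_state: dict   # environment/tool state at the junction


def branch_from(snapshot, user_variants, agent_step):
    """Resume from one snapshot and fan out with diverse user responses.

    The shared prefix in `snapshot.turns` is reused as-is, so only the
    new suffix of each branch incurs generation cost.
    """
    rollouts = []
    for user_msg in user_variants:
        state = copy.deepcopy(snapshot.env_state)      # restore, don't replay
        turns = snapshot.turns + [("user", user_msg)]  # prefix + new branch
        reply = agent_step(turns, state)               # generate suffix only
        turns.append(("agent", reply))
        rollouts.append(turns)
    return rollouts


# Toy stand-in for the agent; a real evaluation would query the LLM agent.
def toy_agent(turns, state):
    return f"ack:{turns[-1][1]}"


snap = Snapshot(
    turns=[("user", "book a flight"), ("agent", "to where?")],
    env_state={"booking": None},
)
branches = branch_from(snap, ["Paris", "actually, cancel"], toy_agent)
```

Each element of `branches` is a full trajectory that shares the two-turn prefix but diverges at the junction, which is the source of the token savings over restarting every rollout from scratch.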
Submission Type: Emerging
Copyright Form: pdf
Submission Number: 394