Coverage-Aware Test Generation for Conversational AI Agents

Published: 23 May 2026, Last Modified: 23 May 2026
Venue: ICML 2026 AIWILD
License: CC BY 4.0
Keywords: agent evaluation, conversational AI agents, coverage-aware test generation, intent collapse, LLM-as-judge, synthetic data generation, red teaming, agent benchmarking, policy graph, mode collapse
TL;DR: A coverage-aware test-generation pipeline for conversational AI agents deployed in the wild, turning intent collapse into a measurable, fixable reliability problem — and lifting test coverage from 71.7% to 86.6%.
Abstract: Testing multi-capability conversational AI agents requires diverse, high-coverage test scenarios spanning the agent’s full intent space. Existing synthetic data tools generate scenarios from context documents in a single pass, causing intent collapse: the generated set concentrates on the most prominent topic, leaving the majority of agent capabilities untested. We present Scenario Studio (SS), a config-driven pipeline that prevents intent collapse through coverage-aware orchestration: a PolicyGraph models the agent’s full capability space, a CoverageTracker monitors real-time intent coverage, and gap-directed generation steers each round toward uncovered areas. SS also serves as a pluggable orchestration layer: any open-source generation or red-teaming tool can be integrated as a backend module, immediately benefiting from SS’s compiled context and coverage steering. We benchmark SS against DeepEval Synthesizer and two GPT-prompting baselines across two production agents: Agent-A, a diverse general-support assistant spanning 7+ intent families (127 ground-truth queries), and Agent-B, a narrow ride-booking B2B agent (106 queries). SS achieves 86.6% ground-truth coverage on the diverse agent versus 15.7% for DeepEval, a 5.5× improvement with non-overlapping 95% bootstrap confidence intervals (p<0.05). With a comparable frontier model, SS outperforms premium GPT-5.2 baselines by 14.9 pp, demonstrating that architecture matters more than model capability. A dual-judge study confirms 89.0–95.3% per-item agreement with identical tool rankings, and all gains on the diverse agent are statistically significant under both judges. SS also generates adversarial scenarios across 12+ attack categories via hybrid red-teaming backends (native + DeepTeam + PromptFoo) and supports consolidation that removes 31–37% of scenarios while slightly improving coverage.
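The coverage-aware loop described in the abstract can be illustrated with a minimal sketch. This is not the Scenario Studio implementation: the name CoverageTracker comes from the abstract, but its methods, the `generate_with_coverage` function, the `stub_generator` backend, and the reduction of the PolicyGraph to a flat set of intent labels are all assumptions made for illustration.

```python
from dataclasses import dataclass, field
from typing import Callable, Iterable


@dataclass
class CoverageTracker:
    """Hypothetical tracker: records which intents have been exercised so far."""
    intents: frozenset[str]                      # stand-in for the PolicyGraph's intent space
    covered: set[str] = field(default_factory=set)

    def record(self, scenario_intents: Iterable[str]) -> None:
        # Count only intents that belong to the modeled capability space.
        self.covered.update(i for i in scenario_intents if i in self.intents)

    def gaps(self) -> list[str]:
        # Intents that no generated scenario has touched yet.
        return sorted(self.intents - self.covered)

    def coverage(self) -> float:
        return len(self.covered) / len(self.intents) if self.intents else 1.0


def generate_with_coverage(
    intents: Iterable[str],
    generate: Callable[[list[str]], list[tuple[str, str]]],
    max_rounds: int = 5,
    per_round: int = 3,
) -> tuple[list[str], CoverageTracker]:
    """Gap-directed generation: each round targets only uncovered intents."""
    tracker = CoverageTracker(frozenset(intents))
    scenarios: list[str] = []
    for _ in range(max_rounds):
        gaps = tracker.gaps()
        if not gaps:
            break                                # full coverage reached, stop early
        targets = gaps[:per_round]               # steer this round toward the gaps
        for text, intent in generate(targets):
            scenarios.append(text)
            tracker.record([intent])
    return scenarios, tracker


def stub_generator(targets: list[str]) -> list[tuple[str, str]]:
    # Assumption: stands in for an LLM or third-party backend; returns
    # (scenario_text, intent_label) pairs, one per targeted intent.
    return [(f"User asks about {t}", t) for t in targets]


scenarios, tracker = generate_with_coverage(
    ["billing", "refund", "login", "shipping"], stub_generator
)
print(len(scenarios), tracker.coverage(), tracker.gaps())
```

A single-pass generator would call the backend once with the full context; the sketch differs only in that the backend is called repeatedly and is told which intents are still missing, which is the mechanism the abstract credits for avoiding intent collapse.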
Track: Regular Paper (9 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 35