Keywords: Meta-evaluation, Agentic Evaluation, Synthetic Data
Abstract: Agent evaluation is often performed on static datasets of execution trajectories, but real traces may be sensitive, proprietary, or too small to support comprehensive testing. Practitioners therefore often replace or augment real datasets with synthetic ones, frequently without quantifying whether the synthetic data actually reflects the real data distribution. We introduce ESDAE, an evaluation framework for assessing how well synthetic benchmarks for multi-turn, tool-calling agents replicate the characteristics of real trajectories. ESDAE assesses the quality of synthetic data relative to a real dataset across four metric categories: (i) task instructions and intermediate responses, (ii) tool calls, (iii) final outputs, and (iv) downstream evaluation. We evaluate ESDAE on recent agent benchmarks and test common synthetic data failure modes via controlled generation schemes. ESDAE detects fine-grained variations in both data fidelity and diversity, and shows that no single metric fully characterizes synthetic data quality, motivating multi-axis evaluation of synthetic data for agent testing.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 47