Where Do Agents Differ? Interpretable Rule Discovery for Performance Differences Across Models and Data
Keywords: Evaluation, subgroups
Abstract: Agentic systems are trained and evaluated with a variety of backend models over a heterogeneous collection of tasks.
Typically, evaluation focuses on aggregate metrics over predefined categories and considers models in isolation.
In this work, we aim to identify characteristics of the input space *where* the performance between two configurations, e.g. different backend models, differs substantially.
To this end, we discover interpretable rules that describe task regimes with pronounced distributional differences.
We demonstrate our approach in two agentic use cases: on *SWE-bench*, we contrast a state-of-the-art coding agent across different backends, while on the *ChartQA* benchmark, we compare performance on synthetic and real data.
Our results show that performance differences are highly structured and can reverse across task regimes, revealing insights that are not captured by standard evaluation.
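
To make the idea of a rule describing a task regime concrete, the following is a minimal illustrative sketch, not the method of this submission. It assumes each task has numeric features and a per-task performance gap (score of configuration A minus score of configuration B), and it swaps in a simple stand-in criterion: among all single-feature threshold rules, pick the one whose mean gap deviates most from the overall mean gap. The feature name `files_touched`, the function `best_rule`, and the toy data are hypothetical.

```python
from statistics import mean


def best_rule(tasks, gaps, min_support=2):
    """Return the single-feature threshold rule whose mean gap deviates most
    from the overall mean gap (a simplified stand-in for a distributional score).

    tasks: list of dicts mapping feature name -> numeric value (hypothetical features)
    gaps:  per-task score(A) - score(B)
    """
    overall = mean(gaps)
    best = None
    for feat in tasks[0]:
        for thr in sorted({t[feat] for t in tasks}):
            for op in ("<=", ">"):
                # Collect the gaps of all tasks covered by the candidate rule.
                inside = [
                    g for t, g in zip(tasks, gaps)
                    if (t[feat] <= thr if op == "<=" else t[feat] > thr)
                ]
                if len(inside) < min_support:
                    continue  # skip rules covering too few tasks
                deviation = abs(mean(inside) - overall)
                if best is None or deviation > best["deviation"]:
                    best = {
                        "rule": f"{feat} {op} {thr}",
                        "mean_gap": mean(inside),
                        "support": len(inside),
                        "deviation": deviation,
                    }
    return best


# Toy data (fabricated for illustration only): A wins on small tasks, B on large ones.
tasks = [{"files_touched": n} for n in (1, 1, 2, 5, 6, 8)]
gaps = [1, 1, 0, -1, -1, -1]

result = best_rule(tasks, gaps)
print(result["rule"], result["mean_gap"], result["support"])
```

On this toy data the aggregate gap is close to zero, yet the selected rule isolates a regime where configuration A is consistently ahead, which is the kind of structure an aggregate metric hides.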
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 104