Compositional Underdetermination in AI Agents: When Behavioral Success Is Not Compositional Evidence
Keywords: compositional generalization, agent evaluation, LLM agents, mechanistic interpretability, evaluation methodology, tool use, retrieval-augmented generation, AI safety, behavioral equivalence, diagnostic evaluation
TL;DR: Behavioral success on agent benchmarks underdetermines whether the agent actually composes; we give a six-surface taxonomy and a per-surface minimum reviewable diagnostic that any existing benchmark can adopt as a reporting standard.
Abstract: Modern agent evaluations often report end-to-end success rates, yet such scores rarely establish whether an agent has acquired reusable compositional structure or brittle task-specific routines. We name this gap compositional underdetermination: behavioral success on compositional tasks can remain compatible with multiple incompatible explanations of how an agent plans, calls tools, retrieves information, represents state, or composes safety constraints. We formalize the gap as a relation on behavioral equivalence classes and argue that compositionality claims for agents require paired behavioral and diagnostic evidence. The paper contributes an agent-specific taxonomy of compositionality surfaces, a minimum reviewable diagnostic for each surface, and a reporting standard that existing benchmarks can adopt. The result is a lightweight reporting protocol for agent benchmarks that connects compositional generalization, interpretability diagnostics, and safety-constraint evaluation without requiring a new benchmark.
Submission Number: 125