General Agent Evaluation

Published: 01 Mar 2026, Last Modified: 24 Apr 2026, Venue: ICLR 2026 AIWILD, Readers: Everyone, License: CC BY 4.0
Keywords: General, Generalist, Agent, AGI
Abstract: General-purpose agents perform tasks in unfamiliar environments without domain-specific customization. Yet no study has systematically measured how agent architecture contributes to performance across heterogeneous protocols and diverse environments. We present the first such study, comparing tool-calling, MCP, code-generation, and CLI agents on the same benchmarks with the same models. Two gaps previously blocked this comparison: existing harnesses restrict agents to a single protocol class (web for BrowserGym, CLI for Harbor) or require manual per-benchmark wiring; and benchmarks themselves assume a human integrator who customizes the agent for each benchmark. We close these gaps with three contributions: (1) the Unified Protocol, a benchmark-agent mediation layer derived from existing interaction patterns; (2) Exgentic, an evaluation harness that surfaces benchmarks to any general-purpose agent paired with any backbone model; and (3) the first Open General Agent Leaderboard, a full factorial over five agents × three LLMs × six benchmarks spanning software engineering, customer service, deep research, and personal assistance. Our key findings are: (i) general agents adapt across all five domains with no per-domain customization, producing non-trivial performance; (ii) model choice dominates variance 85-fold over agent architecture, yet agent choice still swings results by up to 11 percentage points within a single model; (iii) on more than half of the benchmarks, general agents match or beat top published domain-specific scores. We release everything at www.exgentic.ai.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 118