A Protocol-Driven Platform for Agent-Agnostic Evaluation of LLM Agents

Published: 24 Sept 2025 · Last Modified: 01 Dec 2025
NeurIPS 2025 LLM Evaluation Workshop Poster
License: CC BY 4.0
Keywords: LLM Agents, Agent Evaluation, Benchmarking, Evaluation Protocols, Reproducibility, Metrics Taxonomy
Abstract: The rapid growth of LLM-based agents has outpaced efforts to establish solid evaluation standards, leading to mismatched formats and idiosyncratic reporting that make fair comparisons difficult. To address this, we built a platform around a protocol-driven design that decouples evaluation from agents' internal implementations and focuses on what can actually be observed and measured. At its core is the Remote Agent Integration Protocol (RAIP), a minimalist interface with just two endpoints: one for retrieving data schemas (/info) and one for running tasks (/invoke). Combined with TaskConfig, our declarative layer for constructing dynamic input templates and extracting outputs deterministically via JMESPath, this design makes swapping agents frictionless and requires no manual adaptation. Our benchmark structure ensures reproducibility by pinning versions. In addition, we consolidate prevalent metrics into 11 well-defined types across 4 categories for comprehensive evaluation. To validate feasibility, we ran a Course Advisor task with 30 examples, evaluating a Teaching Assistant agent across three datasets (searching, details, and recommendation) and comparing function-calling and workflow-graph architectures. The results highlight RAIP's interoperability: agents were exchanged with no changes to data or scoring. Ultimately, this standardized approach can lead to evaluations that are more equitable, reproducible, and extensible.
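
The sketch below illustrates, in Python, how an evaluation harness might interact with the two RAIP endpoints described in the abstract and apply a TaskConfig-style input template with deterministic JMESPath output extraction. It is a minimal illustration under assumptions: the base URL, the TaskConfig field names (input_template, output_expression), and the shape of the /invoke payload are hypothetical, as the abstract only specifies the two endpoints and the use of JMESPath.

```python
import jmespath   # real library: jmespath.search(expression, data)
import requests

# Hypothetical TaskConfig: an input template plus a JMESPath expression
# for extracting the agent's answer from the /invoke response.
TASK_CONFIG = {
    "input_template": "Recommend a course for a student interested in {topic}.",
    "output_expression": "result.answer",   # assumed response field
}

AGENT_BASE_URL = "http://localhost:8000"    # placeholder agent endpoint


def fetch_info(base_url: str) -> dict:
    """Retrieve the agent's input/output schemas via the /info endpoint."""
    resp = requests.get(f"{base_url}/info", timeout=10)
    resp.raise_for_status()
    return resp.json()


def invoke(base_url: str, example: dict, config: dict) -> str:
    """Render the input template, call /invoke, and extract the output."""
    rendered_input = config["input_template"].format(**example)
    resp = requests.post(
        f"{base_url}/invoke",
        json={"input": rendered_input},     # assumed request shape
        timeout=60,
    )
    resp.raise_for_status()
    # Deterministic extraction: the same expression yields the same field
    # regardless of which agent implementation sits behind the protocol.
    return jmespath.search(config["output_expression"], resp.json())


if __name__ == "__main__":
    print(fetch_info(AGENT_BASE_URL))
    print(invoke(AGENT_BASE_URL, {"topic": "machine learning"}, TASK_CONFIG))
```

Because the harness only depends on the two endpoints and the declarative TaskConfig, swapping a function-calling agent for a workflow-graph agent amounts to pointing AGENT_BASE_URL at a different server; the data and scoring code stay untouched.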
Submission Number: 224