Keywords: LLM Agents, Agent Evaluation, Benchmarking, Evaluation Protocols, Reproducibility, Metrics Taxonomy
Abstract: The rapid growth of LLM-based agents has outpaced efforts to establish sound evaluation standards, leading to incompatible interfaces and idiosyncratic reporting that often prevent fair comparison. To address this, we built an evaluation platform around a protocol-based design that decouples evaluation from agents' internal implementations, focusing on what can be externally observed and measured. At its core is our Remote Agent Integration Protocol (RAIP), a minimal interface with just two endpoints: one for retrieving data schemas (/info) and one for executing tasks (/invoke). Combined with TaskConfig, a configuration layer that templates task inputs and extracts outputs consistently via JMESPath, agents can be swapped without any additional integration work. Our benchmark structure ensures reproducibility by pinning versions. In addition, we consolidate commonly used metrics into 11 well-defined types across 4 categories for comprehensive evaluation. To validate the platform, we ran a Course Advisor task with 30 examples, evaluating a Teaching Assistant agent on three datasets (search, detail lookup, and recommendation) and comparing function-calling and workflow-graph architectures. The results show that RAIP enables seamless agent swapping with no changes to data or scoring. We argue that such a standard can lead to fairer and more extensible evaluation of LLM agents.
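The following is a minimal sketch of the RAIP interaction described in the abstract, assuming a JSON-over-HTTP agent server. The endpoint paths (/info, /invoke) come from the abstract; the base URL, request/response field names, and the JMESPath expression are illustrative assumptions, not the paper's actual schema.

```python
# Sketch of evaluating one task against a RAIP-compliant agent.
# Assumptions: the agent speaks JSON over HTTP; field names are hypothetical.
import requests   # pip install requests
import jmespath   # pip install jmespath

AGENT_URL = "http://localhost:8000"  # hypothetical agent endpoint

# 1) Discover the agent's data schemas via the /info endpoint.
schemas = requests.get(f"{AGENT_URL}/info").json()

# 2) Build the task input from a TaskConfig-style template (illustrative only).
task_input = {"query": "Which courses cover linear algebra?"}

# 3) Execute the task via /invoke and extract the answer with a JMESPath
#    expression, so the harness never depends on the agent's internal structure.
response = requests.post(f"{AGENT_URL}/invoke", json=task_input).json()
answer = jmespath.search("result.answer", response)  # extraction path is an assumption
print(answer)
```

Because the harness only touches /info, /invoke, and a declarative extraction path, swapping agents amounts to changing AGENT_URL and, at most, the JMESPath expression.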
Submission Number: 224