Abstract: As AI agents take on complex, goal-driven workflows, response-level evaluation becomes insufficient. Trajectory-level evaluation offers deeper insight but typically relies on high-quality reference trajectories that are costly to curate or prone to LLM sampling noise. We introduce Traxgen, a Python toolkit that constructs gold-standard trajectories via directed acyclic graphs (DAGs) built from structured workflow specifications and user data. Traxgen generates deterministic trajectories that align perfectly with human-validated references and achieve average median speedups of over 17,000x compared to LLM-based methods. To probe LLM reasoning, we compared multiple models across three workflow complexities (simple, intermediate, complex), two input formats (natural language vs. JSON), and three prompt styles (vanilla, ReAct, and ReAct-few-shot). While LLM performance varied, Traxgen outperformed every configuration in both accuracy and efficiency. Our results shed light on LLM planning limitations and establish Traxgen as a more scalable, resource-efficient alternative for reproducible evaluation of planning-intensive AI agents.
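To make the DAG-based idea concrete, the sketch below shows one way a deterministic gold trajectory could be derived from a structured workflow specification. This is not Traxgen's actual API; the spec schema and the function names (`build_dag`, `generate_trajectory`) are illustrative assumptions, and the example data is a toy workflow rather than anything taken from the paper.

```python
"""Illustrative sketch (not Traxgen's API): build a DAG from a workflow
spec and emit a deterministic trajectory via topological ordering."""
from collections import defaultdict


def build_dag(workflow_spec):
    """Turn a structured workflow spec into successor and in-degree maps."""
    successors = defaultdict(list)
    in_degree = {step["id"]: 0 for step in workflow_spec["steps"]}
    for step in workflow_spec["steps"]:
        for dep in step.get("depends_on", []):
            successors[dep].append(step["id"])
            in_degree[step["id"]] += 1
    return successors, in_degree


def generate_trajectory(workflow_spec):
    """Kahn's algorithm with sorted tie-breaking, so output is reproducible."""
    successors, in_degree = build_dag(workflow_spec)
    ready = sorted(s for s, d in in_degree.items() if d == 0)
    trajectory = []
    while ready:
        step = ready.pop(0)  # smallest-id ready step first -> deterministic order
        trajectory.append(step)
        for nxt in successors[step]:
            in_degree[nxt] -= 1
            if in_degree[nxt] == 0:
                ready.append(nxt)
        ready.sort()
    if len(trajectory) != len(in_degree):
        raise ValueError("workflow spec contains a cycle; not a valid DAG")
    return trajectory


if __name__ == "__main__":
    # Hypothetical customer-service workflow, purely for illustration.
    spec = {
        "steps": [
            {"id": "verify_identity"},
            {"id": "lookup_booking", "depends_on": ["verify_identity"]},
            {"id": "check_refund_policy", "depends_on": ["lookup_booking"]},
            {"id": "issue_refund", "depends_on": ["check_refund_policy"]},
        ]
    }
    print(generate_trajectory(spec))
    # ['verify_identity', 'lookup_booking', 'check_refund_policy', 'issue_refund']
```

Because the ordering depends only on the spec (with sorted tie-breaking among ready steps), the same input always yields the same trajectory, which is the property that lets a DAG-driven generator serve as a reproducible gold standard without LLM sampling noise.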
Paper Type: Long
Research Area: Dialogue and Interactive Systems
Research Area Keywords: AI Agents, Evaluation, Benchmarking, Trajectory, Multi-Agent
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: Python
Submission Number: 4973