Abstract: As AI agents take on complex, goal-driven workflows, response-level evaluation becomes insufficient. Trajectory-level evaluation offers deeper insight but typically relies on high-quality reference trajectories that are costly to curate or prone to LLM sampling noise. We introduce Traxgen, a Python toolkit that constructs gold-standard trajectories via directed acyclic graphs (DAGs) built from structured workflow specifications and user data. Traxgen generates deterministic trajectories that align perfectly with human-validated references and achieve average median speedups of over 17,000x compared to LLM-based methods. To probe LLM reasoning, we compared multiple models across three workflow complexities (simple, intermediate, complex), two input formats (natural language vs. JSON), and three prompt styles (vanilla, ReAct, and ReAct-few-shot). While LLM performance varied, Traxgen outperformed every configuration in both accuracy and efficiency. Our results shed light on LLM planning limitations and establish Traxgen as a more scalable, resource-efficient alternative for reproducible evaluation of planning-intensive AI agents.
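To make the DAG-based idea concrete, the sketch below shows one way a deterministic gold trajectory could be derived from a structured workflow specification. This is not Traxgen's actual API; the spec schema and the function names (`build_dag`, `generate_trajectory`) are illustrative assumptions, and the example data is a toy workflow rather than anything taken from the paper.

```python
"""Illustrative sketch (not Traxgen's API): build a DAG from a workflow
spec and emit a deterministic trajectory via topological ordering."""
from collections import defaultdict


def build_dag(workflow_spec):
    """Turn a structured workflow spec into successor and in-degree maps."""
    successors = defaultdict(list)
    in_degree = {step["id"]: 0 for step in workflow_spec["steps"]}
    for step in workflow_spec["steps"]:
        for dep in step.get("depends_on", []):
            successors[dep].append(step["id"])
            in_degree[step["id"]] += 1
    return successors, in_degree


def generate_trajectory(workflow_spec):
    """Kahn's algorithm with sorted tie-breaking, so output is reproducible."""
    successors, in_degree = build_dag(workflow_spec)
    ready = sorted(s for s, d in in_degree.items() if d == 0)
    trajectory = []
    while ready:
        step = ready.pop(0)  # smallest-id ready step first -> deterministic order
        trajectory.append(step)
        for nxt in successors[step]:
            in_degree[nxt] -= 1
            if in_degree[nxt] == 0:
                ready.append(nxt)
        ready.sort()
    if len(trajectory) != len(in_degree):
        raise ValueError("workflow spec contains a cycle; not a valid DAG")
    return trajectory


if __name__ == "__main__":
    # Hypothetical customer-service workflow, purely for illustration.
    spec = {
        "steps": [
            {"id": "verify_identity"},
            {"id": "lookup_booking", "depends_on": ["verify_identity"]},
            {"id": "check_refund_policy", "depends_on": ["lookup_booking"]},
            {"id": "issue_refund", "depends_on": ["check_refund_policy"]},
        ]
    }
    print(generate_trajectory(spec))
    # ['verify_identity', 'lookup_booking', 'check_refund_policy', 'issue_refund']
```

Because the ordering depends only on the spec (with sorted tie-breaking among ready steps), the same input always yields the same trajectory, which is the property that lets a DAG-driven generator serve as a reproducible gold standard without LLM sampling noise.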
Paper Type: Long
Research Area: Dialogue and Interactive Systems
Research Area Keywords: AI Agents, Evaluation, Benchmarking, Trajectory, Multi-Agent
Contribution Types: NLP engineering experiment, Publicly available software and/or pre-trained models
Languages Studied: Python
Submission Number: 4973