Traxgen: Ground-Truth Trajectory Generation for AI Agent Evaluation

Published: 28 Sept 2025, Last Modified: 20 Oct 2025
Venue: SEA @ NeurIPS 2025 Poster
License: CC BY 4.0
Keywords: Multi-turn, Tool-augmented LLM, Trajectory Planning, AI Agent Evaluation, Benchmark, LLM
TL;DR: Traxgen is a Python toolkit that generates deterministic, gold-standard trajectories from structured workflows, running over 17,000× faster than LLM-based methods. We use it to benchmark LLM planning across workflow complexity, input format, and prompt style.
Abstract: As AI agents take on complex, goal-driven workflows, response-level evaluation becomes insufficient. Trajectory-level evaluation offers deeper insight but typically relies on high-quality reference trajectories that are costly to curate or prone to LLM noise. We introduce Traxgen, a Python toolkit that constructs gold-standard trajectories via directed acyclic graphs (DAGs) built from structured workflow specifications and user data. Traxgen generates deterministic trajectories that align with human-validated references, achieving average median speedups of over 17,000× compared to LLM-based methods. To probe LLM reasoning, we compared models across three workflow complexities (simple, intermediate, complex), two input formats (natural language vs. JSON), and three prompt styles (Vanilla, ReAct, and ReAct-few-shot). While LLM performance varied, Traxgen outperformed every configuration in both accuracy and efficiency. Our results underscore LLM planning limitations and establish Traxgen as a scalable, resource-efficient tool for evaluating planning-intensive AI agents in compliance-critical domains.
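To make the abstract's mechanism concrete, here is a minimal Python sketch of DAG-based trajectory generation, assuming a hypothetical workflow schema; the function name, schema, and condition mechanism are illustrative assumptions, not Traxgen's actual API:

    # Illustrative only: Traxgen's real API and workflow schema are not shown on
    # this page, so the names and condition mechanism below are assumptions.
    from graphlib import TopologicalSorter

    def build_trajectory(workflow: dict, user_data: dict) -> list[str]:
        """Derive one deterministic tool-call trajectory from a workflow spec.

        `workflow["steps"]` maps each step to the set of steps it depends on;
        `workflow["conditions"]` optionally gates steps on the user record.
        """
        conditions = workflow.get("conditions", {})
        # Keep only the steps whose preconditions hold for this user.
        active = {
            step: deps
            for step, deps in workflow["steps"].items()
            if conditions.get(step, lambda u: True)(user_data)
        }
        # Drop edges to pruned steps so the remaining graph stays a valid DAG.
        pruned = {s: {d for d in deps if d in active} for s, deps in active.items()}
        # A topological order over the DAG is the reference trajectory; for a
        # fixed spec and user record it is fully deterministic.
        return list(TopologicalSorter(pruned).static_order())

    # Example: a verified user skips the identity-verification step.
    workflow = {
        "steps": {
            "verify_identity": set(),
            "check_eligibility": {"verify_identity"},
            "process_refund": {"check_eligibility"},
        },
        "conditions": {"verify_identity": lambda u: not u.get("verified", False)},
    }
    print(build_trajectory(workflow, {"user_id": "u1", "verified": True}))
    # -> ['check_eligibility', 'process_refund']

The determinism in such a scheme comes from the trajectory being a pure function of the workflow spec and the user record, with no sampling involved, which is what makes LLM-free gold-standard generation feasible.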
Archival Option: The authors of this submission want it to appear in the archival proceedings.
Submission Number: 80