Traxgen: Ground-Truth Trajectory Generation for AI Agent Evaluation

Published: 06 Oct 2025 · Last Modified: 04 Nov 2025 · MTI-LLM @ NeurIPS 2025 Poster · CC BY-ND 4.0
Keywords: Multi-turn, Tool-augmented LLM, Trajectory Planning, AI Agent Evaluation, Benchmark, LLM
TL;DR: Traxgen is a Python toolkit that generates deterministic, gold-standard trajectories from structured workflows, over 17,000× faster than LLM-based methods. We use it to benchmark LLM planning across workflow complexity, input format, and prompt style.
Abstract: As AI agents take on complex, goal-driven workflows, response-level evaluation becomes insufficient. Trajectory-level evaluation offers deeper insight but typically relies on high-quality reference trajectories that are costly to curate or prone to LLM noise. We introduce \texttt{Traxgen}, a Python toolkit that constructs gold-standard trajectories via directed acyclic graphs (DAGs) built from structured workflow specifications and user data. \texttt{Traxgen} generates deterministic trajectories that align with human-validated references and achieve average median speedups of over 17,000× compared to LLM-based methods. To probe LLM reasoning, we compared models across three workflow complexities (simple, intermediate, complex), two input formats (natural language vs. JSON), and three prompt styles (Vanilla, ReAct, and ReAct-few-shot). While LLM performance varied, \texttt{Traxgen} outperformed every configuration in both accuracy and efficiency. Our results underscore LLM planning limitations and establish \texttt{Traxgen} as a scalable, resource-efficient tool for evaluating planning-intensive AI agents in compliance-critical domains.
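For intuition, the sketch below illustrates the DAG-based construction the abstract describes, under assumptions of ours: a hypothetical workflow spec that maps each step to its prerequisite steps, and lexicographic tie-breaking for the linearization. It is not \texttt{Traxgen}'s actual API or spec format, only the underlying idea of deterministically linearizing a workflow DAG into a reference trajectory.

from graphlib import TopologicalSorter  # Python 3.9+ standard library

# Hypothetical workflow spec (illustrative only): step -> prerequisite steps.
workflow = {
    "verify_identity": [],
    "check_eligibility": ["verify_identity"],
    "fetch_account": ["verify_identity"],
    "apply_refund": ["check_eligibility", "fetch_account"],
    "send_confirmation": ["apply_refund"],
}

def gold_trajectory(spec):
    """Deterministically linearize a workflow DAG into a reference trajectory."""
    ts = TopologicalSorter(spec)
    ts.prepare()
    trajectory = []
    while ts.is_active():
        # Sort the ready steps so ties break the same way every run.
        ready = sorted(ts.get_ready())
        trajectory.extend(ready)
        ts.done(*ready)
    return trajectory

print(gold_trajectory(workflow))
# ['verify_identity', 'check_eligibility', 'fetch_account', 'apply_refund', 'send_confirmation']

Because the tie-breaking rule is fixed, the same spec always yields the same trajectory; that determinism (no LLM sampling noise) is what makes such output usable as a gold-standard reference.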
Submission Number: 103