Keywords: Workflow Execution, LLM Robustness, Probabilistic Tool Behavior
TL;DR: We propose PILOT-Bench, a benchmark that evaluates LLM workflow execution under simulated realistic conditions with probabilistic tool failures and variable instruction quality.
Abstract: We introduce PILOT-Bench, a benchmark that evaluates LLM workflow execution under simulated realistic conditions of variable instruction quality and tool execution uncertainty. Unlike existing benchmarks, which encounter these challenges only incidentally, our work makes uncertainty the primary focus of systematic study. The benchmark incorporates three key aspects: (1) modeling of probabilistic tool behaviors through parameterized error models that simulate real-world API failure patterns, (2) provision of MDP-derived workflows that maximize expected success rates, and (3) systematic evaluation of model robustness through controlled perturbations of workflow instruction quality. Our construction pipeline generates 5,040 tasks from a tool library of 30 APIs. Evaluation across widely used large language models under probabilistic tool failures and varying instruction quality reveals notable performance differences: MDP-optimal workflow prompts achieve an average success rate of 62.1\%, compared with 50.8\% for Chain-of-Thought prompts and 54.3\% for flawed workflow prompts. Our benchmark is available at \url{https://github.com/PilotBenchAnonymous/PilotBench}.
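To make the abstract's terms concrete, the minimal sketch below illustrates one possible reading of "parameterized error models" and "maximize expected success rates": per-tool failure probabilities and the expected success of a linear workflow under a fixed retry budget. All names (`ProbabilisticTool`, `expected_success`, the example failure rates) are hypothetical illustrations, not the benchmark's actual implementation.

```python
import random

class ProbabilisticTool:
    """A tool whose calls fail with a parameterized probability,
    returning API-style error types on failure (illustrative only)."""

    def __init__(self, name, failure_rate,
                 error_types=("timeout", "rate_limit", "server_error")):
        self.name = name
        self.failure_rate = failure_rate
        self.error_types = error_types

    def call(self, rng=random):
        # Simulate one invocation: ('ok', tool name) or ('error', error type).
        if rng.random() < self.failure_rate:
            return ("error", rng.choice(self.error_types))
        return ("ok", self.name)


def expected_success(workflow, retries_per_step=1):
    """Expected probability that every step of a linear workflow succeeds,
    assuming independent failures and a fixed number of retries per step."""
    p = 1.0
    for tool in workflow:
        p_step_fail = tool.failure_rate ** (1 + retries_per_step)
        p *= 1.0 - p_step_fail
    return p


if __name__ == "__main__":
    workflow = [ProbabilisticTool("search_api", 0.2),
                ProbabilisticTool("payment_api", 0.1)]
    print(f"Expected success with one retry per step: {expected_success(workflow):.3f}")
```

Under this reading, an MDP-derived workflow would be the sequence (and retry policy) that maximizes such an expected-success quantity over the available tools.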
Primary Area: datasets and benchmarks
Submission Number: 10859