Keywords: Workflow Execution, LLM Robustness, Probabilistic Tool Behavior
TL;DR: We propose PILOT-Bench, a benchmark that evaluates LLM workflow execution under simulated realistic conditions with probabilistic tool failures and variable instruction quality.
Abstract: We introduce PILOT-Bench, a benchmark that evaluates LLM workflow execution under simulated realistic conditions of variable instruction quality and tool execution uncertainty. Unlike existing benchmarks, which encounter these challenges only incidentally, our work makes uncertainty the primary focus of systematic study. The benchmark incorporates three key aspects: (1) modeling of probabilistic tool behaviors through parameterized error models that simulate real-world API failure patterns, (2) provision of MDP-derived workflows that maximize expected success rates, and (3) systematic evaluation of model robustness through controlled perturbations of workflow instruction quality. Our construction pipeline generates 5,040 tasks from a tool library of 30 APIs. Evaluation across widely used large language models under probabilistic tool failures and varying instruction quality reveals notable performance differences: MDP-optimal workflow prompts achieve an average success rate of 62.1\%, compared with 50.8\% for Chain-of-Thought prompts and 54.3\% for flawed workflow prompts. Our benchmark is available at \url{https://github.com/PilotBenchAnonymous/PilotBench}.
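To make the abstract's terms concrete, the minimal sketch below illustrates one possible reading of "parameterized error models" and "maximize expected success rates": per-tool failure probabilities and the expected success of a linear workflow under a fixed retry budget. All names (`ProbabilisticTool`, `expected_success`, the example failure rates) are hypothetical illustrations, not the benchmark's actual implementation.

```python
import random

class ProbabilisticTool:
    """A tool whose calls fail with a parameterized probability,
    returning API-style error types on failure (illustrative only)."""

    def __init__(self, name, failure_rate,
                 error_types=("timeout", "rate_limit", "server_error")):
        self.name = name
        self.failure_rate = failure_rate
        self.error_types = error_types

    def call(self, rng=random):
        # Simulate one invocation: ('ok', tool name) or ('error', error type).
        if rng.random() < self.failure_rate:
            return ("error", rng.choice(self.error_types))
        return ("ok", self.name)


def expected_success(workflow, retries_per_step=1):
    """Expected probability that every step of a linear workflow succeeds,
    assuming independent failures and a fixed number of retries per step."""
    p = 1.0
    for tool in workflow:
        p_step_fail = tool.failure_rate ** (1 + retries_per_step)
        p *= 1.0 - p_step_fail
    return p


if __name__ == "__main__":
    workflow = [ProbabilisticTool("search_api", 0.2),
                ProbabilisticTool("payment_api", 0.1)]
    print(f"Expected success with one retry per step: {expected_success(workflow):.3f}")
```

Under this reading, an MDP-derived workflow would be the sequence (and retry policy) that maximizes such an expected-success quantity over the available tools.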
Primary Area: datasets and benchmarks
Submission Number: 10859