Keywords: LM planning, LLM agents, in-episode learning, ALFWorld, dual-process architecture
TL;DR: A dual-process architecture for within-episode failure recovery in LM-based planning agents: routes between TextGrad-style continuous policy refinement and Reflexion-style causal diagnosis using a per-step progress score.
Abstract: LLM agents stall on solvable interactive tasks: the agent commits to a wrong approach early, environment feedback is uninformative, the agent repeats minor variations of the same attempt, and the step budget runs out before the next trial begins. The information needed to escape sits in the post-failure trajectory, but existing methods do not explicitly enable in-episode recovery by combining local refinement and causal reasoning within a single adaptive framework. We present DPR, a dual-process architecture for in-episode failure recovery: a fast process applies TextGrad-style continuous refinement every $k{=}3$ steps; a slow process performs Reflexion-style causal diagnosis when $m{=}5$ consecutive low-progress scores fire the routing gate. Each slow activation emits three failure-recovery artifacts (a reproducible trigger, a diagnostic, and a verified fix). On the 134 ALFWorld tasks, with $n{=}10$ seeds and no demonstrations, DPR lifts open-weight Qwen-3-8B from $35.1\%$ to $75.4\%$ ($+40.3$pp), beating compute-matched 1-shot LATS by $+2.7$pp ($p{\approx}0.01$), ToT by $+5.7$pp ($p{<}10^{-4}$), and Self-Refine by $+6.7$pp ($p{<}10^{-5}$); on GPT-5 the lift is $46.3{\to}88.1\%$ ($+41.8$pp). The $1.5$pp cross-model lift difference is within seed noise ($p{\approx}0.13$), suggesting that the routing mechanism generalizes across model scales rather than depending on frontier-specific capability. The architecture establishes a strong demo-free operating point, complementary to demo-bootstrapped methods, which occupy a different regime.
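The routing rule described in the abstract can be illustrated with a minimal Python sketch. Only $k{=}3$ and $m{=}5$ come from the abstract; the class name `RoutingGate`, the `low_threshold` cut-off that defines a "low-progress" score, the streak reset after a slow activation, and the choice to let the slow process take priority when both conditions hold on the same step are all assumptions made for illustration, not details confirmed by the paper.

```python
from dataclasses import dataclass


@dataclass
class RoutingGate:
    """Hypothetical gate that routes each step to 'act', 'fast', or 'slow'."""
    k: int = 3                  # fast-process period (from the abstract)
    m: int = 5                  # consecutive low-progress scores that fire the gate (from the abstract)
    low_threshold: float = 0.2  # ASSUMPTION: score below this counts as low progress
    step: int = 0
    low_streak: int = 0

    def route(self, progress_score: float) -> str:
        self.step += 1
        # Track the run of consecutive low-progress scores.
        if progress_score < self.low_threshold:
            self.low_streak += 1
        else:
            self.low_streak = 0
        # ASSUMPTION: the slow process takes priority over the fast one.
        if self.low_streak >= self.m:
            self.low_streak = 0  # ASSUMPTION: reset the streak after a slow activation
            return "slow"        # Reflexion-style causal diagnosis
        if self.step % self.k == 0:
            return "fast"        # TextGrad-style refinement every k steps
        return "act"             # ordinary environment step, no intervention


def demo() -> list:
    gate = RoutingGate()
    scores = [0.5, 0.4, 0.1, 0.1, 0.1, 0.1, 0.1, 0.6, 0.1]
    return [gate.route(s) for s in scores]
```

In this sketch, `demo()` walks through nine illustrative scores: the fast process fires on steps 3, 6, and 9, and the slow process fires on step 7, when the fifth consecutive low score arrives.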
Submission Number: 211