Exploring Expert Failures Improves LLM Agent Tuning

20 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: LLM agent, Finetuning, Imitation Learning, Reinforcement learning
TL;DR: Our method, EEF, leverages beneficial actions from failed expert trajectories to enhance LLM agents, achieving SOTA performance on challenging benchmarks such as WebShop and SciWorld.
Abstract: Large Language Models (LLMs) have tremendous potential as agents, excelling in tasks that require multiple rounds of decision-making. For large-scale deployment, a smaller LLM is commonly fine-tuned by learning from teacher-model trajectories and subsequently improving itself via interaction with the environment. A key challenge is that many complex training tasks never yield a successful trajectory (zero reward): the teacher's trajectories fail to solve them, and the student's limited exploration cannot discover one despite many attempts. Without reward signals during training, the student is unlikely to solve similarly difficult test tasks. Applying Rejection Sampling Fine-Tuning (RFT) to WebShop highlights the issue: GPT-4 (the teacher) may succeed on only 36\% of the training tasks, and RFT inherently favors actions drawn from those successes. As a result, the student cannot complete most complex tasks for which the teacher provides no direct solution, because these tasks require more advanced action sequences. To discover reward signals on these complex tasks, we examined the teacher's failed trajectories and found that they often contain valuable guidance, such as plans and key actions, that the student seldom used during its own exploration. Motivated by this insight, we introduce Exploring Expert Failures (EEF), which uses expert actions to improve exploration during training and carefully incorporates them by masking out potentially harmful actions so they do not contaminate the learning process. This further allows the student model to utilize additional weaker yet more cost-efficient teachers, such as GPT-3.5 Turbo, without inheriting their suboptimal behaviors. Consequently, EEF resolves many previously unsolvable tasks and significantly improves agent performance on test tasks. Notably, our approach achieves a 62\% win rate on WebShop, surpassing both RFT (53.6\%) and GPT-4 (35.6\%). To the best of our knowledge, this establishes a new state of the art, with scores of 0.81 on WebShop and 81/100 on SciWorld, two widely used and challenging benchmarks for evaluating LLM agents.
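For a concrete picture of the masking idea described above, the sketch below shows one plausible form of the training objective: a token-level behavior-cloning loss in which tokens from potentially harmful expert actions are zeroed out so they contribute no gradient. This is a minimal illustration under stated assumptions, not the paper's implementation; the names `masked_bc_loss` and `beneficial_mask` are hypothetical, and the criterion EEF uses to label expert actions as beneficial or harmful is not specified in this abstract.

```python
import torch
import torch.nn.functional as F

def masked_bc_loss(logits: torch.Tensor,
                   target_ids: torch.Tensor,
                   beneficial_mask: torch.Tensor) -> torch.Tensor:
    """Behavior-cloning loss over an expert trajectory in which only
    tokens belonging to beneficial actions contribute to the gradient.

    logits:          (T, V) next-token logits from the student model
    target_ids:      (T,)   target token ids from the expert trajectory
    beneficial_mask: (T,)   1.0 for tokens of beneficial actions,
                            0.0 for potentially harmful ones (masked out)
    """
    per_token = F.cross_entropy(logits, target_ids, reduction="none")  # (T,)
    masked = per_token * beneficial_mask
    # Normalize by the number of kept tokens so the loss scale does not
    # depend on how much of the trajectory is masked.
    return masked.sum() / beneficial_mask.sum().clamp(min=1.0)

# Toy usage: a 5-step trajectory over a 10-token vocabulary,
# with steps 2 and 4 treated as potentially harmful and masked out.
logits = torch.randn(5, 10)
target_ids = torch.randint(0, 10, (5,))
mask = torch.tensor([1.0, 1.0, 0.0, 1.0, 0.0])
loss = masked_bc_loss(logits, target_ids, mask)
```

Masking at the loss level, rather than deleting harmful steps from the trajectory, keeps the surrounding context intact so the student still conditions on the full expert rollout while learning only from the useful actions.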
Primary Area: generative models
Submission Number: 23892