How to Train Your LLM Web Agent: A Statistical Diagnosis

Published: 06 Oct 2025 · Last Modified: 04 Nov 2025
Venue: MTI-LLM @ NeurIPS 2025 (Poster)
License: CC BY-ND 4.0
Keywords: web agents, reasoning, llms
Abstract: LLM-based web agents have recently made significant progress, but much of it has occurred in closed-source systems, widening the gap with open-source alternatives. Progress has been held back by two key challenges: first, a narrow focus on single-step tasks that overlooks the complexity of multi-step web interactions, and second, the high compute cost of post-training LLM-based web agents. To address this, we present the first statistically grounded study of compute allocation for post-training LLM web agents. Our approach uses a two-stage pipeline: a Llama 3.1 8B student is trained to imitate a Llama 3.3 70B teacher via supervised fine-tuning (SFT), followed by on-policy reinforcement learning (RL). We find this process highly sensitive to hyperparameter choices, making exhaustive sweeps impractical. To spare others expensive trial-and-error, we sample 1,370 configurations and use bootstrapping to estimate effective hyperparameters. Our results show that combining SFT with on-policy RL consistently outperforms either approach alone on both WorkArena and MiniWoB++. Moreover, this strategy requires only 55% of the compute needed to match the peak performance of pure SFT on MiniWoB++, pushing the compute-performance Pareto frontier, and it is the only strategy that closes the gap with closed-source models.
Submission Number: 186
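
The bootstrapping step described in the abstract lends itself to a short illustration. The sketch below is a minimal Python example, not the paper's actual pipeline: it estimates, from a pool of scores gathered by random hyperparameter sampling, the expected best score when drawing k configurations at random, with a bootstrap confidence interval. The function name `bootstrap_best_of_k`, the dummy uniform scores, and the choice of k are assumptions made purely for illustration.

```python
import numpy as np

def bootstrap_best_of_k(scores, k, n_boot=10_000, seed=0):
    """Estimate the expected best validation score when sampling k
    hyperparameter configurations at random, with a bootstrap 95% CI.
    `scores` holds one score per previously sampled configuration."""
    rng = np.random.default_rng(seed)
    scores = np.asarray(scores)
    # Each bootstrap replicate draws k configurations with replacement
    # from the observed pool and keeps the best score among them.
    draws = rng.choice(scores, size=(n_boot, k), replace=True)
    best = draws.max(axis=1)
    lo, hi = np.percentile(best, [2.5, 97.5])
    return best.mean(), (lo, hi)

# Hypothetical usage: stand-in success rates for 1,370 sampled configs.
scores = np.random.default_rng(1).uniform(0.2, 0.7, size=1370)
mean, (lo, hi) = bootstrap_best_of_k(scores, k=16)
print(f"expected best of 16 runs: {mean:.3f} (95% CI {lo:.3f}-{hi:.3f})")
```

Resampling with replacement from the observed pool yields a distribution over "best of k" outcomes, which is one standard way to turn a fixed set of random-search runs into estimates of what a given hyperparameter-search budget buys.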