Abstract: LLM agents are advancing in handling web-based tasks. However, most LLM web agents rely on prompting general-purpose, proprietary models like GPT-4, which are not specifically trained to process web languages (e.g., HTML) or to perform long-horizon planning. We explore an alternative approach that fine-tunes open-source LLMs on production-scale workflow data collected from over 250 domains, totaling 6 billion tokens. This approach yields substantial gains over prompting-based agents on existing benchmarks: our agent achieves state-of-the-art action generation performance on the Mind2Web benchmark and improves the task success rate by 7.3% over existing prompting-based agents on WebArena. We perform detailed ablation studies on various fine-tuning design choices and provide valuable insights into LLM selection, training recipes, context window optimization, and the effect of dataset size.
Submission Length: Regular submission (no more than 12 pages of main content)
Assigned Action Editor: ~Aleksandra_Faust1
Submission Number: 4306