AdaptAgent: Adapting Multimodal Web Agents with Few-Shot Learning from Human Demonstrations

Published: 10 Oct 2024 · Last Modified: 19 Nov 2024 · AFM 2024 Poster · CC BY 4.0
Keywords: multimodal web agents, few-shot learning, efficient adaptation
TL;DR: A framework that adapts multimodal web agents to new, unseen websites and domains using a few human demonstrations, achieving up to a 65.75% relative improvement in task success rate over baselines, including current SoTA agents.
Abstract: State-of-the-art multimodal web agents, powered by Multimodal Large Language Models (MLLMs), can autonomously execute many web tasks by processing user instructions and interacting with graphical user interfaces (GUIs). Current strategies for building web agents rely on (i) the generalizability of underlying MLLMs and their steerability via prompting, and (ii) large-scale fine-tuning of MLLMs on web-related tasks. However, web agents still struggle to automate tasks on unseen websites and domains, limiting their applicability to enterprise-specific and proprietary websites/domains. Beyond generalization from large-scale pre-training and fine-tuning, we propose building agents for few-shot adaptability using human demonstrations. We introduce the AdaptAgent framework, which enables both proprietary and open-weights multimodal web agents to adapt to new websites and domains using a few human demonstrations (at most 2). Our experiments on two popular benchmarks—Mind2Web and VisualWebArena—show that using in-context demonstrations (for proprietary models) or meta-adaptation demonstrations (for meta-learned open-weights models) boosts task success rate by 3.36% to 7.21% over non-adapted state-of-the-art models, corresponding to a relative increase of 21.03% to 65.75%. Our results unlock a complementary axis for developing widely applicable multimodal web agents beyond large-scale pre-training and fine-tuning, emphasizing few-shot adaptability.
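
To make the adaptation idea concrete, the sketch below shows one plausible way to condition a proprietary multimodal agent on up to 2 human demonstrations as in-context examples, interleaving each demonstration's screenshots and recorded actions before the current task. This is a minimal illustration, not the authors' implementation: the `Demonstration` dataclass, `build_adaptation_prompt` helper, and the OpenAI-style multimodal message format are all assumptions made for the example.

```python
# Minimal sketch (not the AdaptAgent codebase): in-context adaptation of a
# proprietary multimodal web agent with up to 2 human demonstrations.
# All identifiers and the chat-message format are illustrative assumptions.

from dataclasses import dataclass
from typing import List


@dataclass
class Demonstration:
    """One human demonstration: a task goal plus its recorded step-by-step trace."""
    task: str                 # e.g. "Add the cheapest monitor to the cart"
    screenshots: List[str]    # base64-encoded GUI screenshots, one per step
    actions: List[str]        # e.g. "CLICK [id=42]", "TYPE [id=7] 'monitor'"


def build_adaptation_prompt(demos: List[Demonstration],
                            current_task: str,
                            current_screenshot: str) -> list:
    """Interleave up to 2 demonstrations with the current observation,
    producing a multimodal message list for an MLLM chat API."""
    messages = [{"role": "system",
                 "content": "You are a web agent. Predict the next GUI action."}]
    for demo in demos[:2]:  # the framework uses at most 2 demonstrations
        for shot, action in zip(demo.screenshots, demo.actions):
            messages.append({"role": "user", "content": [
                {"type": "text", "text": f"Task: {demo.task}\nObservation:"},
                {"type": "image_url",
                 "image_url": {"url": f"data:image/png;base64,{shot}"}},
            ]})
            # The human's action at this step serves as the in-context label.
            messages.append({"role": "assistant", "content": action})
    # Finally, append the unseen website's task and current screenshot.
    messages.append({"role": "user", "content": [
        {"type": "text", "text": f"Task: {current_task}\nObservation:"},
        {"type": "image_url",
         "image_url": {"url": f"data:image/png;base64,{current_screenshot}"}},
    ]})
    return messages
```

For open-weights models, the paper's meta-adaptation path instead fine-tunes the model so that a handful of such demonstrations suffices to specialize it to a new website; the prompt-construction step above would then feed the adaptation examples rather than a frozen proprietary model.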
Submission Number: 11