From Observation to Abstractions: Efficient In-Context Learning from Human Feedback and Visual Demonstrations for VLM Agents

Published: 20 Jun 2024 · Last Modified: 07 Aug 2024 · TAFM@RLC 2024 · CC BY 4.0
Track Selection: Short paper track.
Keywords: short-paper track, Multi-modal LLMs, VLM Agents, In-Context Learning, Visual Demonstrations, Human Feedback, Reinforcement Learning, Instruction Following, Autonomous Web Agents, Ego4D
TL;DR: ICAL improves VLM agents through in-context learning of vision and language abstractions from minimal visual demonstrations and human feedback, improving decision-making on new tasks.
Abstract: We propose an efficient method, In-Context Abstraction Learning (ICAL), to improve in-context VLM agents from sub-optimal demonstrations and human feedback. Specifically, given a noisy demonstration for a task in a new domain, LLMs/VLMs are used to fix inefficient actions and annotate four types of cognitive abstractions. These abstractions are then refined by executing the trajectory in the environment, guided by natural language feedback from humans. We demonstrate that this method rapidly learns useful experience abstractions. Our ICAL agent improves on the state-of-the-art when tested on dialogue-based instruction following in household environments in TEACh, action anticipation in Ego4D, and multimodal autonomous web agents in VisualWebArena. In TEACh, we improve on the state-of-the-art by 12.6% in goal-condition success, outperforming LLM agents that use the raw visual demonstrations as in-context examples without abstraction learning. In VisualWebArena, we improve on the state-of-the-art by an absolute 8.4% and a relative 58.7% in overall task success, outperforming VLM agents that use hand-written examples. In Ego4D, we improve noun and action edit distance by 6.4 and 1.7, respectively, over few-shot GPT4V. Lastly, we find that weight fine-tuning and in-context abstraction learning complement each other, with their combination yielding the best performance.
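To make the two-phase procedure described in the abstract concrete, the following is a minimal sketch of the abstraction-learning loop: a VLM revises a noisy demonstration and annotates abstractions, then the result is refined by executing it in the environment with natural-language human feedback. The interfaces (vlm.revise, vlm.annotate, env.execute) and the human_feedback callback are hypothetical placeholders for illustration, not the authors' implementation.

```python
# Minimal sketch of the ICAL loop described in the abstract (assumed interfaces).

from dataclasses import dataclass, field
from typing import Any, Callable, Dict, List


@dataclass
class Example:
    """A stored in-context example: a revised trajectory plus its abstractions."""
    trajectory: List[Any]                                # (observation, action) steps
    abstractions: Dict[str, str] = field(default_factory=dict)


def ical_learn(noisy_demo: List[Any],
               vlm: Any,
               env: Any,
               human_feedback: Callable[[Any], str],
               max_rounds: int = 3) -> Example:
    """Turn one noisy demonstration into a refined in-context example."""
    # Phase 1: abstraction generation. The VLM fixes inefficient actions and
    # annotates the trajectory with language abstractions.
    trajectory = vlm.revise(noisy_demo)
    example = Example(trajectory, vlm.annotate(trajectory))

    # Phase 2: refinement. Execute the revised trajectory in the environment,
    # then use natural-language human feedback to correct it.
    for _ in range(max_rounds):
        success, rollout = env.execute(example.trajectory)
        if success:
            break
        feedback = human_feedback(rollout)               # short natural-language hint
        example.trajectory = vlm.revise(rollout, feedback)
        example.abstractions = vlm.annotate(example.trajectory)

    return example  # added to the agent's library of in-context examples
```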
Submission Number: 4