Abstract: Online fine-tuning of vision-language model (VLM) agents with reinforcement learning (RL) has shown promise for equipping agents with multi-step, goal-oriented capabilities in dynamic environments. However, the agents' open-ended textual action space and the non-end-to-end nature of action generation present significant challenges for effective online exploration in RL, e.g., an explosion of the exploration space. We propose a novel online fine-tuning method, Counterfactual Soft Reinforcement Learning (CoSo), better suited to the textual output space of VLM agents. Compared to prior methods that assign uniform uncertainty to all tokens, CoSo leverages counterfactual reasoning to dynamically assess the causal influence of individual tokens on post-processed actions. By prioritizing the exploration of action-critical tokens while reducing the impact of semantically redundant or low-impact tokens, CoSo enables a more targeted and efficient online rollout process. We provide a theoretical analysis proving CoSo's convergence and policy improvement guarantees, together with extensive empirical evaluations supporting its effectiveness. Our results across a diverse set of agent tasks, including Android device control, card gaming, and embodied AI, highlight its remarkable ability to enhance exploration efficiency and deliver consistent performance gains. The code is available at https://github.com/langfengQ/CoSo.
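The following is a minimal, hypothetical sketch of the core idea described in the abstract, not the released implementation: estimate each token's causal influence on the post-processed action by counterfactual resampling, then concentrate the entropy (exploration) term of a soft-RL loss on those action-critical tokens. All names here (`counterfactual_token_weights`, `parse_action`, `weighted_soft_rl_loss`, the loss form and hyperparameters) are illustrative assumptions.

```python
import torch

def counterfactual_token_weights(tokens, logits, parse_action, n_samples=8):
    """Estimate how often resampling a single token changes the parsed action.

    tokens: LongTensor [seq_len] of sampled token ids
    logits: FloatTensor [seq_len, vocab] from the VLM policy
    parse_action: callable mapping a token sequence to a post-processed action
    """
    base_action = parse_action(tokens)
    probs = torch.softmax(logits, dim=-1)
    weights = torch.zeros(tokens.shape[0])
    for i in range(tokens.shape[0]):
        flips = 0
        for _ in range(n_samples):
            alt = tokens.clone()
            # Counterfactual: resample only position i from the policy's distribution.
            alt[i] = torch.multinomial(probs[i], 1).item()
            if parse_action(alt) != base_action:
                flips += 1
        # ~1.0 => action-critical token, ~0.0 => semantically redundant token
        weights[i] = flips / n_samples
    return weights

def weighted_soft_rl_loss(log_probs, weights, advantage, alpha=0.01):
    """Policy-gradient term plus an entropy bonus focused on action-critical tokens.

    log_probs: FloatTensor [seq_len], log-prob of each sampled token
    weights:   FloatTensor [seq_len], counterfactual influence weights
    advantage: scalar return/advantage estimate for the whole action
    """
    policy_grad = -(advantage * log_probs.sum())
    # Sample-based entropy surrogate, weighted so exploration pressure falls
    # mainly on tokens whose change would alter the executed action.
    entropy_bonus = -(weights.detach() * log_probs).sum()
    return policy_grad - alpha * entropy_bonus
```

In an actual training loop, such weights would be computed from rollout samples and plugged into the token-level entropy regularizer; see the paper and the linked repository for the exact formulation used by CoSo.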
Lay Summary: Large vision-language model (VLM) agents can already read screens and describe what to do, but teaching them to take useful actions in the real world—like operating a smartphone, playing a game, or controlling a robot—is still a major challenge. One key reason is that their decisions are made in text, which is far more complex than simple numeric commands. Much of the trial-and-error ends up exploring meaningless parts of the text that don’t affect the actual action. To solve this, we developed a new method called Counterfactual Soft Reinforcement Learning (CoSo). It figures out which tokens in the generated text actually influence the agent’s action, and focuses learning on just those important tokens. This allows the system to learn much faster and make smarter decisions with less guesswork. We tested CoSo on tasks like controlling Android devices, playing card games, and navigating virtual environments. In all cases, it made the agents more effective and efficient compared to older training methods.
Link To Code: https://github.com/langfengQ/CoSo
Primary Area: Deep Learning->Foundation Models
Keywords: vision-language model, agent, reinforcement learning, online fine-tuning, counterfactual
Submission Number: 15959