Enhancing Vision-Language Model Training with Reinforcement Learning in Synthetic Worlds for Real-World Success

Published: 19 Dec 2025, Last Modified: 05 Jan 2026 · AAMAS 2026 Full · CC BY 4.0
Keywords: vision-language agents, reinforcement learning
Abstract: Interactive multimodal agents must turn raw visual observations into reliable sequences of structured, language-conditioned actions, yet training such competence under long horizons and sparse feedback remains brittle. We present VL-DAC, a lightweight reinforcement-learning recipe for vision-language agents that is hyperparameter-robust and easy to deploy. VL-DAC performs PPO updates at the token level for actions while learning a step-level value function. This decoupling removes unstable weighting terms and yields faster, more reliable convergence without introducing extra tuning knobs. Training a single VLM in one cheap synthetic environment at a time (MiniWorld, Gym-Cards, ALFWorld, or WebShop) produces policies that transfer beyond their training simulators: +50\% relative on BALROG (agentic control), +5\% relative on the hardest split of VSI-Bench (spatial planning), and +2\% on VisualWebBench (web navigation), with no loss in general image understanding. Together, these results show that a simple, stable RL procedure can train vision–language agents entirely in simulation while delivering measurable gains on agentic, spatial-reasoning, and web-navigation benchmarks.
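The abstract's central technical idea is decoupling the policy update from the value estimate: PPO's clipped objective is applied per action token, while the critic produces one value (and hence one advantage) per environment step that is shared by all tokens of that step. The following is a minimal sketch of that decoupling, not the paper's implementation; all names (`token_logp_new`, `step_advantages`, tensor shapes, the GAE-style targets) are illustrative assumptions.

```python
# Sketch: token-level PPO policy loss + step-level value loss.
# Assumes T environment steps, each emitting up to L action tokens.
import torch

def token_level_ppo_loss(
    token_logp_new: torch.Tensor,   # (T, L) log-probs of action tokens under current policy
    token_logp_old: torch.Tensor,   # (T, L) log-probs under the behavior policy
    token_mask: torch.Tensor,       # (T, L) 1 for real action tokens, 0 for padding
    step_advantages: torch.Tensor,  # (T,)  one advantage per environment step
    clip_eps: float = 0.2,
) -> torch.Tensor:
    """Clipped PPO surrogate where every token of step t shares advantage A_t."""
    ratio = torch.exp(token_logp_new - token_logp_old)                 # (T, L)
    adv = step_advantages.unsqueeze(-1)                                # broadcast to tokens
    unclipped = ratio * adv
    clipped = torch.clamp(ratio, 1 - clip_eps, 1 + clip_eps) * adv
    per_token = -torch.minimum(unclipped, clipped) * token_mask
    return per_token.sum() / token_mask.sum().clamp(min=1)

def step_value_loss(
    step_values: torch.Tensor,      # (T,) critic output, one value per environment step
    step_returns: torch.Tensor,     # (T,) bootstrapped return targets (e.g., from GAE)
) -> torch.Tensor:
    """Step-level critic regression, independent of how many tokens each step used."""
    return torch.nn.functional.mse_loss(step_values, step_returns)

if __name__ == "__main__":
    T, L = 4, 6  # 4 environment steps, up to 6 action tokens each
    torch.manual_seed(0)
    logp_old = torch.log(torch.rand(T, L))
    logp_new = logp_old + 0.05 * torch.randn(T, L)
    mask = torch.ones(T, L)
    adv = torch.randn(T)
    print("policy loss:", token_level_ppo_loss(logp_new, logp_old, mask, adv).item())
    print("value  loss:", step_value_loss(torch.randn(T), torch.randn(T)).item())
```

Because the advantage is computed once per step rather than per token, no token-length-dependent weighting term enters the objective, which is the stability property the abstract attributes to the decoupling.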
Area: Generative and Agentic AI (GAAI)
Generative AI: I acknowledge that I have read and will follow this policy.
Submission Number: 1238