VisGym: Diverse, Customizable, Scalable Environments for Multimodal Agents

Published: 02 Mar 2026, Last Modified: 05 Mar 2026 · ICLR 2026 Workshop World Models · CC BY 4.0
Keywords: vision-language models, agents, environments
TL;DR: Progress in multimodal interactive intelligence requires rethinking how models explore, act, and learn from visual feedback.
Abstract: Modern Vision–Language Models (VLMs) remain poorly characterized in multi-step visual interactions, particularly in how they integrate perception, memory, and action over long horizons. We introduce VisGym, a gymnasium of 17 environments for evaluating and training VLMs. The suite spans symbolic puzzles, real-image understanding, navigation, and manipulation, and provides flexible controls over difficulty, input representation, planning horizon, and feedback. We also provide multi-step solvers that generate structured demonstrations, enabling supervised finetuning. Our evaluations show that all frontier models struggle in interactive settings, achieving low success rates in both the easy (26.8%) and hard (12.6%) configurations. Our experiments reveal notable limitations: models struggle to leverage long context effectively, performing worse with an unbounded history than with truncated windows. Furthermore, we find that several text-based symbolic tasks become substantially harder once rendered visually. However, explicit goal observations, textual feedback, and supervised finetuning on exploratory demonstrations in partially observable or unknown-dynamics settings yield consistent gains, and solver-generated multi-step demonstrations generalize across tasks and domains. Together, these results position VisGym as a principled and scalable foundation for diagnosing and training visually interactive agents.
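To make the interaction setting concrete, below is a minimal Python sketch of the kind of multi-step agent-environment loop and trajectory logging the abstract describes. The environment class, policy function, and observation format are illustrative assumptions, not VisGym's actual API.

```python
import random

class GridWorldEnv:
    """Toy stand-in for one VisGym-style environment (hypothetical, not the
    paper's API): the agent must walk from (0, 0) to a goal cell on a grid."""
    def __init__(self, size=4, max_steps=20):
        self.size, self.max_steps = size, max_steps

    def reset(self):
        self.pos, self.t = (0, 0), 0
        return self._obs()

    def step(self, action):
        dx, dy = {"up": (0, -1), "down": (0, 1),
                  "left": (-1, 0), "right": (1, 0)}[action]
        x = min(max(self.pos[0] + dx, 0), self.size - 1)
        y = min(max(self.pos[1] + dy, 0), self.size - 1)
        self.pos, self.t = (x, y), self.t + 1
        success = self.pos == (self.size - 1, self.size - 1)
        done = success or self.t >= self.max_steps
        return self._obs(), float(success), done

    def _obs(self):
        # A real environment would render an image; a text grid stands in here.
        return f"agent at {self.pos}, goal at {(self.size - 1, self.size - 1)}"

def policy(obs, history):
    # Placeholder for a VLM call mapping (observation, history) -> action.
    return random.choice(["up", "down", "left", "right"])

env = GridWorldEnv()
obs, history, trajectory = env.reset(), [], []
while True:
    action = policy(obs, history)
    next_obs, reward, done = env.step(action)
    # `trajectory` mimics the structured demonstrations the abstract says
    # solvers generate for supervised finetuning.
    trajectory.append({"obs": obs, "action": action, "reward": reward})
    history.append((obs, action))
    obs = next_obs
    if done:
        break
print(f"episode ended after {len(trajectory)} steps, success={reward == 1.0}")
```

Swapping the random policy for a VLM call, and the toy grid for one of the 17 environments, recovers the evaluation and demonstration-collection setup the abstract outlines.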
Submission Number: 8