Sliding Puzzles Gym: A Scalable Benchmark for State Representation in Visual Reinforcement Learning

Published: 01 May 2025, Last Modified: 18 Jun 2025
Venue: ICML 2025 poster
License: CC BY 4.0
TL;DR: SPGym extends the 8-tile puzzle into a benchmark for RL agents that scales representation learning complexity while keeping environment dynamics fixed, revealing opportunities to advance representation learning for decision-making.
Abstract: Effective visual representation learning is crucial for reinforcement learning (RL) agents to extract task-relevant information from raw sensory inputs and generalize across diverse environments. However, existing RL benchmarks lack the ability to systematically evaluate representation learning capabilities in isolation from other learning challenges. To address this gap, we introduce the Sliding Puzzles Gym (SPGym), a novel benchmark that transforms the classic 8-tile puzzle into a visual RL task with images drawn from arbitrarily large datasets. SPGym's key innovation lies in its ability to precisely control representation learning complexity through adjustable grid sizes and image pools, while maintaining fixed environment dynamics, observation, and action spaces. This design enables researchers to isolate and scale the visual representation challenge independently of other learning components. Through extensive experiments with model-free and model-based RL algorithms, we uncover fundamental limitations in current methods' ability to handle visual diversity. As we increase the pool of possible images, all algorithms exhibit in- and out-of-distribution performance degradation, with sophisticated representation learning techniques often underperforming simpler approaches like data augmentation. These findings highlight critical gaps in visual representation learning for RL and establish SPGym as a valuable tool for driving progress in robust, generalizable decision-making systems.
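To make the design concrete, here is a minimal sketch of the idea described in the abstract: a sliding puzzle whose tiles are patches of an image drawn from a pool, so that enlarging the pool changes what the agent sees but not the dynamics, observation shape, or action set. It is not SPGym's implementation or API. The names `TinyImagePuzzle` and `make_pool`, the grayscale nested-list images, the random stand-in images, and the sparse reward are all assumptions made for illustration.

```python
import random

# Blank movement per action: up, down, left, right (identical for every image pool).
MOVES = [(-1, 0), (1, 0), (0, -1), (0, 1)]


class TinyImagePuzzle:
    """Hypothetical, simplified stand-in for an image-based sliding puzzle."""

    def __init__(self, image_pool, grid=3, seed=0):
        self.pool = image_pool      # list of images; each image is a list of pixel rows
        self.grid = grid            # grid x grid cells; the last tile index is the blank
        self.rng = random.Random(seed)

    def reset(self, shuffle_steps=50):
        n = self.grid
        self.image = self.rng.choice(self.pool)   # only the visuals depend on the pool
        self.board = [[r * n + c for c in range(n)] for r in range(n)]
        self.blank = (n - 1, n - 1)
        # Shuffle with random moves from the solved state, so the puzzle stays solvable.
        for _ in range(shuffle_steps):
            self._slide(self.rng.randrange(4))
        return self.render()

    def _slide(self, action):
        dr, dc = MOVES[action]
        r, c = self.blank
        nr, nc = r + dr, c + dc
        if not (0 <= nr < self.grid and 0 <= nc < self.grid):
            return False            # illegal move: the board is left unchanged
        self.board[r][c], self.board[nr][nc] = self.board[nr][nc], self.board[r][c]
        self.blank = (nr, nc)
        return True

    def solved(self):
        n = self.grid
        return all(self.board[r][c] == r * n + c for r in range(n) for c in range(n))

    def step(self, action):
        self._slide(action)
        done = self.solved()
        reward = 1.0 if done else 0.0   # placeholder reward, assumed for this sketch
        return self.render(), reward, done

    def render(self):
        """Assemble the observation by pasting each tile's image patch into its cell."""
        n = self.grid
        th = len(self.image) // n
        tw = len(self.image[0]) // n
        obs = [[0] * (tw * n) for _ in range(th * n)]
        for r in range(n):
            for c in range(n):
                tile = self.board[r][c]
                if tile == n * n - 1:
                    continue            # the blank cell stays zero
                sr, sc = divmod(tile, n)
                for y in range(th):
                    src_row = self.image[sr * th + y]
                    obs[r * th + y][c * tw:(c + 1) * tw] = src_row[sc * tw:(sc + 1) * tw]
        return obs


def make_pool(size, side=6, seed=0):
    """Random grayscale images standing in for a real image dataset."""
    rng = random.Random(seed)
    return [[[rng.randrange(256) for _ in range(side)] for _ in range(side)]
            for _ in range(size)]
```

In this sketch, `TinyImagePuzzle(make_pool(1))` and `TinyImagePuzzle(make_pool(20))` share the same four actions, transition rule, and observation size; only the number of distinct images the agent must learn to interpret differs, which is the axis the benchmark scales.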
Lay Summary: Teaching AI systems to interpret visual information is crucial for applications ranging from robotic navigation to game-playing agents. However, existing tests for these AI systems mix together visual understanding with other skills, making it difficult to tell whether poor performance comes from not seeing properly or from bad decision-making. To address this challenge, we created a new testing framework called Sliding Puzzles Gym, based on the classic sliding tile puzzle game. Instead of numbered tiles, our puzzles use pieces of real photographs. We can control how challenging the visual task is by changing how many different images the AI sees during training, while keeping everything else about the puzzle exactly the same. This lets researchers isolate and study just the visual learning component of these decision-making agents. Our findings reveal a critical limitation in current AI systems: as we increased the variety of images, all tested methods struggled more and performed worse, even when evaluated on previously seen images. Surprisingly, simple techniques like slightly modifying training images often outperformed sophisticated methods. These results expose fundamental gaps in how current AI systems process visual information and provide researchers with a powerful tool to develop more robust and reliable visual AI systems.
Link To Code: https://github.com/bryanoliveira/sliding-puzzles-gym
Primary Area: Reinforcement Learning->Deep RL
Keywords: reinforcement learning, representation learning, benchmark
Submission Number: 12648