Abstract: Reinforcement Learning (RL) algorithms often struggle with low training efficiency. A common approach to this challenge is to integrate model-based planning algorithms, such as Monte Carlo Tree Search (MCTS) or Value Iteration (VI), which plan over a model of the environment. However, VI faces a significant limitation: it requires iterating over a large tensor of dimensions $|\mathcal{S}|\times |\mathcal{A}| \times |\mathcal{S}|$, where $\mathcal{S}$ and $\mathcal{A}$ denote the state and action spaces, respectively. This process propagates value backward, updating the value of the preceding state $s_{t-1}$ from that of the succeeding state $s_t$, and is computationally intensive. To enhance the training efficiency of RL algorithms, we propose improving the efficiency of the value learning process. In deterministic environments with discrete state and action spaces, we observe that, on the sampled empirical state-transition graph, a non-branching sequence of transitions, termed a \textit{highway}, takes the agent directly from $s_0$ to $s_T$ without branching at intermediate states. On these non-branching highways, value updating can be collapsed into a single-step operation, eliminating the need for iterative, step-by-step updates. Building on this observation, we introduce a novel graph structure called the \textit{highway graph} to model state transitions. The highway graph compresses the transition model into a compact representation in which a single edge can encapsulate multiple state transitions, enabling value propagation across multiple time steps in one iteration. Integrating the highway graph into RL, as a model-based off-policy RL method, significantly accelerates training, particularly in its early stages. Experiments across four categories of environments demonstrate that our method learns significantly faster than established and state-of-the-art model-free and model-based RL algorithms (often by a factor of 10 to 150) while achieving equal or superior expected returns. Furthermore, a deep neural network agent trained with the highway graph exhibits improved generalization and reduced storage costs.
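To make the highway idea concrete, below is a minimal illustrative sketch in Python. It is not the authors' released implementation (see the Code link below), and the names `build_graph`, `compress_highways`, `plan_on_highway_graph`, and `GAMMA` are hypothetical. Assuming a deterministic environment with discrete states and actions, non-branching chains of sampled transitions are merged into single highway edges that carry the accumulated discounted reward, so each value backup propagates across many raw time steps at once.

```python
# Illustrative sketch only: hypothetical helper names, not the released highwayRL code.
from collections import defaultdict

GAMMA = 0.99  # discount factor (assumed)

def build_graph(transitions):
    """Build the empirical state-transition graph from sampled (s, a, r, s_next) tuples."""
    succ = defaultdict(dict)   # succ[s][a] = (r, s_next)
    pred = defaultdict(set)    # pred[s_next] = set of predecessor states
    for s, a, r, s_next in transitions:
        succ[s][a] = (r, s_next)        # deterministic: one successor per (s, a)
        pred[s_next].add(s)
    return succ, pred

def compress_highways(succ, pred):
    """Merge non-branching chains into single 'highway' edges with accumulated reward."""
    edges = defaultdict(dict)  # edges[s][a] = (total_discounted_reward, steps, end_state)
    for s in succ:
        for a, (r, s_next) in succ[s].items():
            total, steps, seen = r, 1, {s, s_next}
            # Follow the chain while the intermediate state neither branches nor merges.
            while len(succ.get(s_next, {})) == 1 and len(pred[s_next]) == 1:
                (r2, s_after), = succ[s_next].values()
                total += (GAMMA ** steps) * r2
                steps += 1
                s_next = s_after
                if s_next in seen:  # guard against non-branching cycles
                    break
                seen.add(s_next)
            edges[s][a] = (total, steps, s_next)
    return edges

def plan_on_highway_graph(edges, sweeps=50):
    """Value iteration on the compressed graph: one backup per highway edge
    covers all the raw transitions that the edge encapsulates."""
    V = defaultdict(float)
    for _ in range(sweeps):
        for s, actions in edges.items():
            V[s] = max(total + (GAMMA ** steps) * V[end]
                       for total, steps, end in actions.values())
    return dict(V)
```

Under these assumptions, a highway of length $k$ is backed up in a single operation instead of $k$ separate one-step updates, which is the source of the speed-up described in the abstract.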
Submission Length: Long submission (more than 12 pages of main content)
Changes Since Last Submission: N/A
Code: https://github.com/coodest/highwayRL
Assigned Action Editor: ~Steven_Stenberg_Hansen1
Submission Number: 2667