Generative Proto-Sequence: Sequence-Level Decision Making for Long-Horizon Reinforcement Learning

TMLR Paper5634 Authors

14 Aug 2025 (modified: 25 Oct 2025), Decision pending for TMLR, CC BY 4.0
Abstract: Deep reinforcement learning (DRL) methods often struggle in environments with large state spaces, long action horizons, and sparse rewards, where effective exploration and credit assignment are critical. We introduce Generative Proto-Sequence (GPS), a novel generative DRL approach that produces variable-length discrete action sequences. By generating an entire action sequence in a single decision rather than selecting an individual action at each timestep, GPS reduces the temporal decision bottleneck that impedes learning in long-horizon tasks. This sequence-level abstraction provides three key advantages: (1) it facilitates more effective credit assignment by directly connecting state observations with the outcomes of complete behavioral patterns; (2) it improves exploration of the state space by committing to coherent multi-step strategies; and (3) it promotes better generalization by learning macro-behaviors that transfer across similar situations rather than memorizing state-specific responses. Extensive evaluations on mazes of varying sizes and complexities demonstrate that GPS consistently outperforms leading action repetition and temporal methods, converging faster and achieving higher success rates across all environments.
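The difference between sequence-level and per-step decision making described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: the learned generator is replaced by a hand-written lookup table (`table_policy`), and the grid layout, the `MOVES` action set, the wall handling, and the early stop on reaching the goal are all assumptions made for illustration only.

```python
from typing import Callable, Dict, List, Tuple

Pos = Tuple[int, int]

# Hypothetical discrete action set for a grid maze: name -> (d_row, d_col).
MOVES: Dict[str, Pos] = {"U": (-1, 0), "D": (1, 0), "L": (0, -1), "R": (0, 1)}


def run_sequence(grid: List[str], start: Pos, goal: Pos,
                 actions: List[str]) -> Tuple[Pos, int, bool]:
    """Execute one action sequence; return (final_pos, steps_taken, reached_goal)."""
    pos, steps = start, 0
    for a in actions:
        dr, dc = MOVES[a]
        r, c = pos[0] + dr, pos[1] + dc
        # Assumption: moves into walls or off the grid leave the agent in place.
        if 0 <= r < len(grid) and 0 <= c < len(grid[0]) and grid[r][c] != "#":
            pos = (r, c)
        steps += 1
        # Assumption: the rest of the sequence is dropped once the goal is hit.
        if pos == goal:
            return pos, steps, True
    return pos, steps, False


def rollout(grid: List[str], start: Pos, goal: Pos,
            policy: Callable[[Pos], List[str]],
            max_decisions: int = 10) -> Tuple[int, int, bool]:
    """Query the policy once per sequence; return (decisions, env_steps, success)."""
    pos, decisions, total_steps = start, 0, 0
    while decisions < max_decisions:
        seq = policy(pos)  # one decision yields a variable-length sequence
        decisions += 1
        pos, steps, done = run_sequence(grid, pos, goal, seq)
        total_steps += steps
        if done:
            return decisions, total_steps, True
    return decisions, total_steps, False


# Toy 3x3 maze: '#' is a wall, '.' is free space.
GRID = ["..#",
        "..#",
        "..."]

# Stand-in for a learned generator: a fixed state -> sequence table.
TABLE: Dict[Pos, List[str]] = {(0, 0): ["D", "D"], (2, 0): ["R", "R", "R"]}


def table_policy(pos: Pos) -> List[str]:
    return TABLE[pos]


result = rollout(GRID, (0, 0), (2, 2), table_policy)
print(result)  # (2, 4, True): 2 decisions cover 4 environment steps
```

In this toy run the agent reaches the goal with two decisions, whereas a per-step policy following the same path would make four, one per environment step.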
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We have addressed all the concerns and questions raised by the reviewers. Main changes include: 1) The use of multiple seeds in our primary evaluation and the reporting of mean and standard deviation. 2) The addition of new experiments in different settings: stochastic ("sticky actions") and partially observable. 3) Additional analysis of our proposed approach's capability to "self-correct". 4) Additional explanations of the rationale behind the training procedures of our proposed approach. Detailed responses were provided to each reviewer. We would like to thank the reviewers for their time, effort, and constructive suggestions. Best regards, The Authors
Assigned Action Editor: ~Mirco_Mutti1
Submission Number: 5634