Keywords: Reinforcement learning, neural network, network architecture, benchmark
Abstract: Reinforcement learning (RL) has advanced significantly through the application of diverse neural network architectures. In this study, we systematically evaluate the performance of several architectures on RL tasks using a widely adopted policy gradient algorithm, Proximal Policy Optimization (PPO). The architectures considered include Long Short-Term Memory (LSTM), Multi-Layer Perceptron (MLP), Mamba/Mamba-2, Transformer-XL, Gated Transformer-XL, and Gated Recurrent Unit (GRU). Through comprehensive experiments spanning continuous control, discrete decision-making, and memory-based environments, we uncover architecture-specific strengths and limitations. Our results show that: (1) MLPs excel in fully observable continuous control tasks, offering an effective balance between performance and efficiency; (2) recurrent architectures such as LSTM and GRU provide robust performance in partially observable settings with moderate memory demands; (3) Mamba models achieve up to 4.5× higher throughput than LSTM and 3.9× higher than GRU, while maintaining comparable performance; and (4) only Transformer-XL, Gated Transformer-XL, and Mamba-2 succeed on the most memory-intensive tasks, with Mamba-2 requiring 8× less memory than Transformer-XL. These findings highlight the trade-offs among architectures and provide actionable insights for selecting appropriate models in PPO-based RL under different task characteristics and computational constraints.
Primary Area: datasets and benchmarks
Submission Number: 13370