Abstract: In this paper, we investigate the fundamental question: To what extent are gradient-based neural architecture search (NAS) techniques applicable to RL? Using the original DARTS as a convenient baseline, we discover that the discrete architectures found can achieve up to 250% performance compared to manual architecture designs on both discrete and continuous action space environments, across off-policy and on-policy RL algorithms, at only 3x more computation time. Furthermore, through numerous ablation studies, we systematically verify that DARTS not only correctly upweights operations during its supernet phase, but also gradually improves the resulting discrete cells up to 30x more efficiently than random search, suggesting that DARTS is a surprisingly effective tool for improving architectures in RL.
Keywords: rl, darts, nas, autorl, differentiable, neural, architecture, search, reinforcement, learning, automated, supernet, discrete, impala, cnn, procgen, dm, control
One-sentence Summary: We investigate the fundamental question: To what extent are gradient-based neural architecture search (NAS) techniques applicable to RL?
Track: Main track
Reproducibility Checklist: Yes
Broader Impact Statement: Yes
Paper Availability And License: Yes
Code Of Conduct: Yes
Reviewers: Xingyou Song, firstname.lastname@example.org
Main Paper And Supplementary Material: pdf
Code And Dataset Supplement: zip
Steps For Environmental Footprint Reduction During Development: Our work investigates the application of DARTS to RL for efficient neural architecture search. DARTS is already a very efficient method, requiring only 3x more compute time in the RL domain (see Appendix A) and needing only minimal integration with pre-existing pipelines. As a whole, our paper therefore contributes to reducing the environmental footprint of architecture search in RL. In comparison, a naive blackbox optimization-based NAS approach in RL would be extremely wasteful, requiring on the order of hundreds of evaluations (each evaluation requiring a GPU). This inefficiency would be further exacerbated by RL's inherent noisiness, which is a common source of problems for general AutoRL techniques.
CPU Hours: 4000
GPU Hours: 270
TPU Hours: 0
Evaluation Metrics: Yes
Estimated CO2e Footprint: 24
Class Of Approaches: differentiable architecture search, reinforcement learning, neural architecture search
Datasets And Benchmarks: DM-Control, Procgen
Performance Metrics: RL Reward, Wallclock training time
Benchmark Performance: Multi-Task/Game Procgen w/ Rainbow and Micro Search Space: up to 250% performance over IMPALA-CNN baseline on selected games; Single-Task/Game Procgen w/ PPO and Macro Search Space: RL-DARTS is competitive with strong IMPALA-CNN baseline and beats fair random search significantly; Single-Task DM-Control w/ SAC and Micro Search Space: up to 30% improvement over 4-layer baseline CNN encoder; RL-DARTS vs Random Search w/ Rainbow and Micro Search Space: up to 30x search efficiency over random search
Benchmark Time: IMPALA-CNN Baselines: 72 GPU days; Multi-Task/Game Procgen w/ Rainbow and Micro Search Space: 30 GPU days; Single-Task/Game Procgen w/ PPO and Macro Search Space: 120 GPU days; Single-Task DM-Control w/ SAC and Micro Search Space: 6 GPU days; RL-DARTS vs Random Search w/ Rainbow and Micro Search Space: 50 GPU days