Sparsity-Driven Plasticity in Multi-Task Reinforcement Learning

TMLR Paper4962 Authors

26 May 2025 (modified: 22 Jun 2025) · Under review for TMLR · CC BY 4.0
Abstract: Plasticity loss, a diminishing capacity to adapt as training progresses, is a critical challenge in deep reinforcement learning. We examine this issue in multi-task reinforcement learning (MTRL), where greater representational flexibility is crucial for managing diverse and potentially conflicting task demands. This paper systematically explores how sparsification methods, particularly Gradual Magnitude Pruning (GMP) and Sparse Evolutionary Training (SET), enhance plasticity and consequently improve performance in MTRL agents. We evaluate these approaches across distinct MTRL architectures (shared backbone, Mixture of Experts, Mixture of Orthogonal Experts) on standardized MiniGrid benchmarks, comparing against dense baselines and a comprehensive range of alternative plasticity-inducing or regularization methods. Our results demonstrate that both GMP and SET effectively mitigate key indicators of plasticity degradation, such as neuron dormancy and representational collapse. These plasticity improvements often correlate with enhanced multi-task performance, with sparse agents frequently outperforming dense counterparts and achieving competitive results against explicit plasticity interventions. Our findings offer insights into the interplay between plasticity, network sparsity, and MTRL design, highlighting dynamic sparsification as a robust but context-sensitive tool for developing more adaptable MTRL systems.
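For context on the sparsification methods named in the abstract, below is a minimal Python sketch of the cubic sparsity schedule commonly paired with Gradual Magnitude Pruning (Zhu & Gupta, 2017) together with a magnitude-based mask update. The function names, defaults, and schedule parameters are illustrative assumptions, not the paper's exact configuration.

```python
import numpy as np

def gmp_target_sparsity(step, start_step, end_step, final_sparsity, initial_sparsity=0.0):
    """Cubic sparsity schedule commonly used with GMP (Zhu & Gupta, 2017):
    sparsity ramps from initial_sparsity to final_sparsity between
    start_step and end_step, then stays constant."""
    if step < start_step:
        return initial_sparsity
    if step >= end_step:
        return final_sparsity
    progress = (step - start_step) / (end_step - start_step)
    return final_sparsity + (initial_sparsity - final_sparsity) * (1.0 - progress) ** 3

def magnitude_mask(weights, sparsity):
    """Boolean mask keeping the largest-magnitude weights; roughly a
    `sparsity` fraction of entries (the smallest in absolute value) is pruned."""
    k = int(round(sparsity * weights.size))
    if k == 0:
        return np.ones(weights.shape, dtype=bool)
    threshold = np.partition(np.abs(weights).ravel(), k - 1)[k - 1]
    return np.abs(weights) > threshold
```

In a typical GMP loop the mask is recomputed from the current weights every few hundred updates and applied multiplicatively before each forward pass, so pruned weights stay at zero while the surviving weights continue to train.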
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We sincerely thank all reviewers for their constructive feedback. We have carefully addressed these concerns in the revised manuscript. This rebuttal highlights the major revisions:

## Major Methodological and Experimental Revisions

### Expanded Investigation of Sparsification Strategies
Responding to Reviewers (8sLK, GspR), we significantly expanded our study beyond Gradual Magnitude Pruning (GMP). The revised paper now systematically evaluates and compares GMP with Sparse Evolutionary Training (SET) (Mocanu et al., 2018) as another primary sparsification method with fixed sparsity. Furthermore, we include Lottery Ticket Hypothesis (LTH) style rewinding (Frankle et al., 2018) as a preliminary experiment. Notably, SET emerged as a strong, competitive alternative to GMP in several scenarios, providing a more robust empirical foundation for our conclusions about the interplay between sparsity and plasticity.

### Comprehensive Hyperparameter Tuning and Ablations
Concerns regarding hyperparameter selection and tuning for both our proposed methods and baselines were raised by all reviewers (8sLK, GspR, BvBA; e.g., “Baselines are not tuned,” “Hyperparameters seem arbitrary”). We have addressed this by:
* Introducing extensive new ablation studies for GMP and SET, examining sensitivity to pruning schedules, target sparsity levels, and SET-specific parameters.
* Performing thorough hyperparameter sweeps for all baseline methods, including ReDo (dormant thresholds, reset intervals), Reset (reset frequencies, number of resets), and Weight Decay (coefficients across several orders of magnitude), as detailed in Appendix E.

### Comparison with Normalization Layers and Enhanced Optimization Analysis
Responding to Reviewer 8sLK regarding the role of normalization layers, we now include Layer Normalization (LayerNorm) as a baseline across all architectures and benchmarks. We also analyze the interaction of GMP with Weight Decay. Additionally, we have expanded our analysis of optimizer interactions by including experiments combining GMP with PCGrad (Yu et al., 2020), revealing interesting, though not always additive, synergies.

### Ensuring Training Convergence
To address concerns about under-training (Reviewer GspR: “Training of all models to convergence...”), all experiments in the revised manuscript have been extended to 200 epochs (400,000 timesteps) to ensure convergence. All results and experiments in the revised paper were run with this configuration (including the hyperparameter tuning) and are based on 30 independent runs.

## Structural and Content Revisions

### Expanded and Restructured Related Work
Following requests (Reviewer 8sLK), the "Related Work" section (Section 2.2) has been substantially revised and expanded. It now offers a more comprehensive discussion of prior art in sparsity for RL (including single-task successes of GMP and SET), plasticity loss mechanisms and interventions (now including normalization layers), and MTRL techniques (including orthogonality constraints such as Chung et al. (2024) as context for MOORE).

### Major Manuscript Reorganization
The inclusion of SET, new baselines, and optimizer experiments led to a clearer structure: Section 4 details the core effects of GMP/SET versus dense baselines; Section 5 covers interactions with alternative plasticity methods and optimizers.
### Explicit Discussion of Limitations
As requested (Reviewers GspR, 8sLK), Section 6 ("Considerations of Sparsification Efficacy and Limitations") now includes a dedicated subsection detailing the practical limitations and considerations for GMP and SET (e.g., training overhead of GMP, implementation complexity of SET for certain layers, hyperparameter tuning effort, and architecture-dependent efficacy).

## Clarifications
Our primary focus is understanding the learning dynamics and plasticity benefits of unstructured sparsity in MTRL, which motivates our choice of GMP and SET for their representational flexibility. While validation on even more complex benchmarks (e.g., Meta-World, as suggested by Reviewer 8sLK) is a valuable future direction and we agree that broader validation would further reinforce our claims, we believe our current extensive experimentation across the three MiniGrid benchmarks and three distinct MTRL architectures (MTPPO, MoE, MOORE), with extensively tuned baselines and methods, already provides significant and novel insights into the utility and context-sensitivity of sparse methods in MTRL. Lastly, while this paper is primarily empirical in nature, we recognize the value of complementary theoretical insights and are also pursuing theoretical investigations into these mechanisms for future work.

We thank all reviewers again for their valuable feedback and hope that we have addressed most requested changes to make the paper acceptable for TMLR.
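As a pointer for the plasticity indicators referenced above (neuron dormancy) and the ReDo dormant-threshold sweeps, the sketch below computes a ReDo-style dormant-neuron ratio (Sokar et al., 2023). The default threshold and the array shape are assumptions for illustration, not the paper's exact setup.

```python
import numpy as np

def dormant_neuron_ratio(activations: np.ndarray, tau: float = 0.025) -> float:
    """ReDo-style dormancy metric (Sokar et al., 2023): neuron i counts as
    dormant when its mean absolute activation, normalized by the layer-wide
    average, falls at or below the threshold tau.

    activations: shape (batch, num_neurons), post-activation outputs of one layer.
    """
    per_neuron = np.abs(activations).mean(axis=0)     # mean |activation| per neuron
    score = per_neuron / (per_neuron.mean() + 1e-9)   # normalize by the layer average
    return float((score <= tau).mean())               # fraction of dormant neurons
```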
Assigned Action Editor: ~Marlos_C._Machado1
Submission Number: 4962