Abstract: This paper presents meta-sparsity, a framework for learning model sparsity, i.e., learning the parameter that controls the degree of sparsity, that allows deep neural networks (DNNs) to inherently generate optimal sparse shared structures in a multi-task learning (MTL) setting. Unlike traditional sparsity methods that rely heavily on manual hyperparameter tuning, the proposed approach learns sparsity patterns dynamically across a variety of tasks. Inspired by Model-Agnostic Meta-Learning (MAML), the emphasis is on learning shared and optimally sparse parameters in multi-task scenarios by applying a penalty-based, channel-wise structured sparsity during the meta-training phase. This method improves the model's efficacy by removing unnecessary parameters and enhances its ability to handle both seen and previously unseen tasks. The effectiveness of meta-sparsity is rigorously evaluated through extensive experiments on two datasets, NYU-v2 and CelebAMask-HQ, covering a broad spectrum of tasks ranging from pixel-level to image-level predictions. The results show that the proposed approach performs well across many tasks, indicating its potential as a versatile tool for creating efficient and adaptable sparse neural networks. This work therefore presents an approach to learning sparsity, contributing to ongoing efforts in the field of sparse neural networks and suggesting new research directions towards parsimonious models.
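To make the core idea concrete, the sketch below is a minimal, assumed PyTorch-style illustration (not the paper's implementation) of a channel-wise group-lasso penalty on shared convolutional weights, scaled by a learnable coefficient, inside one MAML-style meta-training step. The toy model, losses, and names such as `channel_group_lasso` and `one_meta_step` are hypothetical placeholders.

```python
# A minimal sketch (assumed PyTorch-style setup, NOT the paper's code): a channel-wise
# group-lasso penalty on shared conv weights, scaled by a learnable coefficient, inside
# one MAML-style meta-training step. Model, data, losses, and names are illustrative.
import torch
import torch.nn as nn
import torch.nn.functional as F
from torch.func import functional_call

def channel_group_lasso(model: nn.Module) -> torch.Tensor:
    """Sum of per-output-channel L2 norms of conv weights (channel-wise structured sparsity)."""
    penalty = torch.zeros(())
    for m in model.modules():
        if isinstance(m, nn.Conv2d):
            # weight: (out_ch, in_ch, kH, kW) -> one group per output channel
            penalty = penalty + m.weight.flatten(1).norm(dim=1).sum()
    return penalty

# Toy stand-in for the shared multi-task backbone.
model = nn.Sequential(nn.Conv2d(3, 8, 3, padding=1), nn.ReLU(), nn.Conv2d(8, 3, 3, padding=1))
raw_lam = torch.zeros((), requires_grad=True)   # softplus(raw_lam) keeps the learned lambda positive
meta_opt = torch.optim.Adam(list(model.parameters()) + [raw_lam], lr=1e-3)
inner_lr = 1e-2

def one_meta_step(tasks):
    """tasks: list of (support_x, support_y, query_x, query_y) tuples, one per task."""
    meta_opt.zero_grad()
    lam = F.softplus(raw_lam)
    meta_loss = torch.zeros(())
    for sx, sy, qx, qy in tasks:
        params = dict(model.named_parameters())
        # Inner adaptation on the support set; the penalty makes the adapted weights depend on lam.
        inner_loss = F.mse_loss(functional_call(model, params, (sx,)), sy) + lam * channel_group_lasso(model)
        grads = torch.autograd.grad(inner_loss, list(params.values()), create_graph=True)
        adapted = {n: p - inner_lr * g for (n, p), g in zip(params.items(), grads)}
        # Outer (query) loss evaluated with the adapted parameters.
        meta_loss = meta_loss + F.mse_loss(functional_call(model, adapted, (qx,)), qy)
    # Penalty on the shared parameters in the meta-objective: explicit gradient path to raw_lam.
    meta_loss = meta_loss + lam * channel_group_lasso(model)
    meta_loss.backward()    # gradients flow to both the shared parameters and raw_lam
    meta_opt.step()

# Toy usage: two "tasks" with random data of matching shapes.
tasks = [tuple(torch.randn(4, 3, 16, 16) for _ in range(4)) for _ in range(2)]
one_meta_step(tasks)
```

In this hedged formulation the penalty appears in both the inner (support) and outer (query) objectives, so the learnable coefficient receives an explicit gradient from the outer penalty term and an implicit one through the task-adapted parameters.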
Submission Length: Long submission (more than 12 pages of main content)
Previous TMLR Submission Url: https://openreview.net/forum?id=FpP8BwJiX0
Changes Since Last Submission: In preparing this revised submission, we carefully considered the feedback received from the reviewers of the previous version. Below, we summarize the key changes and additions made to the manuscript:
1. Detailed Explanation of Gradient Handling for $\lambda$: A new subsection (3.4.1) titled **"Behavior of $\lambda$ during Meta-optimization"** has been added to provide a clear and formal explanation of the gradient flow with respect to the sparsity-inducing hyperparameter $\lambda$. This section explains how both the explicit and implicit gradients are accounted for during optimization and references relevant work in bilevel optimization to support the formulation; a schematic form of this total derivative is sketched after this list.
2. Computational Cost Analysis: A new table (Table 6) and discussion compare training time and tuning complexity across all methods, showing that meta-sparsity has a comparable cost when tuning is considered. We emphasize that all methods were evaluated under the same sparsity level (~44%) and explain why performance, not just runtime, is a more meaningful comparison metric.
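For readers of this summary, the expression below is a schematic (notation illustrative, assuming a penalized inner problem consistent with item 1 above, not a verbatim excerpt from Section 3.4.1) of the two gradient paths through which $\lambda$ is updated: an explicit partial derivative of the meta-objective and an implicit term through the task-adapted parameters $\theta^{*}(\lambda)$.

$$
\frac{\mathrm{d}\,\mathcal{L}_{\text{meta}}\big(\theta^{*}(\lambda),\lambda\big)}{\mathrm{d}\lambda}
= \underbrace{\frac{\partial \mathcal{L}_{\text{meta}}}{\partial \lambda}}_{\text{explicit}}
\;+\;
\underbrace{\left(\frac{\partial \theta^{*}(\lambda)}{\partial \lambda}\right)^{\!\top}
\frac{\partial \mathcal{L}_{\text{meta}}}{\partial \theta^{*}}}_{\text{implicit, through the adapted parameters}}
$$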
The decision to reject the previous version was primarily based on (1) the lack of a detailed explanation regarding gradient handling for the sparsity-inducing hyperparameter $\lambda$ and (2) the absence of a computational cost analysis. In this updated manuscript, we have directly addressed both points through a new theoretical section and an extended empirical comparison.
Assigned Action Editor: ~Han_Zhao1
Submission Number: 4552