Federated Learning With $L_{0}$ Constraint Via Probabilistic Gates For Sparsity

ICLR 2026 Conference Submission19656 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Federated learning, linear models, sparsity, $L_{0}$ constraint, probabilistic gates, reparameterization, data heterogeneity, client participation heterogeneity, maximum entropy principle, free energy, negative ELBO
TL;DR: The $L_{0}$-constrained minimization objective can be derived from entropy maximization over stochastic gates, and a FedSGD-based algorithm is effective in sparsity recovery and statistical performance for linear models, compared to magnitude pruning.
Abstract: Federated Learning (FL) is a distributed machine learning setting in which multiple clients collaborate to train a model while maintaining data privacy. Inherent sparsity in the data, when left unaddressed, often results in overly dense models and poor generalizability under data and client participation heterogeneity. We propose FL with an $L_0$ constraint on the density of non-zero parameters, achieved through a reparameterization using probabilistic gates and their continuous relaxation, a technique originally proposed for sparsity in centralized machine learning. We show that the objective for $L_0$-constrained stochastic minimization arises naturally from an entropy maximization problem over the stochastic gates, and we propose an algorithm based on federated stochastic gradient descent (FedSGD) for distributed learning. We demonstrate that the target density ($\rho$) of parameters can be achieved in FL, under data and client participation heterogeneity, with minimal loss in statistical performance for linear models: $\emph{(i)}$ linear regression (LR), $\emph{(ii)}$ logistic regression (LG), $\emph{(iii)}$ softmax multi-class classification (MC), and $\emph{(iv)}$ multi-label classification with logistic units (MLC); we compare the results with a magnitude-pruning-based algorithm for sparsity in FL. Experiments on synthetic data with target density down to $\rho = 0.05$, and on publicly available datasets, including e2006-tfidf, RCV1, and MNIST, with target density down to $\rho = 0.005$, demonstrate that our approach consistently performs well in both sparsity recovery and statistical performance.
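For context, below is a minimal sketch of the probabilistic-gate (hard-concrete) reparameterization for $L_0$ sparsity that the abstract refers to, in the style of the centralized formulation it builds on (Louizos et al., 2018). The function names, hyperparameter values (beta, gamma, zeta), and the toy linear predictor are illustrative assumptions and not the authors' released code or their federated algorithm; the paper's FedSGD procedure and entropy-maximization derivation are not reproduced here.

```python
import numpy as np


def sample_hard_concrete(log_alpha, beta=2 / 3, gamma=-0.1, zeta=1.1, rng=None):
    """Sample relaxed binary gates z in [0, 1] via the hard-concrete
    (stretched binary concrete) reparameterization."""
    rng = rng or np.random.default_rng()
    u = rng.uniform(1e-6, 1 - 1e-6, size=np.shape(log_alpha))
    # Binary-concrete sample driven by uniform noise and gate logits.
    s = 1.0 / (1.0 + np.exp(-(np.log(u) - np.log(1 - u) + log_alpha) / beta))
    s_bar = s * (zeta - gamma) + gamma  # stretch to the interval (gamma, zeta)
    return np.clip(s_bar, 0.0, 1.0)     # hard-rectify so exact zeros/ones occur


def expected_density(log_alpha, beta=2 / 3, gamma=-0.1, zeta=1.1):
    """Expected fraction of non-zero gates, E[||z||_0] / d: the differentiable
    surrogate that would be compared against a target density rho."""
    return np.mean(1.0 / (1.0 + np.exp(-(log_alpha - beta * np.log(-gamma / zeta)))))


# Toy usage: a sparse linear predictor x @ (theta * z) with d = 20 features.
rng = np.random.default_rng(0)
d = 20
log_alpha = np.zeros(d)          # gate logits; learned jointly with theta in practice
theta = rng.normal(size=d)       # dense model weights
z = sample_hard_concrete(log_alpha, rng=rng)
x = rng.normal(size=d)
prediction = x @ (theta * z)
print("active fraction:", float((z > 0).mean()),
      "expected density:", float(expected_density(log_alpha)))
```

In a hypothetical federated variant of this sketch, each client would compute gradients of its local loss plus the expected-density penalty with respect to both theta and log_alpha, and a server would average those gradients FedSGD-style; that wiring is an assumption for illustration only.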
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 19656