Abstract: This paper investigates the problem of certifying optimality for sparse generalized linear models (GLMs), where sparsity is enforced through an $\ell_0$ cardinality constraint. While branch-and-bound (BnB) frameworks can certify optimality by pruning nodes using dual bounds, existing methods for computing these bounds are either computationally intensive or exhibit slow convergence, limiting their scalability to large-scale problems. To address this challenge, we propose a first-order proximal gradient algorithm designed to solve the perspective relaxation of the problem within a BnB framework. Specifically, we formulate the relaxed problem as a composite optimization problem and demonstrate that the proximal operator of the non-smooth component can be computed exactly in log-linear time, eliminating the need to solve a computationally expensive second-order cone program. Furthermore, we introduce a simple restart strategy that enhances convergence speed while maintaining low per-iteration complexity. Extensive experiments on synthetic and real-world datasets show that our approach significantly accelerates dual bound computations and is highly effective in providing optimality certificates for large-scale problems.
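To illustrate the kind of composite proximal gradient scheme with an adaptive restart that the abstract describes, here is a minimal sketch of a generic FISTA-style method with a gradient-based restart test. It is not the paper's actual algorithm: the quadratic least-squares loss, the $\ell_1$ soft-thresholding prox, and the fixed step size are illustrative assumptions standing in for the GLM loss and the perspective-relaxation prox.

```python
import numpy as np

def soft_threshold(v, tau):
    # Illustrative stand-in for the non-smooth prox; the paper's actual prox
    # (for the perspective relaxation) is different but also cheap to evaluate.
    return np.sign(v) * np.maximum(np.abs(v) - tau, 0.0)

def accelerated_prox_grad(A, b, lam, step, n_iter=500):
    """FISTA-style accelerated proximal gradient with adaptive restart for
    min_x 0.5*||Ax - b||^2 + lam*||x||_1 (an illustrative composite problem,
    not the paper's formulation). `step` should be at most 1 / ||A||_2^2."""
    n = A.shape[1]
    x = np.zeros(n)
    y = x.copy()          # extrapolated point
    t = 1.0               # momentum parameter
    for _ in range(n_iter):
        grad = A.T @ (A @ y - b)   # smooth gradient: matrix-vector products only
        x_new = soft_threshold(y - step * grad, step * lam)
        # Gradient-based adaptive restart: reset momentum when the proximal
        # update opposes the extrapolation direction.
        if np.dot(y - x_new, x_new - x) > 0:
            t = 1.0
            y = x_new
        else:
            t_new = 0.5 * (1.0 + np.sqrt(1.0 + 4.0 * t * t))
            y = x_new + ((t - 1.0) / t_new) * (x_new - x)
            t = t_new
        x = x_new
    return x
```

Each iteration costs only two matrix-vector products plus an elementwise prox, which is what makes this family of methods attractive inside a BnB loop where relaxations must be solved at many nodes.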
Lay Summary: While continuous optimization has made great strides thanks to fast algorithms and GPU acceleration, discrete optimization — where some variables must take integer values — has lagged behind. In this work, we ask: can new tools from continuous optimization and modern hardware help scale discrete problems to much larger sizes? We focus on sparse generalized linear models (GLMs), where the goal is to find simple, accurate models using only "$k$" predictive features. This constraint makes the problem discrete and challenging, especially when optimality must be guaranteed.
Our key contribution is a fast, scalable algorithm for computing tight lower bounds — a critical step in the standard framework for these problems, *i.e.*, the branch-and-bound algorithm. Surprisingly, we discovered hidden mathematical structure in the model that allows us to design a first-order optimization method with three rare properties: (1) each step is very simple and involves only matrix-vector multiplications (making it GPU-friendly), (2) the method converges at the fastest possible rate for this class of algorithms (a linear convergence rate), and (3) each step is computationally cheap in practice.
Our work makes it much faster to solve these problems and opens new possibilities for interpretable machine learning in high-stakes settings. It enables researchers to build sparse models that are optimal for large-scale datasets, potentially allowing us to identify key biomarkers in medical diagnosis or discover differential equations from empirical data in physics.
Primary Area: Optimization->Large Scale, Parallel and Distributed
Keywords: generalized linear models, sparse learning, proximal method, mixed-integer programming
Submission Number: 14502