Characterizing Evolution in Expectation-Maximization Estimates for Overspecified Mixed Linear Regression

Published: 14 Jan 2026, Last Modified: 14 Jan 2026
Accepted by TMLR
License: CC BY 4.0
Abstract: Estimating data distributions using parametric families is crucial in many learning setups, serving both as a standalone problem and as an intermediate objective for downstream tasks. Mixture models in particular have attracted significant attention due to their practical effectiveness and comprehensive theoretical foundations. A persistent challenge is model misspecification, which occurs when the fitted model has more mixture components than the data distribution. In this paper, we develop a theoretical understanding of the Expectation-Maximization (EM) algorithm's behavior under this targeted form of model misspecification, namely overspecified two-component Mixed Linear Regression (2MLR) with unknown $d$-dimensional regression parameters and mixing weights. In Theorem 5.1, at the population level, with an unbalanced initial guess for the mixing weights we establish linear convergence of the regression parameters in $\mathcal{O}(\log (1/\epsilon))$ steps, whereas with a balanced initial guess we observe sublinear convergence in $\mathcal{O}(\epsilon^{-2})$ steps to reach $\epsilon$-accuracy in Euclidean distance. In Theorem 6.1, at the finite-sample level, for mixtures with sufficiently unbalanced fixed mixing weights we demonstrate a statistical accuracy of $\mathcal{O}((d/n)^{1/2})$, whereas for sufficiently balanced fixed mixing weights the accuracy is $\mathcal{O}((d/n)^{1/4})$ given $n$ data samples. Furthermore, we underscore the connection between our population-level and finite-sample-level results: setting the target accuracy $\epsilon$ in Theorem 5.1 to match the statistical accuracy of Theorem 6.1, namely $\epsilon = \mathcal{O}((d/n)^{1/2})$ for sufficiently unbalanced and $\epsilon = \mathcal{O}((d/n)^{1/4})$ for sufficiently balanced fixed mixing weights, yields finite-sample iteration complexity bounds of $\mathcal{O}(\log (1/\epsilon))=\mathcal{O}(\log (n/d))$ and $\mathcal{O}(\epsilon^{-2})=\mathcal{O}((n/d)^{1/2})$ for sufficiently unbalanced and balanced initial mixing weights, respectively. We further extend our analysis in the overspecified setting to the finite low-SNR regime, providing approximate dynamic equations that characterize the EM algorithm's behavior in this challenging case. Our findings not only expand the scope of theoretical convergence guarantees but also improve the bounds on statistical error, time complexity, and sample complexity, and rigorously characterize the evolution of the EM estimates.
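For readers who want a concrete picture of the procedure the abstract analyzes, below is a minimal sample-level sketch of the EM iteration for two-component mixed linear regression with unknown regression parameters and mixing weight, assuming Gaussian noise with known variance $\sigma^2$. The function names, the synthetic-data setup, and the small ridge term are illustrative assumptions and not the authors' implementation; the linked repository contains the code accompanying the paper.

```python
import numpy as np

def em_step(X, y, beta1, beta2, pi, sigma2):
    """One EM iteration for a two-component mixed linear regression.

    E-step: posterior responsibility of component 1 for each sample.
    M-step: weighted least-squares update for each regression vector
    and an update of the mixing weight.
    """
    # E-step: Gaussian log-weights of the residuals under each component
    r1 = y - X @ beta1
    r2 = y - X @ beta2
    log_w1 = np.log(pi) - r1 ** 2 / (2.0 * sigma2)
    log_w2 = np.log(1.0 - pi) - r2 ** 2 / (2.0 * sigma2)
    # Normalize responsibilities in a numerically stable way
    m = np.maximum(log_w1, log_w2)
    w1, w2 = np.exp(log_w1 - m), np.exp(log_w2 - m)
    resp = w1 / (w1 + w2)  # responsibility of component 1 per sample

    # M-step: weighted least squares for each component; the tiny ridge
    # term is purely a numerical safeguard in this sketch
    def wls(w):
        XtW = X.T * w  # X^T diag(w), shape (d, n)
        A = XtW @ X + 1e-12 * np.eye(X.shape[1])
        return np.linalg.solve(A, XtW @ y)

    return wls(resp), wls(1.0 - resp), resp.mean()


# Usage sketch: data drawn from a single regression vector, so the fitted
# two-component model is overspecified.
rng = np.random.default_rng(0)
n, d, sigma2 = 5000, 5, 1.0
X = rng.standard_normal((n, d))
beta_star = rng.standard_normal(d)
y = X @ beta_star + np.sqrt(sigma2) * rng.standard_normal(n)

beta1 = beta_star + rng.standard_normal(d)
beta2 = beta_star + rng.standard_normal(d)
pi = 0.9  # unbalanced initial mixing weight; try 0.5 for the balanced case
for _ in range(100):
    beta1, beta2, pi = em_step(X, y, beta1, beta2, pi, sigma2)

print(np.linalg.norm(beta1 - beta_star), np.linalg.norm(beta2 - beta_star), pi)
```

Under this illustrative setup, an unbalanced initialization such as `pi = 0.9` typically drives both regression estimates toward the single true vector far more quickly than a balanced initialization `pi = 0.5`, which is the qualitative contrast formalized in Theorems 5.1 and 6.1. Note that Theorem 6.1 is stated for fixed mixing weights, so for that regime one would hold `pi` fixed rather than update it in the M-step.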
Submission Type: Long submission (more than 12 pages of main content)
Changes Since Last Submission: We have significantly expanded our motivation section to demonstrate the broader applicability of our work. Specifically, we revised the motivating example of Phase Retrieval to include a detailed theoretical analysis and refined the mathematical formulation of the Haplotype Assembly example. We also introduced two additional motivating examples on pages 5–7, on learning overparameterized models and on Mixture of Experts, to provide wider context. To further improve the paper's accessibility, we added intuitive explanations of our main results, Theorem 5.1 and Theorem 6.1, on pages 10 and 13 respectively. Finally, we corrected typographical errors in the remark on the proof of the Approximate Dynamic Equations (Proposition 4.4) and fixed the indices from $\alpha^t, \beta^t$ to $\alpha^{t+1}, \beta^{t+1}$ in Appendix B, Subsection B.3.
Code: https://github.com/dassein/em_overspecified_mlr
Supplementary Material: zip
Assigned Action Editor: ~Arya_Mazumdar1
Submission Number: 5628