Keywords: Mean Square Error, Local Optimum, Linear Algebra
TL;DR: We provide a theoretical analysis of local optima in MSE loss and propose a principled optimizer that avoids such traps, offering both mathematical guarantees and improved training performance.
Abstract: Deep learning models are trained by minimizing loss functions such as mean squared error (MSE) or cross-entropy, but these objectives are highly non-convex. As a result, optimization often encounters local optima, saddle points, or sharp valleys that hinder convergence and generalization. Although many heuristic approaches, such as momentum and Adam, help mitigate these issues, they offer limited theoretical understanding.
In this work, we present a theoretical study of the optimization of MSE. We first provide a mathematical characterization of local optima under MSE and contrast them with those of cross-entropy, identifying when and how they arise. Building on this analysis, we introduce a modified optimization algorithm that explicitly accounts for these properties. Unlike heuristic methods, our approach offers theoretical guarantees for avoiding spurious local traps.
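As a minimal illustration of the contrast the abstract describes (our own toy construction, not the paper's characterization): for a single sigmoid unit on one training example, the MSE loss has regions of negative curvature, whereas the cross-entropy loss is convex in the weight. The sketch below checks this numerically.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

# Toy setting: one sigmoid unit, one example with input x = 1, target y = 1.
def mse_loss(w):
    return (sigmoid(w) - 1.0) ** 2

def ce_loss(w):
    return -math.log(sigmoid(w))

def second_difference(f, w, h=1e-3):
    # Central-difference estimate of f''(w).
    return (f(w + h) - 2.0 * f(w) + f(w - h)) / h ** 2

# At w = -2 the sigmoid output is about 0.12 < 1/3, where the MSE
# curvature -2*s*(1-s)^2*(1-3s) is negative; cross-entropy curvature
# s*(1-s) is positive for every w.
print(second_difference(mse_loss, -2.0) < 0)  # MSE is concave here
print(second_difference(ce_loss, -2.0) > 0)   # CE is convex here
```

This non-convexity of the composed sigmoid-plus-MSE objective is one simple source of the flat regions and spurious stationary points that the analysis in the paper studies in generality.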
Our experiments show that the proposed method reliably avoids local optima and converges more effectively than existing optimizers on MNIST, CIFAR-10, and CIFAR-100 with a simple CNN. Our work provides both new theoretical insight into MSE optimization and a practical algorithm for training deep networks.
Primary Area: optimization
Submission Number: 10530