Nonconvex Optimization and Model Representation with Applications in Control Theory and Machine Learning

30 Sept 2022, OpenReview Archive Direct Upload
Abstract: In control and machine learning, the primary goal is to learn models that make predictions or decisions and act in the world. This thesis covers two important aspects of control theory and machine learning: model structures that allow low training and generalization error with few samples (i.e., low sample complexity), and convergence guarantees for first-order optimization algorithms on nonconvex problems.

If the model and the training algorithm exploit the structure of the data (such as sparsity or low-rankness), the model can be learned with low sample complexity. We present two results: a Hankel nuclear norm regularization method for learning a low-order system, and an overparameterized representation for linear meta-learning. The first result studies dynamical system identification under the assumption that the true system order is low. A low system order means that the state can be represented by a low-dimensional vector and that the system corresponds to a low-rank Hankel matrix. In matrix completion theory, low-rankness is known to be encouraged by nuclear norm regularization. We apply a nuclear norm regularized estimator to the Hankel matrix and show that it requires fewer samples than the ordinary least squares estimator. The second result studies linear meta-learning. The meta-learning algorithm contains two steps: learning a large model in the representation learning stage, and fine-tuning the model in the few-shot learning stage. Because the few-shot dataset contains only a few samples, avoiding overfitting requires a fine-tuning algorithm that uses the information from representation learning. We generalize the subspace-based model of prior art to a Gaussian model and describe the overparameterized meta-learning procedure. We show that feature-task alignment reduces the sample complexity of representation learning and that the optimal task representation is overparameterized.

First-order optimization methods, such as gradient-based methods, are widely used in machine learning thanks to their simplicity of implementation and fast convergence. However, the objective function in machine learning can be nonconvex, and first-order methods are in general only guaranteed to converge to a stationary point rather than a local or global minimum. We give a more refined analysis of these convergence guarantees and present two results: the convergence of a perturbed gradient descent approach to a local minimum on a Riemannian manifold, and a unified global convergence result for policy gradient descent on linear system control problems. The first result studies how Riemannian gradient descent converges to an approximate local minimum. While it is well known that perturbed gradient descent escapes saddle points in Euclidean space, less is known about the concrete convergence rate of Riemannian gradient descent on a manifold. We show that perturbed Riemannian gradient descent converges to an approximate local minimum and reveal the relation between the convergence rate and the manifold curvature.
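As a concrete illustration of the kind of perturbed update described above, the following is a minimal sketch of perturbed Riemannian gradient descent on the unit sphere, applied to the Rayleigh quotient f(x) = x^T A x, whose non-extremal eigenvectors are saddle points. The objective, step size, perturbation radius, and thresholds are illustrative choices, not the algorithm parameters or constants analyzed in the thesis.

```python
import numpy as np

# Minimal sketch of perturbed Riemannian gradient descent on the unit sphere,
# minimizing the Rayleigh quotient f(x) = x^T A x. The manifold, objective,
# step size, and perturbation radius are illustrative choices only.

def perturbed_rgd_sphere(A, x0, step=0.05, eps=1e-3, radius=1e-2, iters=1000, seed=0):
    """Run perturbed Riemannian gradient descent on the unit sphere."""
    rng = np.random.default_rng(seed)
    x = x0 / np.linalg.norm(x0)
    for _ in range(iters):
        egrad = 2 * A @ x                     # Euclidean gradient of x^T A x
        rgrad = egrad - (x @ egrad) * x       # project onto the tangent space at x
        if np.linalg.norm(rgrad) < eps:
            # near a first-order stationary point: add a small tangent perturbation
            xi = rng.standard_normal(x.shape)
            xi -= (x @ xi) * x
            x = x + radius * xi
        else:
            x = x - step * rgrad              # Riemannian gradient step
        x /= np.linalg.norm(x)                # retraction back onto the sphere
    return x

if __name__ == "__main__":
    rng = np.random.default_rng(1)
    M = rng.standard_normal((10, 10))
    A = (M + M.T) / 2
    w, V = np.linalg.eigh(A)                  # eigenvalues in ascending order
    # start near a non-extremal eigenvector, i.e. a saddle point of this problem
    x = perturbed_rgd_sphere(A, V[:, 1] + 1e-6 * rng.standard_normal(10))
    print("f(x) =", x @ A @ x, " smallest eigenvalue =", w[0])
```

Starting near a non-extremal eigenvector, the tangent-space perturbation lets the iterate escape the saddle and move toward the minimizer, the eigenvector of the smallest eigenvalue.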
The second result studies policy gradient descent applied to control. Many control problems have been revisited in the context of the recent boom in reinforcement learning (RL); however, there is a gap between the RL and control methodologies. Policy gradient methods in RL apply first-order updates to a nonconvex landscape, and it is hard to show that they converge to a global minimum, whereas control theory develops reparameterizations that make the problem convex and are proven to find the globally optimal controller in polynomial time. Aiming to interpret the success of the nonconvex approach, we connect the nonconvex policy gradient descent applied to a collection of control problems with their convex parameterizations and propose a unified proof of the global convergence of policy gradient descent.
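To make the nonconvex policy optimization setting concrete, here is a minimal sketch of model-based policy gradient descent on the infinite-horizon discrete-time LQR cost, using the standard exact gradient of the cost of a static state-feedback gain u = -Kx. The system matrices, initial stabilizing gain, step size, and iteration counts are illustrative assumptions, not the problems or constants treated in the thesis.

```python
import numpy as np

# Minimal sketch of model-based policy gradient descent on the infinite-horizon
# discrete-time LQR cost C(K) for a static state-feedback gain u = -K x. The
# system, initial gain, step size, and iteration counts are illustrative only.

def dlyap(F, W, iters=500):
    """Solve X = F X F^T + W by fixed-point iteration (F assumed stable)."""
    X = W.copy()
    for _ in range(iters):
        X = F @ X @ F.T + W
    return X

def lqr_cost_and_grad(K, A, B, Q, R, Sigma0):
    """Return C(K) and its gradient for x_{t+1} = A x_t + B u_t, u_t = -K x_t."""
    Acl = A - B @ K
    P = dlyap(Acl.T, Q + K.T @ R @ K)     # value matrix: P = Acl^T P Acl + Q + K^T R K
    Sigma = dlyap(Acl, Sigma0)            # state covariance under the closed loop
    cost = np.trace(P @ Sigma0)
    grad = 2 * ((R + B.T @ P @ B) @ K - B.T @ P @ A) @ Sigma
    return cost, grad

if __name__ == "__main__":
    A = np.array([[1.0, 0.2], [0.0, 1.0]])
    B = np.array([[0.0], [1.0]])
    Q, R = np.eye(2), np.eye(1)
    Sigma0 = np.eye(2)
    K = np.array([[0.5, 0.5]])            # an initial stabilizing gain
    for _ in range(2000):
        _, grad = lqr_cost_and_grad(K, A, B, Q, R, Sigma0)
        K = K - 1e-3 * grad               # plain gradient step on the nonconvex cost
    print("final LQR cost:", lqr_cost_and_grad(K, A, B, Q, R, Sigma0)[0])
```

Although C(K) is nonconvex in K, a sufficiently small step keeps the iterates inside the set of stabilizing gains, which is the regime in which global convergence analyses of this kind apply.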