Learning One-hidden-layer Neural Networks on Gaussian Mixture Models with Guaranteed Generalizability

28 Sept 2020 (modified: 05 May 2023)
ICLR 2021 Conference Blind Submission
Readers: Everyone
Keywords: neural networks, generalization, Gaussian mixture model, sample complexity, learning algorithm
Abstract: We analyze the problem of learning fully connected one-hidden-layer neural networks with the sigmoid activation function for binary classification in the teacher-student setup, where the labels are assumed to be generated by a ground-truth teacher network with unknown parameters, and the learning objective is to estimate the teacher model by minimizing a non-convex cross-entropy risk of the training data over a student network. This paper considers a general and practical scenario in which the input features follow a Gaussian mixture model with a finite number of Gaussian components of varying means and variances. We propose a gradient descent algorithm with a tensor initialization method and show that it converges linearly to a critical point whose distance to the ground-truth model diminishes, with guaranteed generalizability. We characterize the number of samples required for successful convergence, referred to as the sample complexity, as a function of the parameters of the Gaussian mixture model. We prove analytically that when any mean or variance in the mixture model is large, or when all variances are close to zero, the sample complexity increases and the convergence slows down, indicating a more challenging learning problem. Although the analysis focuses on one-hidden-layer neural networks, to the best of our knowledge, this paper provides the first explicit characterization of how the parameters of the input distribution affect the sample complexity and the convergence rate.
One-sentence Summary: This paper provides the first theoretical analysis of how the input data distribution affects learning performance, in terms of sample complexity and convergence rate.
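
For concreteness, the sketch below illustrates, under assumed and purely hypothetical settings, the setup described in the abstract: labels generated by a one-hidden-layer sigmoid teacher network on Gaussian-mixture inputs, and a student network of the same architecture trained by gradient descent on the cross-entropy risk. This is not the authors' implementation; in particular, the tensor initialization step is replaced by a simple random initialization, and all dimensions and hyperparameters are illustrative.

```python
# Minimal illustrative sketch (not the authors' code): a teacher-student setup with
# a one-hidden-layer sigmoid network and Gaussian-mixture inputs, trained by plain
# gradient descent on the empirical cross-entropy risk. The tensor initialization
# from the paper is replaced by a small random initialization; all dimensions and
# hyperparameters below are hypothetical.
import numpy as np

rng = np.random.default_rng(0)
d, K, L = 10, 5, 3                  # input dimension, hidden width, mixture components

# Gaussian mixture model for the input features x.
means = rng.normal(size=(L, d))
stds = 0.5 + rng.random(L)          # per-component isotropic standard deviations
weights = np.full(L, 1.0 / L)

def sample_inputs(n):
    comps = rng.choice(L, size=n, p=weights)
    return means[comps] + stds[comps, None] * rng.normal(size=(n, d))

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

# Ground-truth teacher network: P(y = 1 | x) = (1/K) * sum_k sigmoid(w*_k . x).
W_star = rng.normal(size=(K, d)) / np.sqrt(d)
def teacher_prob(X):
    return sigmoid(X @ W_star.T).mean(axis=1)

n = 5000
X = sample_inputs(n)
y = (rng.random(n) < teacher_prob(X)).astype(float)

# Student network with the same architecture; estimate W* by minimizing the
# non-convex cross-entropy risk with full-batch gradient descent.
W = 0.1 * rng.normal(size=(K, d))
lr = 0.2
for _ in range(3000):
    H = sigmoid(X @ W.T)                             # n x K hidden activations
    p = np.clip(H.mean(axis=1), 1e-8, 1 - 1e-8)      # student output probabilities
    # Chain rule: d(risk)/dp, then dp/dW_k = (1/K) * H_k * (1 - H_k) * x.
    dldp = (p - y) / (n * p * (1 - p))
    grad = ((dldp[:, None] * H * (1 - H)) / K).T @ X
    W -= lr * grad

p = np.clip(sigmoid(X @ W.T).mean(axis=1), 1e-8, 1 - 1e-8)
print("final cross-entropy risk:", -np.mean(y * np.log(p) + (1 - y) * np.log(1 - p)))
```

In such a toy run, the student weights typically approach the teacher's up to a permutation of the hidden units; the paper's results formalize this kind of recovery for its full algorithm and quantify how the mixture parameters affect it.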
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics
Reviewed Version (pdf): https://openreview.net/references/pdf?id=Kfc1GOnZ5_