On the Generalization Properties of Learning the Random Feature Models with Learnable Activation Functions
Abstract: This paper studies the generalization properties of a recently proposed kernel method, the Random Feature models with Learnable Activation Functions. By applying a data-dependent sampling scheme for generating features, we provide the sharpest bounds to date on the number of features required to learn these models in both regression and classification tasks. We present a unified theorem that characterizes the required number of features, and discuss the results for the plain sampling scheme and the data-dependent leverage weighted scheme. Through weighted sampling, the bound on the number of features in the mean squared error loss case is improved from a quadratic dependence on the inverse error to a fractional power in general cases, and even to a constant when the Gram matrix has finite rank. For the Lipschitz loss case, the bound is similarly improved. To learn the weighted models, we also propose an algorithm that finds an approximate kernel and then applies leverage weighted sampling. Empirical results show that the weighted models achieve the same performance as the plainly sampled models with significantly fewer features, validating our theory and the effectiveness of this method.