Small Coresets to Represent Large Training Data for Support Vector Machines


Nov 03, 2017 (modified: Dec 11, 2017) ICLR 2018 Conference Blind Submission readers: everyone Show Bibtex
  • Abstract: Support Vector Machines (SVMs) are one of the most popular algorithms for classification and regression analysis. Despite their popularity, even efficient implementations have proven to be computationally expensive to train at a large-scale, especially in streaming settings. In this paper, we propose a novel coreset construction algorithm for efficiently generating compact representations of massive data sets to speed up SVM training. A coreset is a weighted subset of the original data points such that SVMs trained on the coreset are provably competitive with those trained on the original (massive) data set. We provide both lower and upper bounds on the number of samples required to obtain accurate approximations to the SVM problem as a function of the complexity of the input data. Our analysis also establishes sufficient conditions on the existence of sufficiently compact and representative coresets for the SVM problem. We empirically evaluate the practical effectiveness of our algorithm against synthetic and real-world data sets.
  • TL;DR: We present an algorithm for speeding up SVM training on massive data sets by constructing compact representations that provide efficient and provably approximate inference.
  • Keywords: coresets, data compression