Large-scale logistic regression and linear support vector machines using Spark

Chieh-Yen Lin, Cheng-Hao Tsai, Ching-Pei Lee, Chih-Jen Lin

2014 (modified: 30 Jan 2022), IEEE BigData 2014

Abstract: Logistic regression and linear SVM are useful methods for large-scale classification. However, their distributed implementations have not been well studied. Spark, an in-memory cluster-computing platform, was recently proposed because the MapReduce framework is inefficient for iterative algorithms, and it has emerged as a popular framework for large-scale data processing and analytics. In this work, we consider a distributed Newton method for solving logistic regression as well as linear SVM and implement it on Spark. We carefully examine many implementation issues that significantly affect the running time and propose our solutions. After conducting thorough empirical investigations, we release an efficient and easy-to-use tool for the Spark community.
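
To make the idea of a distributed Newton method concrete, the following is a minimal sketch for L2-regularized logistic regression, not the released tool. It runs in plain Python without Spark: the list `parts` stands in for data partitions, each `local_*` function plays the role of a per-partition "map" step, and `reduce(add, ...)` plays the role of the aggregation step. The conjugate-gradient inner solver, the backtracking line search, the parameter `C`, the toy data, and all function names are assumptions made for illustration; the abstract does not specify these details.

```python
import math
from functools import reduce

def dot(a, b):
    return sum(x * y for x, y in zip(a, b))

def axpy(alpha, x, y):
    # Returns alpha * x + y for plain Python lists.
    return [alpha * xi + yi for xi, yi in zip(x, y)]

def add(a, b):
    return axpy(1.0, a, b)

def sigmoid(z):
    # Numerically stable logistic function.
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)

def local_loss(part, w):
    # Per-partition sum of log(1 + exp(-y * w.x)), computed stably.
    total = 0.0
    for x, y in part:
        m = y * dot(w, x)
        total += math.log1p(math.exp(-m)) if m >= 0 else -m + math.log1p(math.exp(m))
    return total

def local_grad(part, w):
    # Per-partition contribution to the gradient of the loss term.
    g = [0.0] * len(w)
    for x, y in part:
        g = axpy((sigmoid(y * dot(w, x)) - 1.0) * y, x, g)
    return g

def local_hess_vec(part, w, v):
    # Per-partition contribution to X^T D X v; the Hessian is never formed.
    h = [0.0] * len(w)
    for x, y in part:
        s = sigmoid(y * dot(w, x))
        h = axpy(s * (1.0 - s) * dot(x, v), x, h)
    return h

def objective(parts, w, C):
    return 0.5 * dot(w, w) + C * sum(local_loss(p, w) for p in parts)

def gradient(parts, w, C):
    # Aggregate partition results, then add the regularization term w.
    return axpy(C, reduce(add, [local_grad(p, w) for p in parts]), w)

def hess_vec(parts, w, v, C):
    return axpy(C, reduce(add, [local_hess_vec(p, w, v) for p in parts]), v)

def conjugate_gradient(parts, w, g, C, rel_tol=1e-8, max_iter=50):
    # Approximately solve H d = -g using only Hessian-vector products.
    d = [0.0] * len(w)
    r = [-gi for gi in g]
    p = list(r)
    rr = dot(r, r)
    stop = (rel_tol ** 2) * rr
    for _ in range(max_iter):
        if rr <= stop:
            break
        hp = hess_vec(parts, w, p, C)
        alpha = rr / dot(p, hp)
        d = axpy(alpha, p, d)
        r = axpy(-alpha, hp, r)
        rr_new = dot(r, r)
        p = axpy(rr_new / rr, p, r)
        rr = rr_new
    return d

def newton(parts, dim, C=1.0, max_iter=50, tol=1e-8):
    w = [0.0] * dim
    for _ in range(max_iter):
        g = gradient(parts, w, C)
        if math.sqrt(dot(g, g)) < tol:
            break
        d = conjugate_gradient(parts, w, g, C)
        # Backtracking (Armijo) line search along the Newton direction.
        f, slope, t = objective(parts, w, C), dot(g, d), 1.0
        while t > 1e-10 and objective(parts, axpy(t, d, w), C) > f + 1e-4 * t * slope:
            t *= 0.5
        w = axpy(t, d, w)
    return w

# Hypothetical toy data: two "partitions" of (features, label) pairs.
parts = [
    [([2.0, 1.0], 1), ([1.0, 2.0], 1), ([2.0, 2.0], 1)],
    [([-1.0, -2.0], -1), ([-2.0, -1.0], -1), ([-2.0, -2.0], -1)],
]
w = newton(parts, dim=2)
```

Each Newton iteration in this sketch needs only sums over partitions (loss, gradient, and Hessian-vector products), which is the property that makes the method a natural fit for a cluster setting: only vectors of the model's dimension would have to be exchanged between workers, while the data stays where it is.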
