Scalable Kernel $k$-Means With Randomized Sketching: From Theory to Algorithm

Rong Yin, Yong Liu, Weiping Wang, Dan Meng

Published: 01 Jan 2023, Last Modified: 20 Nov 2023 | IEEE Trans. Knowl. Data Eng. 2023 | Readers: Everyone

Abstract: Kernel $k$-means is a fundamental unsupervised learning method in data mining. Its computational requirements are typically at least quadratic in the number of data points, which is prohibitive in large-scale scenarios. To address this issue, we propose SKK, a novel randomized sketching approach based on circulant matrices. SKK projects the kernel matrix from the left and the right with the proposed sketch matrices to obtain a smaller matrix, and accelerates the matrix-matrix products with the fast Fourier transform by exploiting the circulant structure. This greatly reduces the computational requirements of the approximate kernel $k$-means estimator while retaining the same generalization bound as exact kernel $k$-means in the statistical setting. In particular, theoretical analysis shows that a sketch dimension of $\sqrt{n}$ is sufficient for SKK to achieve the optimal excess risk bound with only a fraction of the computation, where $n$ is the number of data points. Extensive experiments verify our theoretical analysis, and SKK achieves state-of-the-art performance on 12 real-world datasets. To the best of our knowledge, this is the first such breakthrough for unsupervised learning in randomized sketching.
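The two-sided projection and the FFT speed-up described in the abstract can be illustrated with a minimal sketch. It is not the paper's construction: the abstract does not specify the sketch matrices, so the Gaussian first column, the random sign flips, the uniform row subsampling, and the names `make_sketch`, `apply_sketch`, and `sketch_kernel` are all assumptions made for illustration. The code forms $S K S^\top$ for an $m \times n$ subsampled circulant matrix $S$ without ever building $S$ densely.

```python
import numpy as np

def circulant_matmul(c, X):
    """Multiply circ(c), the circulant matrix with first column c, by X via the FFT."""
    # circ(c) @ X is a circular convolution of c with every column of X,
    # which becomes an elementwise product in the Fourier domain.
    return np.real(np.fft.ifft(np.fft.fft(c)[:, None] * np.fft.fft(X, axis=0), axis=0))

def make_sketch(n, m, rng):
    """Draw the parameters of a hypothetical m x n circulant sketch (assumed design)."""
    c = rng.standard_normal(n) / np.sqrt(m)      # first column of the circulant matrix
    signs = rng.choice([-1.0, 1.0], size=n)      # random sign flips (diagonal D)
    rows = rng.choice(n, size=m, replace=False)  # rows kept after subsampling
    return c, signs, rows

def apply_sketch(c, signs, rows, X):
    """Compute S @ X with S = circ(c)[rows] @ diag(signs), without forming S."""
    return circulant_matmul(c, signs[:, None] * X)[rows]

def sketch_kernel(K, c, signs, rows):
    """Return the m x m matrix S @ K @ S.T for a symmetric kernel matrix K."""
    KS_T = apply_sketch(c, signs, rows, K).T     # (S K)^T = K S^T because K is symmetric
    return apply_sketch(c, signs, rows, KS_T)    # S (K S^T)
```

With $m \approx \sqrt{n}$, as the abstract suggests for the sketch dimension, a call such as `sketch_kernel(K, *make_sketch(n, m, np.random.default_rng(0)))` returns the reduced $m \times m$ matrix. In this sketch each product with $S$ costs $O(n \log n)$ per column through the FFT, compared with $O(mn)$ per column for a dense $S$.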
