Effective and Sparse Count-Sketch via k-means clustering

Yuhan Wang; Zijian Lei; Liang Lan

Published: 21 Feb 2024, Last Modified: 21 Feb 2024. Venue: SAI-AAAI2024 Poster. License: CC BY 4.0

Keywords: Count-Sketch; $k$-means; Random Projection; Sparse Projection

Abstract: Count-sketch is a popular matrix sketching algorithm that can produce a much smaller sketched matrix of an input data matrix $\mathbf{X}$ in $O(\mathrm{nnz}(\mathbf{X}))$ time while preserving most of its properties. Therefore, count-sketch is widely used for addressing the high-dimensionality challenge in machine learning. However, count-sketch has two main limitations: (1) The randomly generated sketching matrix used in count-sketch does not consider any intrinsic properties of $\mathbf{X}$. This data-oblivious method could produce a bad sketched matrix, which results in low accuracy for subsequent machine learning tasks (e.g., classification); (2) For highly sparse input data, count-sketch could produce a dense sketched data matrix and make the subsequent machine learning tasks more computationally expensive than on the original sparse data $\mathbf{X}$. To address these two limitations, we first show an interesting connection between count-sketch and $k$-means clustering by analyzing the reconstruction error of count-sketch. Based on our analysis, we propose to obtain the low-dimensional sketched matrix by applying $k$-means clustering on the columns of $\mathbf{X}$ and using the cluster centers as the low-dimensional sketched matrix. In addition, to produce a sparse sketched matrix, we propose to solve $k$-means clustering using gradient descent with $\epsilon$-$\mathcal{L}_1$ ball projection in each iteration. Our experimental results on six benchmark datasets demonstrate that our method achieves higher accuracy than the original count-sketch and other matrix sketching algorithms. Our results also demonstrate that our method produces a much sparser sketched data matrix than other methods, and therefore the prediction cost of our method is smaller than that of other methods.
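
For readers unfamiliar with the baseline, the following is a minimal sketch of standard (data-oblivious) count-sketch applied to the columns of $\mathbf{X}$. It is an illustration under stated assumptions, not the authors' implementation: the function name `count_sketch`, the `seed` parameter, and the dense NumPy representation are all hypothetical choices made here for clarity. Each column of $\mathbf{X}$ is assigned to one of $k$ buckets uniformly at random and multiplied by a random sign, which mirrors the way $k$-means assigns each column to exactly one cluster.

```python
import numpy as np

def count_sketch(X, k, seed=0):
    """Illustrative count-sketch: reduce the d columns of X (n x d) to k columns.

    Hypothetical helper for illustration only; a practical version would use
    sparse matrices to reach O(nnz(X)) time.
    """
    rng = np.random.default_rng(seed)
    n, d = X.shape
    buckets = rng.integers(0, k, size=d)       # random bucket for each column
    signs = rng.choice([-1.0, 1.0], size=d)    # random sign for each column
    # Sketching matrix S (d x k): exactly one nonzero (+1 or -1) per row.
    S = np.zeros((d, k))
    S[np.arange(d), buckets] = signs
    return X @ S, S

# Usage: sketch a small 3 x 4 matrix down to 2 columns.
X = np.arange(12.0).reshape(3, 4)
Z, S = count_sketch(X, k=2, seed=0)
```

Because several sparse columns can be summed into the same bucket, the sketched matrix `Z` can contain a larger fraction of nonzero entries than `X`, which is the density issue described in limitation (2).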

Submission Number: 3
