Approximate Multi-Matrix Multiplication for Streaming Power Iteration Clustering

17 Sept 2025 (modified: 11 Feb 2026) · Submitted to ICLR 2026 · CC BY 4.0
Keywords: Power Iteration, Clustering, Streaming, Random Projection
TL;DR: A streaming-friendly version of the power iteration operator for clustering.
Abstract: Given a graph, accurately and efficiently detecting its communities is a central challenge in network analysis. With datasets that routinely exceed terabytes in size, many classical algorithms for this problem become computationally prohibitive. We address this challenge in the context of the Stochastic Block Model (SBM), which admits a rigorous analysis. Our approach is a sublinear, updateable, single-pass approximation of a classic power iteration algorithm \citep{mukherjee2024detecting}. We introduce two sketching-based variants: (1) a \emph{streaming algorithm} for single-pass processing of edge streams, and (2) an \emph{$r$-pass algorithm} that achieves a smaller-space embedding at the cost of additional passes, one per power $r$ of the matrix being approximated. We show that both methods produce vertex embeddings that guarantee recovery of the largest cluster when performing single-linkage clustering with an appropriate \emph{separation scale} cut threshold. Our key contribution is a new theoretical analysis of Approximate Multi-Matrix Multiplication (AMMM), which guarantees that the error accumulated over repeated compression remains manageable. This framework extends the stable-rank-based approximate matrix multiplication (AMM) guarantees of \citet{cohen2016optimal} to arbitrarily many conforming matrices. We prove that both algorithms preserve the geometric structure needed to identify the largest community. The space of the streaming algorithm (1) scales with the stable rank of the graph matrix, which we show is sublinear in practice; the $r$-pass algorithm (2) achieves the optimal $O(\varepsilon^{-2}\log n)$ space. Experiments on synthetic graphs confirm that our methods recover the largest community as effectively as the exact, expensive algorithm, across both balanced and unbalanced communities, with dramatically lower memory and runtime.
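To make the AMM primitive behind the abstract concrete, here is a minimal, hypothetical sketch (not the authors' algorithm): a product $AB$ is approximated as $(AS^\top)(SB)$, where $S$ is a random Gaussian projection that compresses the shared inner dimension. The matrix sizes, sketch dimension `k`, and the Gaussian choice of sketch are all illustrative assumptions; the paper's AMMM analysis chains such compressions across many conforming factors.

```python
import numpy as np

rng = np.random.default_rng(0)
n, d, k = 500, 500, 200  # k = sketch dimension (assumed for illustration)

A = rng.standard_normal((n, d))
B = rng.standard_normal((d, n))

# Gaussian sketch S compresses the shared inner dimension d down to k.
# Scaling by 1/sqrt(k) makes S^T S an unbiased estimate of the identity.
S = rng.standard_normal((k, d)) / np.sqrt(k)

# Approximate product (A S^T)(S B) in place of the exact A B;
# the Frobenius error scales roughly like ||A||_F ||B||_F / sqrt(k).
approx = (A @ S.T) @ (S @ B)
exact = A @ B

rel_err = np.linalg.norm(approx - exact) / (
    np.linalg.norm(A) * np.linalg.norm(B)
)
```

With `k = 200` the relative error is on the order of $1/\sqrt{k} \approx 0.07$, far smaller than the matrices being multiplied, while the sketched factors `A @ S.T` and `S @ B` use only $O(nk)$ space each.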
Supplementary Material: zip
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 9960