Linear-Time Algorithms for Representative Subset Selection From Data Streams

Published: 29 Jan 2025, Last Modified: 29 Jan 2025
Venue: WWW 2025 Poster
License: CC BY 4.0
Track: Web mining and content analysis
Keywords: web data mining, streaming algorithm, data summarization, submodular maximization
Abstract: Representative subset selection from data streams is a critical problem with wide-ranging applications in web data mining and machine learning, such as social media marketing, big data summarization, and recommendation systems. This problem is often framed as maximizing a monotone submodular function subject to a knapsack constraint, where each data element in the stream has an associated cost, and the goal is to select elements within a budget $B$ to maximize revenue. However, existing algorithms typically rely on restrictive assumptions about the costs of data elements, and their performance bounds heavily depend on the budget $B$. As a result, these algorithms are only effective in limited scenarios and have super-linear time complexity, making them unsuitable for large-scale data streams. In this paper, we introduce the first linear-time streaming algorithms for this problem, without any assumptions on the data stream, while also minimizing memory usage. Specifically, our single-pass streaming algorithm achieves an approximation ratio of $1/8-\epsilon$ under $\mathcal{O}(n)$ time complexity and $\mathcal{O}(k\log\frac{1}{\epsilon})$ space complexity, where $k$ is the largest cardinality of any feasible solution. Our multi-pass streaming algorithm improves this to a $(1/2-\epsilon)$-approximation using only three passes over the stream, with $\mathcal{O}(\frac{n}{\epsilon}\log\frac{1}{\epsilon})$ time complexity and $\mathcal{O}(\frac{k}{\epsilon}\log\frac{1}{\epsilon})$ space complexity. Extensive experiments across various applications related to web data mining and social media marketing demonstrate the superiority of our algorithms in terms of both effectiveness and efficiency.
Submission Number: 612
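
To make the problem setting in the abstract concrete, the following is a minimal sketch of monotone submodular maximization under a knapsack constraint in a single pass. It is not the paper's algorithm and carries none of its guarantees: it uses a set-coverage objective as a stand-in for the submodular function and a simple fixed density threshold as the selection rule. The names `coverage`, `threshold_stream`, `covers`, and `tau` are hypothetical and chosen only for illustration.

```python
from typing import Dict, Iterable, List, Set, Tuple


def coverage(selected: Iterable[str], covers: Dict[str, Set[int]]) -> int:
    """Illustrative monotone submodular objective: number of distinct items covered."""
    covered: Set[int] = set()
    for e in selected:
        covered |= covers[e]
    return len(covered)


def threshold_stream(
    stream: Iterable[Tuple[str, float]],
    covers: Dict[str, Set[int]],
    budget: float,
    tau: float,
) -> Tuple[List[str], float]:
    """Single pass over (element, cost) pairs with a fixed density threshold.

    An element is kept if it still fits in the budget and its marginal gain
    per unit cost is at least tau. This is a generic heuristic assumed for
    illustration, not the method described in the abstract.
    """
    chosen: List[str] = []
    spent = 0.0
    for e, cost in stream:
        if spent + cost > budget:
            continue  # knapsack constraint: total cost must stay within budget
        gain = coverage(chosen + [e], covers) - coverage(chosen, covers)
        if gain >= tau * cost:
            chosen.append(e)
            spent += cost
    return chosen, spent


# Toy data (made up): each element covers some item ids and has a cost.
covers = {"a": {1, 2, 3}, "b": {3, 4}, "c": {1, 2, 3, 4, 5, 6}, "d": {7}}
stream = [("a", 2.0), ("b", 2.0), ("c", 3.0), ("d", 1.0)]
chosen, spent = threshold_stream(stream, covers, budget=5.0, tau=1.0)
print(chosen, spent, coverage(chosen, covers))
```

In this toy run the early, cheap element `a` uses part of the budget before the more valuable `c` arrives, which hints at why the order of arrival and the element costs make the streaming setting harder than the offline one.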