Abstract: The sheer scale of big data causes information overload, and there is an urgent need for tools that can draw valuable insights from massive data. This paper investigates the core items tracking (CIT) problem, where the goal is to continuously track representative items, called core items, in a data stream so as to best represent or summarize the stream. To satisfy the recency and continuity requirements simultaneously, we consider CIT over probabilistic-decaying streams, where items in the stream are forgotten gradually in a probabilistic manner. We first introduce an algorithm, called PNDCIT, to find core items in a special kind of probabilistic non-decaying stream. Then, using PNDCIT as a building block, we design two novel algorithms, namely PDCIT and PDCIT+, that maintain core items over probabilistic-decaying streams with constant approximation ratios. Finally, extensive experiments on real data demonstrate that PDCIT+ achieves a speedup of up to one order of magnitude over a batch algorithm while providing solutions of comparable quality.