Tracking the Evolution of Clusters in Social Media Streams

Published: 01 Jan 2023, Last Modified: 12 May 2023, IEEE Trans. Big Data 2023, Readers: Everyone
Abstract: Tracking the evolution of clusters in social media streams is becoming increasingly important for many applications, such as early detection and monitoring of natural disasters or pandemics. In contrast to clustering on a static set of data, streaming data clustering does not have a global view of the complete data. The local (or partial) view in a high-speed stream makes clustering a challenging task. In this paper, we propose a novel density peak based algorithm, TStream, for tracking the evolution of clusters and outliers in social media streams, via the evolutionary actions of cluster adjustment, emergence, disappearance, split, and merge. TStream is based on a temporal decay model and text stream summarisation. The decay model captures the decreasing importance of textual documents over time. The stream summarisation compactly represents them with the help of cells (aka micro-clusters) in the memory. We also propose a novel efficient index called shared dependency tree (aka SD-Tree) based on the ideas of density peak and shared dependency. It maintains the dynamic dependency relationships in TStream and thereby improves the overall efficiency. We conduct extensive experiments on five real datasets. TStream outperforms the existing state-of-the-art solutions based on MStream, MStreamF, EDMStream, OSGM, and EStream, in terms of cluster mapping measure (CMM) by up to 17.8%, 18.6%, 6.9%, 16.4%, and 20.1%, respectively. It is also significantly more efficient than MStream, MStreamF, OSGM, and EStream, in terms of response time and throughput.
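The abstract does not give the form of the decay model or the distance used between cells, so the following is only a rough illustration of how temporal decay and density-peak dependencies might fit together, not TStream itself. It assumes an exponential decay with a hypothetical `decay_rate`, represents each cell by the timestamps of its documents, and uses toy 1-D positions in place of a text similarity measure. All function and variable names are made up for this sketch.

```python
import math

def decayed_weight(age, decay_rate=0.1):
    # Assumed exponential decay: a document `age` time units old
    # contributes exp(-decay_rate * age) to its cell's density.
    return math.exp(-decay_rate * age)

def cell_densities(cells, now, decay_rate=0.1):
    # Density of a cell = sum of the decayed weights of its documents.
    return {
        cid: sum(decayed_weight(now - t, decay_rate) for t in stamps)
        for cid, stamps in cells.items()
    }

def dependencies(densities, distance):
    # Density-peak idea: each cell depends on its nearest cell of
    # strictly higher density; the densest cell has no dependency.
    deps = {}
    for c, rho in densities.items():
        higher = [o for o, r in densities.items() if r > rho]
        deps[c] = min(higher, key=lambda o: distance(c, o)) if higher else None
    return deps

# Toy data: cell id -> document timestamps, plus hypothetical 1-D positions.
cells = {"a": [0, 9, 10], "b": [10], "c": [1]}
pos = {"a": 0.0, "b": 1.0, "c": 5.0}

rho = cell_densities(cells, now=10)
deps = dependencies(rho, lambda x, y: abs(pos[x] - pos[y]))
print(deps)  # {'a': None, 'b': 'a', 'c': 'b'}
```

In this toy run, cell `c` holds only an old document, so its density has decayed below that of `b`, and it attaches to `b` as its nearest denser neighbour. Following the dependency links upward from any cell leads to the peak `a`, which is the sense in which dependencies form a tree.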