Abstract: Twitter hashtags provide a high-level summary of tweets, and clustering hashtags has many applications. Existing text-based methods, which rely on the explicit words in tweets, are greatly affected by the sparsity of short tweet texts and the low co-occurrence rates of hashtags in tweets. Meanwhile, semantically related hashtags that use different textual expressions may show similar temporal patterns (i.e., how the frequency of hashtag usage changes over time), which can help capture events, opinions and synonyms. In this paper, we propose a novel method, Clustering Hashtags by their Temporal Patterns (CHTP), as a complement to text-based methods. In CHTP, hashtags are represented as hashtag time series that show their temporal patterns, so hashtag clusters can be discovered by clustering hashtag time series. Density-based clustering algorithms are suitable for discovering naturally shaped hashtag clusters, but because they use a single distance threshold to define density, they are not fine-grained enough to differentiate clusters of various density levels. Therefore, we develop a new parameter-free Density-Sensitive Clustering (DSC) algorithm to discover clusters of different density levels and use it in CHTP to group hashtags by temporal patterns. DSC recursively partitions the dataset from coarse-grained to fine-grained, using adaptive distance thresholds, to discover hashtag clusters of different density levels. Experiments conducted on Twitter datasets show that the DSC algorithm finds hashtag clusters of different densities more effectively than counterpart methods, and that CHTP (using DSC) can discover meaningful hashtag clusters, 36% of which cannot be found by text-based approaches.
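To make the hashtag time series representation concrete, the following is a minimal sketch, not the paper's implementation and not the DSC algorithm. It assumes tweets arrive as (hashtag, Unix timestamp) pairs, counts usages in fixed-width time bins, normalizes each series so that only its shape matters, and compares two hashtags by Euclidean distance. The bin width, the normalization, the distance measure, and all function names (`build_time_series`, `normalize`, `euclidean`) are illustrative assumptions.

```python
from math import sqrt


def build_time_series(events, num_bins, bin_seconds, start_time=0):
    """Count hashtag usages per time bin (hypothetical helper).

    events: iterable of (hashtag, unix_timestamp) pairs.
    Returns a dict mapping each hashtag to a list of per-bin counts.
    """
    series = {}
    for tag, ts in events:
        idx = int((ts - start_time) // bin_seconds)
        if 0 <= idx < num_bins:  # ignore events outside the observed window
            series.setdefault(tag, [0] * num_bins)[idx] += 1
    return series


def normalize(counts):
    """Scale counts to sum to 1 so popular and rare hashtags are comparable."""
    total = sum(counts)
    return [c / total for c in counts] if total else list(counts)


def euclidean(a, b):
    """Euclidean distance between two equal-length series (assumed measure)."""
    return sqrt(sum((x - y) ** 2 for x, y in zip(a, b)))


# Toy data: two hashtags used in the same hours, one used later.
events = [
    ("#nba", 0), ("#nba", 10), ("#nba", 3700),
    ("#basketball", 5), ("#basketball", 3650),
    ("#monday", 7300), ("#monday", 7400),
]
series = build_time_series(events, num_bins=3, bin_seconds=3600)
shapes = {tag: normalize(counts) for tag, counts in series.items()}

d_similar = euclidean(shapes["#nba"], shapes["#basketball"])
d_different = euclidean(shapes["#nba"], shapes["#monday"])
print(d_similar < d_different)
```

In this toy data the two hashtags share no text but have nearby time series, which is the kind of relation a clustering step over such series could pick up.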