Abstract: A huge amount of user-generated content is created on major social media platforms every day, and it is very difficult for users to quickly grasp the most important topics from such big data. Document clustering techniques are often used for topic detection in news, but applying them to social media remains challenging. Firstly, since social media posts are usually very short, it is hard to capture their semantic meaning. Secondly, given the huge number of social media posts, clustering efficiency becomes unacceptable. Among agglomerative hierarchical clustering methods, centroid clustering is much more efficient, but it suffers from the issue of inversions (reversals). In this paper, we propose to detect popular topics from social media posts using centroid clustering of their word embeddings based on dot-product similarity. Firstly, we extract keywords from posts with word segmentation, and each document is represented by the word embeddings of its keywords. Secondly, topics are extracted from the clustering results of social media posts by calculating the dot-product similarity of their word embeddings. Finally, the popularity of each topic is estimated from the aggregate sentiment ratings of user replies. In experiments on the PTT discussion forum, centroid clustering with dot-product similarity of word embeddings achieves better clustering efficiency with comparable effectiveness in terms of Adjusted Rand Index (ARI) and Adjusted Mutual Information (AMI). Further investigation is needed to verify its effectiveness on other social media sources.
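The clustering step described in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes each post is represented by the mean of its keyword embeddings (the abstract does not state the pooling method), it stops at a fixed number of clusters, and it uses a naive merge loop rather than an efficient one. The names `doc_vector`, `centroid_clustering`, and the toy two-dimensional embeddings are hypothetical.

```python
import numpy as np

def doc_vector(keywords, embeddings):
    """Represent a post by the mean of its keyword embeddings (assumed pooling)."""
    vecs = [embeddings[w] for w in keywords if w in embeddings]
    if not vecs:
        raise ValueError("no keyword has an embedding")
    return np.mean(vecs, axis=0)

def centroid_clustering(doc_vecs, n_clusters):
    """Naive agglomerative centroid clustering with dot-product similarity.

    At each step, the two clusters whose centroids have the largest
    dot product are merged, until n_clusters remain.
    """
    X = np.asarray(doc_vecs, dtype=float)
    clusters = [[i] for i in range(len(X))]      # member indices per cluster
    sums = [X[i].copy() for i in range(len(X))]  # running vector sum per cluster
    while len(clusters) > n_clusters:
        cents = np.array([s / len(c) for s, c in zip(sums, clusters)])
        sim = cents @ cents.T                    # pairwise centroid dot products
        np.fill_diagonal(sim, -np.inf)           # never merge a cluster with itself
        a, b = np.unravel_index(np.argmax(sim), sim.shape)
        a, b = min(a, b), max(a, b)
        clusters[a] += clusters[b]               # merge b into a
        sums[a] = sums[a] + sums[b]
        del clusters[b], sums[b]
    labels = np.empty(len(X), dtype=int)
    for k, members in enumerate(clusters):
        labels[members] = k
    return labels

# Toy data: hypothetical 2-D "embeddings" for four keywords.
toy_embeddings = {
    "cat": np.array([1.0, 0.0]),
    "dog": np.array([0.9, 0.1]),
    "stock": np.array([0.0, 1.0]),
    "bank": np.array([0.1, 0.9]),
}
posts = [["cat", "dog"], ["dog"], ["stock"], ["bank", "stock"]]
vectors = [doc_vector(p, toy_embeddings) for p in posts]
labels = centroid_clustering(vectors, n_clusters=2)
```

On this toy input the two animal posts end up in one cluster and the two finance posts in the other. Given ground-truth topic labels, clusterings like this could be scored with ARI and AMI, for example through scikit-learn's `adjusted_rand_score` and `adjusted_mutual_info_score`.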