- Abstract: Many applications require semantic understanding of short texts, and inferring discriminative and coherent latent topics is a critical and fundamental task in these applications. Conventional topic models largely rely on word co-occurrences to derive topics from a collection of documents. However, due to the length of each document, short texts are much more sparse in terms of word co-occurrences. Recent studies show that the Dirichlet Multinomial Mixture (DMM) model is effective for topic inference over short texts by assuming that each piece of short text is generated by a single topic. However, DMM has two main limitations. First, even though it seems reasonable to assume that each short text has only one topic because of its shortness, the definition of “shortness” is subjective and the length of the short texts is dataset dependent. That is, the single-topic assumption may be too strong for some datasets. To address this limitation, we propose to model the topic number as a Poisson distribution, allowing each short text to be associated with a small number of topics (e.g., one to three topics). This model is named PDMM. Second, DMM (and also PDMM) does not have access to background knowledge (e.g., semantic relations between words) when modeling short texts. When a human being interprets a piece of short text, the understanding is not solely based on its content words, but also their semantic relations. Recent advances in word embeddings offer effective learning of word semantic relations from a large corpus. Such auxiliary word embeddings enable us to address the second limitation. To this end, we propose to promote the semantically related words under the same topic during the sampling process, by using the generalized Pólya urn (GPU) model. Through the GPU model, background knowledge about word semantic relations learned from millions of external documents can be easily exploited to improve topic modeling for short texts. By directly extending the PDMM model with the GPU model, we propose two more effective topic models for short texts, named GPU-DMM and GPU-PDMM. Through extensive experiments on two real-world short text collections in two languages, we demonstrate that PDMM achieves better topic representations than state-of-the-art models, measured by topic coherence. The learned topic representation leads to better accuracy in a text classification task, as an indirect evaluation. Both GPU-DMM and GPU-PDMM further improve topic coherence and text classification accuracy. GPU-PDMM outperforms GPU-DMM at the price of higher computational costs.
- Artifact Type: Code
- Requested Badges: Artifacts Available