Relating Romanized Comments to News Articles by Inferring Multi-glyphic Topical Correspondence
Abstract: Commenting is a popular facility provided by news sites.
Analyzing such user-generated content has recently attracted research
interest. However, in multilingual societies such as India, analyzing
such user-generated content is hard for several reasons: (1) There are
more than 20 official languages
but linguistic resources are available mainly for Hindi. It is
observed that people frequently use romanized text because it is
quick and easy to type on an English keyboard, resulting in
multi-glyphic comments, where the texts are in the same language
but in different scripts. Such romanized texts remain almost
unexplored in machine learning. (2) In many cases, comments
are made on a specific part of the article rather than
the topic of the entire article. Off-the-shelf methods such as
correspondence LDA are insufficient to model such relationships
between articles and comments. In this paper, we extend
the notion of correspondence to model multi-lingual,
multi-script, and inter-lingual topics in a unified probabilistic
model called the Multi-glyphic Correspondence Topic Model
(MCTM). Using several metrics, we verify our approach and
show that it improves over the state of the art.