Learning Polylingual Topic Models from Code-Switched Social Media Documents

Nanyun Peng, Yiming Wang, Mark Dredze

2014 (modified: 16 Jul 2019) | ACL (2) 2014 | Readers: Everyone

Abstract: Code-switched documents are common in social media, providing evidence for polylingual topic models to infer aligned topics across languages. We present Code-Switched LDA (csLDA), which infers language-specific topic distributions from code-switched documents to facilitate multilingual corpus analysis. We experiment on two code-switching corpora (English-Spanish Twitter data and English-Chinese Weibo data) and show that csLDA improves perplexity over LDA and learns semantically coherent aligned topics, as judged by human annotators.
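
For readers unfamiliar with the perplexity comparison mentioned in the abstract, the following is a minimal sketch of how held-out perplexity is commonly computed for an LDA-style model. It is not the paper's evaluation protocol or the csLDA model itself. The function names `token_prob` and `perplexity`, the toy vocabulary, and the hand-set distributions `theta` (per-document topic proportions) and `phi` (per-topic word probabilities) are all illustrative assumptions.

```python
import math

def token_prob(theta, phi, word):
    """p(word | doc) = sum over topics k of theta[k] * phi[k][word].

    theta: list of topic proportions for one document (hypothetical values).
    phi:   list of dicts mapping word -> probability, one dict per topic.
    """
    return sum(t * topic.get(word, 0.0) for t, topic in zip(theta, phi))

def perplexity(docs, thetas, phi):
    """exp of the negative average per-token log-likelihood; lower is better.

    Assumes every token has non-zero probability under the model.
    """
    log_lik = 0.0
    n_tokens = 0
    for doc, theta in zip(docs, thetas):
        for word in doc:
            log_lik += math.log(token_prob(theta, phi, word))
            n_tokens += 1
    return math.exp(-log_lik / n_tokens)

# Toy topics, each mixing English and Spanish words (illustrative only).
phi = [
    {"goal": 0.5, "gol": 0.5},
    {"vote": 0.5, "voto": 0.5},
]
docs = [["goal", "gol"], ["vote", "voto"]]

# A model that assigns each document to its matching topic...
focused = perplexity(docs, [[1.0, 0.0], [0.0, 1.0]], phi)
# ...versus one that spreads every document evenly over both topics.
diffuse = perplexity(docs, [[0.5, 0.5], [0.5, 0.5]], phi)

print(focused < diffuse)
```

In this toy setting the model with focused topic proportions gives each held-out token a higher probability, so its perplexity is lower. That is the sense in which a lower perplexity indicates a better fit.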