Scaling Up Word Clustering

Jon Dehdari, Liling Tan, Josef van Genabith

2016 (modified: 16 Jul 2019)HLT-NAACL Demos 2016Readers: Everyone

Abstract: Word clusters improve performance in many NLP tasks including training neural network language models, but current increases in datasets are outpacing the ability of word clusterers to handle them. In this paper we present a novel bidirectional, interpolated, refining, and alternating (BIRA) predictive exchange algorithm and introduce ClusterCat, a clusterer based on this algorithm. We show that ClusterCat is 3‐85 times faster than four other well-known clusterers, while also improving upon the predictive exchange algorithm’s perplexity by up to 18% . Notably, ClusterCat clusters a 2.5 billion token English News Crawl corpus in 3 hours. We also evaluate in a machine translation setting, resulting in shorter training times achieving the same translation quality measured in BLEU scores. ClusterCat is portable and freely available.

0 Replies