Abstract: Word usage is influenced by diverse factors, including topic, genre and various speaker/author characteristics. To characterize these aspects of language, we introduce the “Multi-Factor Sparse Plus Low Rank” exponential language model, which allows supervised joint training of arbitrary overlapping factor-specific model components. This flexible architecture is highly interpretable: the elements of the sparse parameter matrices can be viewed as factor-dependent corrections (e.g. topic- or speaker-dependent phenomena). In topic modeling experiments on conversational telephone speech, we obtain modest perplexity reductions over an n-gram baseline and demonstrate topic-dependent keyword extraction that leads to a 13% (absolute) improvement in precision over TF-IDF. We also show how keywords can be jointly learned for speakers, roles and topics in a study of Supreme Court oral arguments.
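The abstract does not state the model equation, so the following is only a minimal sketch of how factor-dependent corrections could enter an exponential (log-linear) language model. It assumes each word score is a shared weight matrix applied to a history feature vector, plus one sparse correction matrix for every factor that is active in the current context. The names `word_distribution`, `shared`, `corrections`, and the factor labels are hypothetical and are not taken from the paper; the shared matrix is stored densely here for brevity rather than in factored low-rank form, and no training is shown.

```python
import math


def softmax(scores):
    """Normalize raw scores into a probability distribution."""
    top = max(scores)  # subtract the max for numerical stability
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def word_distribution(history, shared, corrections, active):
    """Assumed form: p(w | h, factors) is proportional to
    exp(row_w(shared + sum of active corrections) . h).

    history     -- feature vector for the word history
    shared      -- vocab x features matrix (the shared component)
    corrections -- {factor: {(word_index, feature_index): value}}, sparse
    active      -- factor labels that apply to this context
    """
    # Shared scores: one dot product per vocabulary word.
    scores = [sum(a * h for a, h in zip(row, history)) for row in shared]
    # Add the sparse corrections of each active factor.
    for factor in active:
        for (w, j), value in corrections[factor].items():
            scores[w] += value * history[j]
    return softmax(scores)


# Toy setup (illustrative values only, not results from the paper).
vocab = ["the", "court", "game"]
shared = [[1.0, 1.0], [0.0, 0.0], [0.0, 0.0]]  # rank-1 shared matrix
corrections = {
    "topic=law": {(1, 0): 2.0},     # boosts "court"
    "topic=sports": {(2, 0): 2.0},  # boosts "game"
}
history = [1.0, 0.0]

base = word_distribution(history, shared, corrections, [])
law = word_distribution(history, shared, corrections, ["topic=law"])
```

Under these assumptions, the nonzero entries of a factor's correction matrix point directly at the words whose probability that factor raises or lowers, which is one way to read the interpretability and keyword-extraction claims in the abstract.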