TL;DR: We propose contextual temperature, a mechanism that scales the softmax temperature of a language model based on the context of each word. Contextual temperatures co-adapt with the model parameters and are learned during training.
Abstract: Temperature scaling has been widely used to improve performance on NLP tasks that use a softmax decision layer. Current practice either assumes a fixed temperature or lets the temperature change dynamically but only along a fixed schedule; little is known about an optimal temperature trajectory that can change with the context. In this paper, we propose contextual temperature, a mechanism that allows the temperature to change over the context for each vocabulary item and to co-adapt with the model parameters during training. Experimental results show that contextual temperature significantly improves over state-of-the-art language models: our model CT-MoS achieves a perplexity of 55.31 on the Penn Treebank test set and 62.89 on the WikiText-2 test set. An in-depth analysis shows that the learned temperature schedule varies dramatically across the vocabulary, and that the optimal temperature trajectory drops as the context grows longer, suppressing uncertainty in language modeling. This evidence further justifies the need for contextual temperature and explains its advantage over a fixed temperature or a fixed schedule.
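To make the mechanism concrete, below is a minimal sketch of a context-dependent, per-vocabulary temperature applied before the softmax. The module name, the linear parameterization of the temperature, and the softplus-plus-offset trick are our own illustrative assumptions, not the paper's exact CT-MoS formulation (which builds on a mixture-of-softmaxes decoder).

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class ContextualTemperatureSoftmax(nn.Module):
    """Softmax whose temperature is predicted from the context,
    separately for each vocabulary item (illustrative sketch only)."""

    def __init__(self, hidden_size: int, vocab_size: int):
        super().__init__()
        self.decoder = nn.Linear(hidden_size, vocab_size)    # produces logits
        self.temp_proj = nn.Linear(hidden_size, vocab_size)  # per-word temperatures

    def forward(self, context: torch.Tensor) -> torch.Tensor:
        # context: (batch, hidden_size) hidden state summarizing the prefix
        logits = self.decoder(context)                        # (batch, vocab)
        # Softplus keeps temperatures positive; the small offset avoids
        # dividing by values near zero. Both choices are assumptions.
        temperature = F.softplus(self.temp_proj(context)) + 1e-3
        return F.softmax(logits / temperature, dim=-1)

# Toy usage: rows of the output are valid probability distributions.
layer = ContextualTemperatureSoftmax(hidden_size=32, vocab_size=100)
probs = layer(torch.randn(4, 32))
print(probs.shape, probs.sum(dim=-1))  # torch.Size([4, 100]), rows sum to 1
```

Because the temperature projection shares the same context vector as the decoder, its parameters receive gradients from the language-modeling loss and co-adapt with the rest of the model, which is the property the abstract emphasizes.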
Keywords: natural language processing, language modeling, sequence modeling, temperature scaling
Community Implementations: [1 code implementation](https://www.catalyzex.com/paper/arxiv:2012.13575/code)