The Likelihood Gain of a Language Model as a Metric for Text Summarization

Published: 15 Apr 2024, Last Modified: 07 May 2024 · Learn to Compress @ ISIT 2024 Poster · CC BY 4.0
Keywords: summarization, lossless compression, logarithmic loss, semantic compression, log-likelihood gain
TL;DR: The gain in log-likelihood of a text under a language model when provided with a summary is shown to be a useful metric for summarization, both theoretically and empirically.
Abstract: Consider the gain in the log-likelihood of a text under a language model (LM) when the text's summary is provided as context to the LM, compared to when no summary is in the context. This gain has been proposed as a reference-free similarity index for evaluating the relevance of a summary to the text. We strengthen this proposal by providing an information-theoretic interpretation of the index and an empirical analysis of the parts of speech that affect it most. Specifically, we first show that this similarity index corresponds to the reduction in binary codelength when the summary is provided as side information to a lossless text compression system consisting of the LM and an entropy encoder. Consequently, under proper normalization, the index leads to a form of the well-studied Normalized Compression Distance (NCD) and thus adheres to a universal measure of information distance. Empirical results show that an NCD based on the log-likelihood gain (LLG) correlates better with human annotator judgments than a gzip-based NCD, although the two are themselves fairly strongly correlated with each other. Additionally, we show empirically that LLG is affected almost exclusively by tokens associated with the text's content rather than tokens associated with its structure. This observation supports LLG as a natural and useful similarity index for evaluating and designing text summarization methods.
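
Illustration: the following is a minimal sketch of how the log-likelihood gain could be computed with an off-the-shelf causal LM through Hugging Face Transformers. The model name ("gpt2"), the prompt layout (summary simply prepended to the text), and the absence of any length normalization are illustrative assumptions, not necessarily the configuration used in the paper; up to a factor of log 2, the negative log-likelihood is the codelength an entropy encoder driven by the LM would assign to the text.

```python
# A minimal sketch of the log-likelihood gain (LLG), assuming a Hugging Face
# causal LM; "gpt2", the prompt layout (summary prepended as context), and the
# lack of length normalization are illustrative choices, not the paper's setup.
import torch
from transformers import AutoModelForCausalLM, AutoTokenizer

tokenizer = AutoTokenizer.from_pretrained("gpt2")
model = AutoModelForCausalLM.from_pretrained("gpt2")
model.eval()


def log_likelihood(text: str, context: str = "") -> float:
    """Sum of log-probabilities (in nats) of the tokens of `text`,
    conditioned on an optional `context` placed before it."""
    prefix_ids = [tokenizer.bos_token_id]           # start-of-sequence token
    if context:
        prefix_ids += tokenizer(context).input_ids  # summary as side information
    text_ids = tokenizer(text).input_ids
    input_ids = torch.tensor([prefix_ids + text_ids])

    with torch.no_grad():
        logits = model(input_ids).logits            # (1, seq_len, vocab_size)

    # The logit at position t-1 predicts the token at position t.
    log_probs = torch.log_softmax(logits[0, :-1], dim=-1)
    targets = input_ids[0, 1:]
    token_lls = log_probs.gather(-1, targets.unsqueeze(-1)).squeeze(-1)

    # Score only the text tokens, never the BOS/context prefix.
    return token_lls[len(prefix_ids) - 1:].sum().item()


def llg(text: str, summary: str) -> float:
    """Log-likelihood gain: extra log-likelihood of `text` when `summary`
    is given as context, i.e. log p(text | summary) - log p(text)."""
    return log_likelihood(text, context=summary) - log_likelihood(text)


# Example: the gain should be larger for a relevant summary than for an
# unrelated one.
article = "The central bank raised interest rates by half a percentage point."
good = "Interest rates were raised."
bad = "A new species of frog was discovered."
print(llg(article, good), llg(article, bad))
```

To turn this into an NCD-style distance, the standard formula NCD(x, y) = (C(xy) - min{C(x), C(y)}) / max{C(x), C(y)} can be instantiated with the LM codelength C(.) = -log-likelihood(.) / log 2 in place of a general-purpose compressor; the exact normalization adopted in the paper may differ from this sketch.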
Submission Number: 2