An Application of Pseudo-log-likelihoods to Natural Language Scoring

Published: 28 Jan 2022, Last Modified: 22 Oct 2023, ICLR 2022 Submission
Abstract: Language models built using semi-supervised machine learning on large corpora of natural language have very quickly enveloped the fields of natural language generation and understanding. In this paper we apply a zero-shot approach, independently developed by several researchers, that is now gaining recognition as a significant alternative to fine-tuning for evaluation on common sense tasks. A language model with relatively few parameters and training steps (albert-xxlarge-v2) can outperform a more recent language model (T5) on a recent large data set (TimeDial), while displaying robustness in its performance across a similar class of language tasks. Surprisingly, this result is achieved with a hyperparameter-free zero-shot method applied to the smaller model, compared against fine-tuning of the larger model. We argue that the robustness of the smaller model ought to be understood in terms of compositionality, in a sense drawn from recent literature on a class of similar models. We identify a practical cost of our method and model: high GPU-time for natural language evaluation. The zero-shot measurement technique that produces this remarkable stability, both for ALBERT and other BERT variants, is an application of pseudo-log-likelihoods to masked language models for the relative measurement of probability for substitution alternatives in forced-choice language tasks such as the Winograd Schema Challenge, Winogrande, CommonsenseQA, and others. One contribution of this paper is to bring together a number of similar but independent strands of research. We produce some absolute state-of-the-art (SOTA) results for common sense reasoning in binary-choice tasks, performing better than any published result in the literature, including fine-tuned efforts. On other tasks our results are SOTA relative to published methods similar to our own, in some cases by wide margins, but below absolute SOTA for fine-tuned alternatives. In addition, we show a remarkable consistency in the model's performance under adversarial settings, which we argue is best explained by the model's compositionality of representations.
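The pseudo-log-likelihood (PLL) scoring described above can be illustrated with a minimal sketch, assuming the Hugging Face transformers library and the albert-xxlarge-v2 checkpoint named in the abstract; the helper function and the Winograd-style example pair below are illustrative, not taken from the paper's code.

```python
# Minimal sketch of pseudo-log-likelihood (PLL) scoring with a masked
# language model. Each token is masked in turn and the log probability
# of the true token is summed; the higher-scoring candidate wins the
# forced-choice comparison. Assumes Hugging Face transformers + PyTorch.
import torch
from transformers import AlbertTokenizer, AlbertForMaskedLM

tokenizer = AlbertTokenizer.from_pretrained("albert-xxlarge-v2")
model = AlbertForMaskedLM.from_pretrained("albert-xxlarge-v2")
model.eval()

def pseudo_log_likelihood(sentence: str) -> float:
    """Sum of log P(token_i | all other tokens), one mask per position."""
    ids = tokenizer(sentence, return_tensors="pt")["input_ids"][0]
    total = 0.0
    # Skip the special tokens: [CLS] at position 0, [SEP] at the end.
    for i in range(1, len(ids) - 1):
        masked = ids.clone()
        masked[i] = tokenizer.mask_token_id
        with torch.no_grad():
            logits = model(masked.unsqueeze(0)).logits[0, i]
        log_probs = torch.log_softmax(logits, dim=-1)
        total += log_probs[ids[i]].item()
    return total

# Hypothetical forced-choice example: score each substitution alternative
# and choose the one the masked LM finds more probable.
candidates = [
    "The trophy didn't fit in the suitcase because the trophy was too big.",
    "The trophy didn't fit in the suitcase because the suitcase was too big.",
]
scores = [pseudo_log_likelihood(c) for c in candidates]
print(max(zip(scores, candidates)))
```

Note that a sentence of n tokens costs n forward passes through the model, which is consistent with the high GPU-time cost the abstract identifies.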
One-sentence Summary: A recent method for scoring sentences with language models shows SOTA performance on a number of benchmarks with an ALBERT variant.
Community Implementations: [3 code implementations](https://www.catalyzex.com/paper/arxiv:2201.09377/code)