RetriEVAL: Evaluating Text Generation with Contextualized Lexical Match

Published: 2025 · Last Modified: 20 Jan 2026 · WSDM 2025 · CC BY-SA 4.0
Abstract: Pre-trained language models have made significant advancements in text generation tasks. Nevertheless, evaluating the generated text with automatic metrics remains challenging. Compared with supervised metrics, unsupervised metrics, which are known for their generality and robustness, are frequently employed to assess the quality of generated text efficiently. The representative unsupervised metric BERTScore uses pre-trained embeddings to compute word-to-word similarities across all tokens as evaluation scores, which can introduce noise because it includes tokens that contribute little to the semantics of the text. Furthermore, its heavy reliance on dense embeddings may reduce accuracy when evaluating text outside the common contexts represented in the training data, making it less effective at handling uncommon linguistic patterns. Additionally, BERTScore treats all tokens with equal importance and lacks the ability to perform meaningful contextual expansion, which can result in less accurate similarity measurements, particularly when dealing with paraphrased or semantically rich text. To address these problems, we propose an unsupervised automatic evaluation metric inspired by the concept of lexical match in information retrieval. Our method leverages contextualized lexical matching to measure exact matches between identical tokens and dynamically match differing tokens based on their contextualized representations. Experiments on SummEval and Topical-Chat demonstrate that our proposed RetriEVAL correlates better with human judgments than previous unsupervised metrics.
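The abstract describes the core idea only at a conceptual level, so the following is a minimal sketch of how a contextualized lexical-match score could combine exact token matches with embedding-based soft matches. The function name, recall-style aggregation, and toy random embeddings are illustrative assumptions for exposition, not the paper's actual formulation or implementation.

```python
import numpy as np

def contextual_lexical_match(cand_tokens, ref_tokens, cand_emb, ref_emb):
    """Hypothetical recall-style score: identical tokens count as exact lexical
    matches (similarity 1.0); all other reference tokens fall back to the best
    cosine similarity against the candidate's contextual embeddings."""
    # Normalize embeddings so dot products equal cosine similarities.
    cand_emb = cand_emb / np.linalg.norm(cand_emb, axis=1, keepdims=True)
    ref_emb = ref_emb / np.linalg.norm(ref_emb, axis=1, keepdims=True)
    sims = ref_emb @ cand_emb.T  # (|ref|, |cand|) similarity matrix

    scores = []
    for i, ref_tok in enumerate(ref_tokens):
        if ref_tok in cand_tokens:            # exact lexical match
            scores.append(1.0)
        else:                                 # soft match via contextual embeddings
            scores.append(float(sims[i].max()))
    return float(np.mean(scores))

# Toy usage: random vectors stand in for embeddings from a pre-trained encoder.
rng = np.random.default_rng(0)
ref = ["the", "cat", "sat"]
cand = ["a", "cat", "rested"]
score = contextual_lexical_match(cand, ref,
                                 rng.normal(size=(3, 8)),
                                 rng.normal(size=(3, 8)))
print(round(score, 3))
```

In this sketch, exact matches bypass the embedding similarity entirely, which is one plausible reading of "measure exact matches between identical tokens"; in practice, a real encoder such as BERT would supply the contextual embeddings, and the paper's precise matching and weighting scheme may differ.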