LiSAScore: Exploring Linear Sum Assignment on BertScore

Published: 01 Jan 2024 · Last Modified: 13 Nov 2024 · NLDB (2) 2024 · CC BY-SA 4.0
Abstract: Metrics play a crucial role in evaluating the performance of machine learning models. For Natural Language Processing (NLP) tasks such as text summarization and machine translation, Natural Language Generation (NLG) metrics such as Bleu and Rouge have been widely used. However, these metrics are based on n-gram matching and do not capture the semantic similarity between the generated and reference texts. To address this, BertScore has emerged as a popular evaluation metric that uses a pre-trained Large Language Model (LLM) to measure semantic similarity between two sentences. Unlike n-gram-based metrics, BertScore uses the contextual and semantic embeddings of words, allowing flexible semantic evaluation. We outline a number of hypothetical scenarios in which BertScore's dependence on token-embedding cosine similarity may be exploited. Because BertScore is distributed comparatively over a set of reference-prediction pairs, its results often scale differently with training than those of traditional metrics, which requires more expertise when interpreting results.
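The title contrasts BertScore's usual greedy token matching with a linear sum assignment. A minimal sketch of that difference, assuming a made-up cosine-similarity matrix rather than real BERT embeddings, and using SciPy's standard `linear_sum_assignment` solver:

```python
import numpy as np
from scipy.optimize import linear_sum_assignment

# Hypothetical cosine similarities between reference tokens (rows)
# and candidate tokens (columns); values are illustrative only.
sim = np.array([
    [0.9, 0.2, 0.1],
    [0.8, 0.7, 0.3],
    [0.1, 0.6, 0.95],
])

# BertScore-style greedy recall: each reference token takes its best
# match, so a single candidate token may be matched more than once.
greedy_recall = sim.max(axis=1).mean()

# Linear-sum-assignment alternative: a one-to-one matching that
# maximizes total similarity, so no candidate token is counted twice.
rows, cols = linear_sum_assignment(sim, maximize=True)
lsa_recall = sim[rows, cols].mean()

print(greedy_recall, lsa_recall)
```

Here the greedy score reuses the first candidate token for both of the first two reference tokens, while the assignment-based score forces distinct matches and is therefore never larger than the greedy one.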
