Abstract: Metrics play a crucial role in evaluating the performance of machine learning models. In Natural Language Processing (NLP) tasks such as text summarization and machine translation, Natural Language Generation (NLG) metrics such as Bleu and Rouge have been widely used. However, these metrics are based on n-gram matching and do not capture the semantic similarity between the generated and reference texts. To address this, BertScore has emerged as a popular evaluation metric that uses a pre-trained Large Language Model (LLM) to measure the semantic similarity between two sentences. Unlike n-gram-based metrics, BertScore compares the contextual embeddings of tokens, allowing a more flexible semantic evaluation. We outline a number of hypothetical cases in which the dependence of BertScore on token-embedding cosine similarity may be exploited. The distribution of BertScores over a set of reference-prediction pairs also means that results often scale differently with training than traditional metrics do, which demands more expertise when interpreting results.
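To make the dependence on token-embedding cosine similarity concrete, the following is a minimal sketch of greedy-matching precision, recall, and F1 in the style of BertScore. It is an illustration under stated assumptions, not the reference implementation: the function names `cosine` and `greedy_match_score` are hypothetical, the 2-dimensional vectors are toy stand-ins for contextual embeddings from a pre-trained model, and optional steps such as importance weighting and baseline rescaling are omitted.

```python
import math


def cosine(u, v):
    """Cosine similarity between two non-zero vectors given as lists of floats."""
    dot = sum(a * b for a, b in zip(u, v))
    norm_u = math.sqrt(sum(a * a for a in u))
    norm_v = math.sqrt(sum(b * b for b in v))
    return dot / (norm_u * norm_v)


def greedy_match_score(ref_emb, cand_emb):
    """Greedy-matching precision, recall, and F1 over token embeddings.

    ref_emb and cand_emb are lists of token vectors (hypothetical inputs;
    a real pipeline would obtain them from a pre-trained model).
    """
    # Pairwise similarity matrix: rows are reference tokens, columns are candidate tokens.
    sim = [[cosine(r, c) for c in cand_emb] for r in ref_emb]

    # Recall: each reference token is matched to its most similar candidate token.
    recall = sum(max(row) for row in sim) / len(ref_emb)

    # Precision: each candidate token is matched to its most similar reference token.
    precision = sum(
        max(sim[i][j] for i in range(len(ref_emb)))
        for j in range(len(cand_emb))
    ) / len(cand_emb)

    # Harmonic mean, guarded against a zero denominator.
    total = precision + recall
    f1 = 2 * precision * recall / total if total != 0 else 0.0
    return precision, recall, f1


# Toy example: the candidate repeats one token that matches the first reference token.
ref = [[1.0, 0.0], [0.0, 1.0]]
cand = [[1.0, 0.0], [1.0, 0.0]]
p, r, f = greedy_match_score(ref, cand)
print(p, r, f)
```

In this toy case every candidate token has a perfect match in the reference, so precision is 1.0 even though the candidate covers only half of the reference (recall 0.5). This is one simple way in which scores built on per-token maximum similarity can be inflated by repetition.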