Abstract: Standard evaluation of automated text summarization (ATS) methods relies on manually crafted golden summaries. With the advances in Large Language Models (LLMs), it is legitimate to ask whether these models can now complement or replace human-crafted summaries. This study examines the effectiveness of several language models (LMs), focusing specifically on the issue of preserving factual consistency. Through a thorough assessment of conventional and state-of-the-art performance metrics, such as ROUGE, BLEU, BERTScore, FActScore, and LongDocFACTScore, across diverse datasets, our findings highlight the important relationship between linguistic eloquence and factual accuracy. The results suggest that while LLMs such as GPT and LLaMA demonstrate considerable competence in producing concise and contextually aware summaries, difficulties remain in ensuring factual accuracy, particularly in domain-specific settings. Moreover, this work extends existing knowledge of summarization dynamics and highlights the need to develop more reliable and tailored evaluation techniques that minimize the probability of factual errors in ATS-generated text. In particular, the findings advance the field by providing a rigorous assessment of the balance between linguistic fluency and factual correctness, highlighting the limitations of current ATS frameworks and metrics and the need to enhance the factual reliability of LM-generated summaries.