Evaluation Under Imperfect Benchmarks and Ratings: A Case Study in Text Simplification

Published: 24 Sept 2025, Last Modified: 24 Sept 2025
Venue: NeurIPS 2025 LLM Evaluation Workshop (Poster)
License: CC BY 4.0
Keywords: LLM evaluation, text simplification, synthetic benchmarks, human evaluation, inter-annotator agreement, LLMs-as-a-jury
Abstract: Static benchmarks, or fixed datasets created once and applied repeatedly, are still the default choice for evaluating language models, despite two major challenges. First, static benchmarks rarely reflect evolving model capabilities, often containing outdated examples that are too easy, disfluent, or incoherent. Second, the human ratings associated with these benchmarks often exhibit a high degree of disagreement, yielding inconsistent ratings with which automatic metrics are nevertheless expected to correlate. This hurts evaluation reliability and can break expected trends (e.g., more powerful models being assigned higher scores). We address these challenges, using the task of text simplification as a case study, through two contributions. First, we introduce SynthSimpliEval, a static synthetic benchmark for text simplification featuring simplified sentences generated by models of varying sizes. Through a pilot study, we show that human ratings on our benchmark exhibit high inter-annotator agreement and reflect the expected trend: larger models produce higher-quality simplifications. Second, we show that auto-evaluation with a panel of LLM judges (LLMs-as-a-Jury) often suffices to obtain consistent ratings for evaluating text simplification. Overall, our case study shows that reliable evaluation requires higher-quality test data in a static benchmark, which can be obtained through careful collection of synthetic data and LLMs-as-a-Jury ratings.
Submission Number: 189
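The LLMs-as-a-Jury auto-evaluation described in the abstract can be illustrated with a minimal sketch: each judge model independently rates a simplification, and the panel's ratings are aggregated into a single score. The judge stubs, prompt-free interface, and 1-5 Likert scale below are illustrative assumptions, not the paper's exact setup; in practice each judge would wrap a call to a separate LLM.

```python
from statistics import mean, stdev
from typing import Callable, Dict, List

# A "judge" is any callable mapping (source, simplification) to a 1-5 rating.
# Real judges would wrap LLM API calls; here they are stubbed so the sketch
# runs standalone. The 1-5 Likert scale is an assumption for illustration.
Judge = Callable[[str, str], int]


def jury_rating(source: str, simplification: str,
                judges: Dict[str, Judge]) -> Dict[str, float]:
    """Collect one rating per judge and aggregate them into a panel score."""
    ratings: List[int] = [judge(source, simplification) for judge in judges.values()]
    return {
        "mean": mean(ratings),                                   # panel score
        "spread": stdev(ratings) if len(ratings) > 1 else 0.0,   # rough agreement proxy
    }


if __name__ == "__main__":
    # Hypothetical judges standing in for different LLM judge models.
    judges: Dict[str, Judge] = {
        "judge_a": lambda src, simp: 4,
        "judge_b": lambda src, simp: 5,
        "judge_c": lambda src, simp: 4,
    }
    source = "The committee deliberated at length before reaching a verdict."
    simplification = "The committee talked for a long time before deciding."
    print(jury_rating(source, simplification, judges))
```

Averaging across judges is one common aggregation choice; the per-example spread gives a rough consistency check analogous to inter-annotator agreement among human raters.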