Benchmarking Large Language Model Benchmarks: Popular Benchmarks vs. Human Perception

11 Sept 2025 (modified: 08 Jan 2026) · ICLR 2026 Conference Withdrawn Submission · CC BY 4.0
Keywords: LMArena, Ranking, human perception
Abstract: Benchmarks play a critical role as a measure of large language model (LLM) capabilities. However, whether LLM performance on benchmarks reflects real-world performance, and in particular human perception of model outputs, remains questionable. This study focuses specifically on whether LLM performance on benchmarks aligns with human perception. It investigates this gap by quantifying the similarity between LLM rankings derived from benchmarks and LLM rankings generated from human votes on the prominent LMArena platform, systematically comparing benchmark rankings against rankings in the corresponding task-specific LMArena categories for over 100 top-tier LLMs. The findings reveal that LLM performance on several popular benchmarks shows low similarity with human perception, even when the benchmarks are recent or challenging. These results highlight limitations in current benchmarking practices and underscore the need for evaluation frameworks that more accurately reflect human perception and the real-world performance of LLMs.
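The abstract describes comparing model rankings from benchmarks against rankings from LMArena human votes. A minimal sketch of such a ranking-similarity computation is shown below; the choice of metric (Kendall's tau), the toy rankings, and the variable names are assumptions for illustration only, since the abstract does not specify which similarity measure the study uses.

```python
# Sketch: compare a benchmark-derived ranking with an LMArena-derived ranking.
# Metric choice and data are illustrative assumptions, not the paper's method.
from scipy.stats import kendalltau

# Hypothetical rankings: model name -> rank position (1 = best) on a benchmark
# and on the matching LMArena task-specific category.
benchmark_ranks = {"model_a": 1, "model_b": 2, "model_c": 3, "model_d": 4}
arena_ranks     = {"model_a": 2, "model_b": 1, "model_c": 4, "model_d": 3}

# Restrict to models present in both rankings, then align the rank lists.
shared = sorted(set(benchmark_ranks) & set(arena_ranks))
x = [benchmark_ranks[m] for m in shared]
y = [arena_ranks[m] for m in shared]

# Kendall's tau is 1.0 for identical orderings and -1.0 for reversed ones;
# a low value would indicate a benchmark diverging from human perception.
tau, p_value = kendalltau(x, y)
print(f"Kendall tau = {tau:.3f} (p = {p_value:.3f}) over {len(shared)} models")
```

In practice, such a comparison would be repeated per benchmark against the corresponding LMArena category, restricted to the models covered by both.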
Supplementary Material: zip
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 4116