Position: Evaluations Should Acknowledge Model Multifacetedness in the Era of Large Language Models

02 May 2025 (modified: 29 Oct 2025) · Submitted to NeurIPS 2025 Position Paper Track · CC BY 4.0
Keywords: Large Language Model
TL;DR: This paper argues that a fundamental reconceptualization of the "model" itself is necessary to improve the quality of LLM evaluation.
Abstract: The rapid evolution of Artificial Intelligence (AI), particularly Large Language Models (LLMs), marks a significant departure from earlier machine learning (ML) paradigms. This advancement has exposed critical misconceptions in our understanding of the "model" itself, misconceptions that are especially evident in evaluation methodologies that rely on narrow observational windows to assess overall model quality. This paper argues that a fundamental reconceptualization of the "model" itself is necessary to address this evaluative crisis. We introduce a five-tiered hierarchical framework that divides models into Noumenal, Conceptual, Instantiated, Reachable, and Observable models. Using this framework, we examine the historical development of how models have been conceptualized and evaluated within the ML field, analyzing the roles of experiments, ablation studies, and datasets. The paper further argues that the current development of LLMs fundamentally challenges these long-standing evaluation patterns, as existing benchmarks and metrics increasingly fail to capture the true capabilities and limitations of these complex models. Our primary contribution is to consolidate and structure many of these historical insights and evolving challenges. By organizing these often fragmented pieces of understanding into the proposed five-tiered hierarchical framework, we aim to offer a more cohesive and systematic lens for approaching AI model evaluation. We believe that such a structured approach, in which assessment strategies are explicitly contextualized by a model's position within this hierarchy and informed by its preceding layer, can help cultivate a more robust and meaningful comprehension of these increasingly complex LLM systems.
Submission Number: 50