Keywords: large language models, benchmarks, documentation, standardization, evaluation, transparency
TL;DR: BenchAdvisor helps practitioners find and compare LLM benchmarks suited to their needs by standardizing fragmented metadata from multiple sources into a single, explainable interface.
Abstract: As large language models (LLMs) are deployed in increasingly diverse settings, benchmarks are essential for evaluating their capabilities and comparing their performance. However, the proliferation of benchmarks creates a significant challenge for users, who must discover which benchmarks exist and compare their strengths and weaknesses without any standardized way to do so. Thousands of benchmarks now exist, each presenting its metadata, such as the tasks it covers, the metrics it uses, and its documentation, in a different format. This fragmentation of metadata imposes a considerable cognitive burden on practitioners, who must navigate multiple sources with inconsistent representations to find benchmarks appropriate for their LLM evaluation processes. Although recent efforts have improved individual benchmark documentation and quality assessment, they do not address a fundamental problem: the lack of a uniform method for viewing and comparing benchmarks when making selection decisions. Through a survey of practitioners and an analysis of benchmark metadata, we characterize the information that must be unified for effective benchmark discovery. We identify the critical metadata that practitioners require, document the cognitive costs of current fragmented approaches, and reveal systematic inconsistencies in how benchmarks are presented across sources. To address these challenges, we develop BenchAdvisor, a system that provides a unified interface for benchmark discovery by normalizing heterogeneous metadata into a coherent representation. Our prototype demonstrates that practitioner-identified requirements can be operationalized in a functional system designed to reduce the cognitive burden of benchmark selection. This work provides the first empirical foundation for harmonizing benchmark metadata presentation and establishes design requirements for unified discovery tools.
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Type: Research Paper
Archival Status: Archival
Submission Number: 63