BenchHub: A Unified Benchmark Suite for Holistic and Customizable LLM Evaluation

Published: 16 Oct 2025, Last Modified: 10 Nov 2025, NeurIPS 2025 ER Workshop, CC BY 4.0
Keywords: Benchmark Suite, Evaluation
Abstract: As large language models (LLMs) with advanced reasoning abilities continue to evolve, their capabilities are increasingly tested across heterogeneous contexts. To evaluate them effectively, benchmarks must move beyond fragmented datasets and narrow rankings, addressing the growing need to capture abilities that integrate multiple skills (e.g., reasoning and knowledge) across diverse domains (e.g., mathematics and culture). This complexity calls for a new paradigm of evaluation: flexible, domain-aware, and continuously updated. In this paper, we introduce BenchHub, a dynamic benchmark repository that empowers researchers and developers to evaluate LLMs effectively, with a focus on Korean and English. BenchHub aggregates and automatically classifies benchmark datasets from diverse domains, integrating 839k questions across 54 benchmarks. It is designed to support continuous updates and scalable data management, enabling flexible and customizable evaluation tailored to various domains or use cases. Through extensive experiments with various LLM families, we demonstrate that model performance varies significantly across domain-specific subsets, emphasizing the importance of domain-aware benchmarking. Furthermore, we extend BenchHub to 10 languages spanning a range of resource levels. We believe BenchHub can encourage better dataset reuse, more transparent model comparisons, and easier identification of underrepresented areas in existing benchmarks, offering critical infrastructure for advancing LLM evaluation research.
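The domain-aware subset evaluation the abstract describes can be illustrated with a minimal sketch. This is not BenchHub's actual API or schema; the field names (`domain`, `skill`, `correct`) and the pooled-question layout are assumptions made for illustration, showing how a classified question pool could be filtered into a custom subset and scored per domain.

```python
from collections import defaultdict

# Hypothetical pooled benchmark: each graded question carries domain/skill
# tags, mirroring the idea of auto-classified, filterable questions.
# Field names and values are illustrative, not BenchHub's real schema.
POOL = [
    {"benchmark": "math-qa", "domain": "mathematics", "skill": "reasoning", "correct": True},
    {"benchmark": "math-qa", "domain": "mathematics", "skill": "reasoning", "correct": False},
    {"benchmark": "culture-qa", "domain": "culture", "skill": "knowledge", "correct": True},
    {"benchmark": "culture-qa", "domain": "culture", "skill": "knowledge", "correct": True},
]

def select_subset(pool, domain=None, skill=None):
    """Return the questions matching the requested domain/skill filters."""
    return [
        q for q in pool
        if (domain is None or q["domain"] == domain)
        and (skill is None or q["skill"] == skill)
    ]

def accuracy_by_domain(pool):
    """Aggregate per-domain accuracy from graded results."""
    totals, hits = defaultdict(int), defaultdict(int)
    for q in pool:
        totals[q["domain"]] += 1
        hits[q["domain"]] += int(q["correct"])
    return {d: hits[d] / totals[d] for d in totals}

math_subset = select_subset(POOL, domain="mathematics")
print(len(math_subset))          # 2
print(accuracy_by_domain(POOL))  # {'mathematics': 0.5, 'culture': 1.0}
```

Reporting accuracy per domain rather than as a single pooled score is what surfaces the performance variation across domain-specific subsets that the paper emphasizes.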
Submission Number: 278