Train-before-Test Harmonizes Language Model Rankings

Published: 24 Sept 2025 · Last Modified: 24 Sept 2025 · NeurIPS 2025 LLM Evaluation Workshop Poster · CC BY 4.0
Keywords: Evaluation, Large language model
Abstract: Existing language model benchmarks provide contradictory model rankings, even for benchmarks capturing similar skills. This hampers model selection and adds confusion to the growing ecosystem of competing models. We propose a fundamental shift in evaluation methodology: rather than measuring out-of-the-box performance, we assess model potential, the performance achievable after task-specific fine-tuning. Our *train-before-test* approach gives each model identical benchmark-specific fine-tuning prior to evaluation. Our primary contribution is a comprehensive empirical evaluation of model potential across 24 benchmarks and 61 models. First, we demonstrate that model potential rankings obtained with train-before-test are remarkably consistent across all benchmarks. Whereas rankings from direct evaluation show little external validity, rankings under train-before-test enjoy significant external validity: model potential rankings transfer gracefully between benchmarks. Second, train-before-test restores the connection between perplexity and downstream task performance. For base models, even pre-fine-tuning perplexity predicts post-fine-tuning downstream performance, suggesting that the ranking consistency reflects inherent model potential rather than fine-tuning artifacts. Finally, train-before-test reduces the model-score matrix to essentially rank one, indicating that model potential is dominated by a single latent factor.
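The cross-benchmark consistency and rank-one claims can be illustrated with a small numerical sketch. The snippet below is not code from the paper; the synthetic scores and the noise level are assumptions made purely for illustration. It builds a near-rank-one model-by-benchmark score matrix of the same shape as the study (61 models, 24 benchmarks) and computes two diagnostics: the mean pairwise Spearman correlation of model rankings across benchmarks, and the share of variance captured by the top singular component.

```python
import numpy as np
from scipy.stats import spearmanr

rng = np.random.default_rng(0)

# Synthetic model-by-benchmark score matrix (rows: models, columns: benchmarks).
# A single latent "model potential" factor scaled per benchmark, plus small noise,
# mimicking the near-rank-one structure reported under train-before-test.
n_models, n_benchmarks = 61, 24
potential = rng.uniform(0.3, 0.9, size=(n_models, 1))        # latent per-model factor
benchmark_scale = rng.uniform(0.6, 1.2, size=(1, n_benchmarks))  # per-benchmark scaling
scores = potential @ benchmark_scale + 0.02 * rng.normal(size=(n_models, n_benchmarks))

# Diagnostic 1: ranking consistency, measured as the average pairwise Spearman
# correlation of model rankings across all benchmark pairs.
rhos = []
for i in range(n_benchmarks):
    for j in range(i + 1, n_benchmarks):
        rho, _ = spearmanr(scores[:, i], scores[:, j])
        rhos.append(rho)
print(f"mean pairwise Spearman rho: {np.mean(rhos):.3f}")

# Diagnostic 2: effective rank, measured as the fraction of total variance
# captured by the leading singular value of the score matrix.
singular_values = np.linalg.svd(scores, compute_uv=False)
top_share = singular_values[0] ** 2 / np.sum(singular_values ** 2)
print(f"variance explained by the first singular component: {top_share:.3f}")
```

High values of both diagnostics correspond to the consistency and rank-one structure the abstract reports for train-before-test; the sketch only shows how such diagnostics could be computed, not the paper's actual measurements.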
Submission Number: 50