Keywords: Evaluation, Large language model
Abstract: Existing language model benchmarks provide contradictory model rankings, even for benchmarks capturing similar skills.
This hampers model selection and adds confusion to the growing ecosystem of competing models.
We propose a fundamental shift in evaluation methodology: rather than measuring out-of-the-box performance, we assess model potential---achievable performance after task-specific fine-tuning.
Our *train-before-test* approach provides each model with identical benchmark-specific fine-tuning prior to evaluation.
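As a minimal sketch of this protocol (the helper functions `fine_tune` and `evaluate` are hypothetical stand-ins for whatever benchmark-specific training and scoring pipeline is used, not the paper's exact recipe), train-before-test scores every model only after giving it the same fine-tuning treatment:

```python
# Sketch of the train-before-test protocol: every model gets the
# identical benchmark-specific fine-tuning before being scored.
# `fine_tune` and `evaluate` are assumed helpers, not a real API.

def train_before_test(models, benchmark):
    """Rank models by potential: score each one after identical fine-tuning."""
    scores = {}
    for name, model in models.items():
        tuned = fine_tune(model, benchmark.train_split)   # same recipe for every model
        scores[name] = evaluate(tuned, benchmark.test_split)
    # Higher post-fine-tuning score = higher estimated potential on this benchmark.
    return sorted(scores.items(), key=lambda kv: kv[1], reverse=True)
```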
Our primary contribution is a comprehensive empirical evaluation of model potential across 24 benchmarks and 61 models.
First, we demonstrate that model potential rankings through train-before-test exhibit remarkable consistency across all benchmarks.
While rankings obtained under direct evaluation show little external validity, rankings obtained with train-before-test show substantial external validity: model potential rankings transfer gracefully between benchmarks.
Second, train-before-test restores the connection between perplexity and downstream task performance.
For base models, even pre-fine-tuning perplexity predicts post-fine-tuning downstream performance, suggesting ranking consistency reflects inherent model potential rather than fine-tuning artifacts.
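One way to quantify this connection (a sketch under the assumption that `perplexity` and `post_ft_score` are per-model arrays; the paper's exact analysis may differ) is a rank correlation between pre-fine-tuning perplexity and post-fine-tuning benchmark score:

```python
import numpy as np
from scipy.stats import spearmanr

def perplexity_predicts_potential(perplexity, post_ft_score):
    """Spearman rank correlation between per-model pre-fine-tuning
    perplexity and post-fine-tuning downstream score.

    A strongly negative rho (lower perplexity pairs with higher score)
    indicates that perplexity alone already predicts model potential.
    """
    rho, p_value = spearmanr(np.asarray(perplexity), np.asarray(post_ft_score))
    return rho, p_value
```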
Finally, train-before-test reduces the model-score matrix to essentially rank one, indicating model potential is dominated by one latent factor.
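A simple way to check such a rank-one claim on any models-by-benchmarks score matrix (a numpy sketch, not necessarily the paper's exact analysis) is to measure how much of the matrix's energy the leading singular value captures:

```python
import numpy as np

def top_singular_value_share(score_matrix):
    """Fraction of the squared Frobenius norm explained by the leading
    singular value of a (models x benchmarks) score matrix.

    A value near 1.0 means the matrix is essentially rank one, i.e. a
    single latent factor orders the models consistently across benchmarks.
    """
    s = np.linalg.svd(np.asarray(score_matrix), compute_uv=False)
    return s[0] ** 2 / np.sum(s ** 2)
```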
Submission Number: 50