When Scale is Fixed: Revisiting Pre-training Indicators for LLM Fine-tuning Performance

ACL ARR 2025 July Submission 284 Authors

26 Jul 2025 (modified: 20 Aug 2025) · ACL ARR 2025 July Submission · CC BY 4.0
Abstract: While scaling laws tell us that metrics like perplexity reliably indicate how a model performs as it grows, we still do not fully understand how predictive such metrics are at a fixed model size. This gap makes it hard to run effective ablation studies on smaller models, for example when comparing pre-training objectives. Since a primary application of pre-trained models is supervised fine-tuning (SFT) on specific data or tasks, ablation studies should connect post-SFT performance back to the initial pre-training choices; doing so enables more effective pre-training research. To study this problem, we first construct a dataset of 50 1B-parameter LLM variants with systematically varied pre-training configurations (e.g., objectives or data) and evaluate them on diverse downstream tasks after SFT. We show that conventional perplexity is a highly misleading indicator in this setting. To address this gap, we formulate the selection of pre-training checkpoints to maximize downstream fine-tuning performance as a pairwise classification problem: predicting which of two LLMs that differ in their pre-training will perform better after SFT. We introduce novel unsupervised and supervised proxy metrics derived from pre-training that reduce the relative performance prediction error rate by over 50% compared with existing methods. Despite the inherent difficulty of this task, we demonstrate the practical utility of the proposed proxies in specific scenarios, paving the way for more efficient design of pre-training schemes optimized for various downstream tasks.
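For concreteness, the pairwise framing in the abstract can be sketched in a few lines. The snippet below is a minimal illustration, not the authors' implementation: it assumes a hypothetical per-checkpoint proxy score (e.g., perplexity) and a measured post-SFT score, and computes the pairwise prediction error rate over all checkpoint pairs. All names here (`Checkpoint`, `proxy`, `sft_score`, the run labels) are placeholders invented for this sketch.

```python
from dataclasses import dataclass
from itertools import combinations

@dataclass
class Checkpoint:
    name: str
    proxy: float      # proxy metric computed during pre-training (e.g., perplexity)
    sft_score: float  # downstream accuracy measured after SFT

def pairwise_error_rate(checkpoints, higher_proxy_is_better=True):
    """Fraction of checkpoint pairs where the proxy predicts the wrong winner."""
    errors, total = 0, 0
    for a, b in combinations(checkpoints, 2):
        if a.proxy == b.proxy or a.sft_score == b.sft_score:
            continue  # ties carry no ordering to predict
        predicted_a_wins = (a.proxy > b.proxy) == higher_proxy_is_better
        actual_a_wins = a.sft_score > b.sft_score
        errors += int(predicted_a_wins != actual_a_wins)
        total += 1
    return errors / total if total else 0.0

# Toy usage: for perplexity, lower is better, so higher_proxy_is_better=False.
ckpts = [
    Checkpoint("objective-A", proxy=12.3, sft_score=0.61),
    Checkpoint("objective-B", proxy=11.8, sft_score=0.58),
    Checkpoint("objective-C", proxy=13.1, sft_score=0.64),
]
print(pairwise_error_rate(ckpts, higher_proxy_is_better=False))  # 1.0 on this toy data
```

With these toy numbers the proxy (lower perplexity) picks the wrong winner in every pair, mirroring the abstract's claim that perplexity can mislead at a fixed model size.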
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: pre-training, fine-tuning, scaling
Languages Studied: English
Reassignment Request Area Chair: This is not a resubmission
Reassignment Request Reviewers: This is not a resubmission
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: No
A2 Elaboration: We analyze how different LLM pre-training configurations affect downstream SFT tasks. We do not see any specific risk arising from this analysis.
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: Yes — see Section 2 and Section 6
B2 Discuss The License For Artifacts: No
B2 Elaboration: No — we used publicly available datasets and model objectives (e.g., SlimPajama, T5) but did not explicitly discuss their licenses.
B3 Artifact Use Consistent With Intended Use: No
B3 Elaboration: No — we did not explicitly discuss usage consistency or specify intended use for derived artifacts, though all artifacts were used strictly for research purposes.
B4 Data Contains Personally Identifying Info Or Offensive Content: No
B4 Elaboration: No — we used SlimPajama, a publicly released and filtered dataset, but did not perform additional checks for PII or offensive content.
B5 Documentation Of Artifacts: No
B5 Elaboration: No — while we mention domain coverage and tagging strategies (Section 2.1, Appendix A), we do not provide detailed documentation on language, linguistic phenomena, or demographics.
B6 Statistics For Data: Yes
B6 Elaboration: Section 2.1
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: 1B parameters.
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: Section 2.1
C3 Descriptive Statistics: Yes
C3 Elaboration: Section 4.2
C4 Parameters For Packages: No
C4 Elaboration: No — we did not use or report specific third-party packages for preprocessing or evaluation beyond standard SFT accuracy metrics.
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D1 Elaboration: N/A — no human participants or annotators were involved in this study.
D2 Recruitment And Payment: N/A
D2 Elaboration: N/A — no participants were recruited or paid in this study.
D3 Data Consent: No
D3 Elaboration: No — we used publicly available datasets and did not collect new data from individuals, so consent was not explicitly discussed.
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
D5 Elaboration: N/A — no new data was collected from human subjects; we only used existing public datasets.
E Ai Assistants In Research Or Writing: No
E1 Information About Use Of Ai Assistants: N/A
Author Submission Checklist: yes
Submission Number: 284