Assessing the Macro and Micro Effects of Random Seeds on Fine-Tuning Large Language Models

ACL ARR 2025 July Submission 356 Authors

27 Jul 2025 (modified: 04 Sept 2025); ACL ARR 2025 July Submission; CC BY 4.0
Abstract: The impact of random seeds on fine-tuning large language models (LLMs) has been largely overlooked despite their potential influence on model performance. In this study, we systematically evaluate the effects of random seeds on LLMs using the GLUE and SuperGLUE benchmarks. We analyze the macro impact through traditional metrics such as accuracy and F1, calculating their mean and variance across seeds to quantify performance fluctuations. To capture the micro effects, we introduce a novel metric, consistency, which measures the stability of individual predictions across runs. Our experiments reveal significant variance at both the macro and micro levels, underscoring the need for careful consideration of random seeds in fine-tuning and evaluation.
Paper Type: Short
Research Area: Special Theme (conference specific)
Research Area Keywords: evaluation methodologies, evaluation, generalization, metrics, reproducibility, robustness
Contribution Types: Reproduction study
Languages Studied: English
Previous URL: https://openreview.net/forum?id=hWYXNwJV5Y
Explanation Of Revisions PDF: pdf
Reassignment Request Area Chair: Yes, I want a different area chair for our submission
Reassignment Request Reviewers: No, I want the same set of reviewers from our previous submission (subject to their availability)
Justification For Not Keeping Action Editor Or Reviewers: We respectfully request a change of area chair, as the feedback provided in the previous cycle appeared to pertain to a different submission, not ours.
A1 Limitations Section: This paper has a limitations section.
A2 Potential Risks: N/A
B Use Or Create Scientific Artifacts: Yes
B1 Cite Creators Of Artifacts: Yes
B1 Elaboration: Section 4.1, Appendix A.4
B2 Discuss The License For Artifacts: Yes
B2 Elaboration: Section 4.1, Appendix A.4
B3 Artifact Use Consistent With Intended Use: Yes
B3 Elaboration: Section 4.1, 4.2, Appendix A.3, A.4
B4 Data Contains Personally Identifying Info Or Offensive Content: No
B4 Elaboration: We use standard benchmark datasets that are publicly available and de-identified.
B5 Documentation Of Artifacts: N/A
B6 Statistics For Data: Yes
B6 Elaboration: Appendix A.2
C Computational Experiments: Yes
C1 Model Size And Budget: Yes
C1 Elaboration: Section 4, Appendix A.2, A.3, A.4
C2 Experimental Setup And Hyperparameters: Yes
C2 Elaboration: Section 4, Appendix A.2, A.3, A.4
C3 Descriptive Statistics: Yes
C3 Elaboration: Section 5, Appendix A.5
C4 Parameters For Packages: Yes
C4 Elaboration: Section 4.2
D Human Subjects Including Annotators: No
D1 Instructions Given To Participants: N/A
D2 Recruitment And Payment: N/A
D3 Data Consent: N/A
D4 Ethics Review Board Approval: N/A
D5 Characteristics Of Annotators: N/A
E Ai Assistants In Research Or Writing: Yes
E1 Information About Use Of Ai Assistants: No
E1 Elaboration: We used AI only to refine sentences during writing, improving clarity and flow. No AI was involved in the research, the coding, or the storytelling.
Author Submission Checklist: yes
Submission Number: 356
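
The macro/micro distinction in the abstract can be illustrated with a small sketch. The abstract does not give the formula for consistency, so the pairwise-agreement definition below is an assumption made for illustration and may differ from the paper's metric. The function names (`accuracy`, `pairwise_consistency`) and the toy predictions are hypothetical and are not results from the submission.

```python
from itertools import combinations
from statistics import mean, pvariance


def accuracy(preds, labels):
    """Fraction of predictions that match the gold labels."""
    return sum(p == y for p, y in zip(preds, labels)) / len(labels)


def pairwise_consistency(runs):
    """Assumed definition: the fraction of examples on which two runs
    give the same prediction, averaged over all pairs of runs."""
    pairs = list(combinations(runs, 2))
    if not pairs:
        raise ValueError("need at least two runs to measure consistency")
    return mean(
        sum(a == b for a, b in zip(r1, r2)) / len(r1) for r1, r2 in pairs
    )


# Toy data: gold labels and predictions from three seeds (made up).
labels = [1, 0, 1, 1]
runs = [
    [1, 0, 1, 0],  # seed A
    [1, 0, 0, 1],  # seed B
    [1, 1, 1, 0],  # seed C
]

# Macro view: one aggregate score per seed, then mean and variance.
accs = [accuracy(r, labels) for r in runs]
macro_mean = mean(accs)
macro_var = pvariance(accs)

# Micro view: do the seeds agree on individual examples?
micro = pairwise_consistency(runs)

print(accs, macro_mean, macro_var, micro)
```

In this toy data, seeds A and B reach the same accuracy yet agree on only half of the examples, which is the kind of instability an aggregate score hides and a per-example measure exposes.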