Training Mice to Compete with Elephants: A Guide for Customizing Small-Sized LLMs on Knowledge and Skills Data
Keywords: Machine Learning, Generative Models, Large Language Models, Natural Language Processing, Transformers, Fine-Tuning, Instruction Tuning, Synthetic Data Generation, Knowledge Data, Skills Data, Model Generalization, Batch Size, Hyperparameter Optimization, Gradient Norm, MMLU, MTBench, Stacked Training, Phased Training, Compute Efficiency, Sample Efficiency, Flash Attention, Multipack Bucketing
TL;DR: We provide a guide for customizing small LLMs (3B-7B parameters) through instruction tuning on diverse knowledge and skills data, offering insights on training strategies, hyperparameters, and efficient tuning methods to improve performance.
Abstract: Customizing large language models (LLMs) is increasingly in demand from enterprises and individual developers. It allows LLMs to be tailored for domain expertise, aligned with organizational guidelines, and enhanced for user experience. Effective customization hinges on three core elements: a small-sized model, large-scale domain-specific datasets, and an effective training strategy that helps the model acquire relevant knowledge and skills from the data. In this paper, we focus on the third element through an in-depth study of fine-tuning LLMs (3B to 7B parameters) on large-scale instruction-tuning datasets spanning multiple knowledge domains and skills. We examine various training configurations and strategies on three pretrained LLMs. Our results call into question several common training practices, including the hyperparameter recommendations from TULU and the phased training recommended by Orca.
Key insights from our work include: (i) larger batch sizes paired with lower learning rates lead to improved model performance on benchmarks such as MMLU, MTBench, and the Open LLM Leaderboard; (ii) early-stage training dynamics, such as lower gradient norms and higher loss values, are strong indicators of better final model performance, allowing for early termination of sub-optimal runs and significant computational savings; (iii) skipping warmup and using a constant learning rate do not compromise performance; and (iv) stacked training outperforms phased training. These findings hold robustly across model families and sizes, and we hope this study serves as a comprehensive guide for practitioners fine-tuning small LLMs.
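To make insights (i) and (iii) concrete, the sketch below shows what such a configuration could look like with the HuggingFace Transformers Trainer, together with a small callback that surfaces early-stage loss and gradient norm in the spirit of insight (ii). This is a minimal illustration, not the paper's exact recipe: the model name, the my_instruction_dataset handle, and all hyperparameter values are placeholder assumptions, and the grad_norm field only appears in the logs of recent transformers versions.

```python
# Illustrative sketch: large effective batch size, low constant learning rate,
# no warmup, and early monitoring of loss / gradient norm.
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    Trainer,
    TrainerCallback,
    TrainingArguments,
)


class EarlyDynamicsLogger(TrainerCallback):
    """Print early-stage loss and gradient norm so sub-optimal runs can be cut short."""

    def __init__(self, watch_steps=200):
        self.watch_steps = watch_steps

    def on_log(self, args, state, control, logs=None, **kwargs):
        # "grad_norm" is logged by recent transformers versions; if absent, only loss prints.
        if logs and state.global_step <= self.watch_steps:
            print(
                f"step={state.global_step} "
                f"loss={logs.get('loss')} grad_norm={logs.get('grad_norm')}"
            )


model_name = "your-3b-or-7b-base-model"  # placeholder, not a real model id
model = AutoModelForCausalLM.from_pretrained(model_name)
tokenizer = AutoTokenizer.from_pretrained(model_name)  # used to prepare the dataset (not shown)

args = TrainingArguments(
    output_dir="./sft-run",
    # Large effective batch size = per-device batch x gradient accumulation x num GPUs.
    per_device_train_batch_size=8,
    gradient_accumulation_steps=16,
    # Low learning rate paired with the large batch size.
    learning_rate=1e-5,
    # Skip warmup and keep the learning rate constant.
    warmup_steps=0,
    lr_scheduler_type="constant",
    num_train_epochs=3,
    logging_steps=10,
    bf16=True,
)

trainer = Trainer(
    model=model,
    args=args,
    train_dataset=my_instruction_dataset,  # placeholder: your tokenized instruction-tuning data
    callbacks=[EarlyDynamicsLogger(watch_steps=200)],
)
trainer.train()
```

The design choice reflected here is the pairing of a large effective batch size with a low, constant learning rate and no warmup schedule; the callback merely exposes early training dynamics so a practitioner can decide whether to terminate a run before spending the full compute budget.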
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 13333