Modelling Optimal Trade-Off Between Continued Pre-Training and Supervised Fine-Tuning for LLM Domain Adaptation

ICLR 2026 Conference Submission25437 Authors

20 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Machine Learning, Continual Pre-Training, Supervised Fine-Tuning, Parameter-Efficient Fine-Tuning (PEFT), Optimization
TL;DR: Finding the optimal data allocation between CPT and SFT for domain adaptation
Abstract: Domain adaptation is critical for tailoring pre-trained Large Language Models (LLMs) to specialised tasks without the significant cost of pre-training from scratch. Two common approaches for domain adaptation are Continual Pre-training (CPT) and Supervised Fine-Tuning (SFT), yet the data mix for each is often determined arbitrarily, based on data availability or on limited data ablations. In this paper, we present a mathematical framework to model downstream domain performance as a function of the ratio between CPT and SFT under a fixed token budget. Using 7B-parameter pre-trained LLMs, we perform domain adaptation training across three domains (health, chemistry, and coding) within a 30B-token limit. CPT uses domain-relevant subsets of Nvidia's ClimbLab dataset, while SFT employs MedQA (health), OpenCodeInstruct (programming), and ChemData700k (chemistry). The resulting models are evaluated on domain-specific QA benchmarks across sixteen CPT:SFT allocations. Results show that optimal performance, regardless of domain, arises from allocations with effective CPT:SFT token ratios between 29.9976B:2.4M and 29.9982B:1.8M, corresponding to a CPT fraction of approximately 0.99992 to 0.99994. Our optimal split demonstrates an 11.6% score improvement over the state-of-the-art domain-adapted model Code Llama and a 6.4% increase in performance on MedQA over HippoCrates Meta 7B, while approaching the performance of HippoCrates Mistral 7B at up to a 95% reduction in token budget. We further validate these findings through ablations with trained models to better understand the impact of individual datasets on the resulting model weights. Our work provides a framework for guiding efficient domain adaptation of LLMs through CPT and SFT.
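The CPT fractions quoted in the abstract follow directly from the reported token splits under the fixed 30B-token budget. A minimal sketch (the function name and token counts are taken from the abstract; nothing here is from the authors' code):

```python
# Compute the fraction of a fixed token budget allocated to CPT,
# using the optimal splits reported in the abstract.
def cpt_fraction(cpt_tokens: float, sft_tokens: float) -> float:
    """Fraction of the total token budget spent on continued pre-training."""
    return cpt_tokens / (cpt_tokens + sft_tokens)

low = cpt_fraction(29.9976e9, 2.4e6)   # 29.9976B CPT : 2.4M SFT
high = cpt_fraction(29.9982e9, 1.8e6)  # 29.9982B CPT : 1.8M SFT
print(round(low, 5), round(high, 5))   # prints 0.99992 0.99994
```

Both splits sum to exactly 30B tokens, which is why the reported CPT fractions of 0.99992 and 0.99994 fall out directly.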
Primary Area: optimization
Submission Number: 25437