TiKMiX: Efficient Semi-Dynamic Data Mixture via Data Influence for LLM Pre-training

ACL ARR 2026 January Submission 2853 Authors

03 Jan 2026 (modified: 20 Mar 2026)
License: CC BY 4.0
Keywords: LLM Pre-training; Data mixture
Abstract: The data mixture used in pre-training a language model is a cornerstone of its final performance. Static data mixing strategies in Large Language Model (LLM) pre-training are often suboptimal because they fail to adapt to the model's evolving learning state. Conversely, fully online dynamic updates, while adaptive, incur prohibitive computational costs. To bridge this gap, we propose TiKMiX, an efficient semi-dynamic data mixing framework. Our approach is grounded in a key observation of influence ranking invariance: the relative importance of data domains exhibits strong temporal stability over long training intervals. Leveraging this insight, we introduce Group Influence to quantify domain impact efficiently, formulating data mixing as a periodic, low-overhead influence maximization problem. Experiments show that TiKMiX matches the performance of dynamic baselines such as REGMIX while reducing computational overhead by 80\%. Furthermore, it achieves an average performance gain of 2\% across 9 downstream benchmarks, effectively mitigating the issue of data "under-digestion".
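To make the semi-dynamic scheme concrete, below is a minimal sketch of the loop the abstract describes: influence is re-estimated and the mixture is re-solved only at periodic checkpoints, with weights held fixed in between. This is not the paper's actual method; the `group_influence` first-order approximation, the softmax `update_mixture` rule, the `temperature` and `period` values, and the random stand-in gradients are all illustrative assumptions.

```python
import numpy as np


def group_influence(domain_grads: np.ndarray, eval_grad: np.ndarray) -> float:
    """Score a domain's group influence as the inner product between its mean
    training gradient and an evaluation gradient (a hypothetical first-order
    influence approximation, not the paper's exact estimator)."""
    return float(domain_grads.mean(axis=0) @ eval_grad)


def update_mixture(influences: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    """Map per-domain influence scores to sampling weights with a softmax,
    so higher-influence domains are sampled more often (assumed mixing rule)."""
    z = influences / temperature
    z -= z.max()  # subtract max for numerical stability
    w = np.exp(z)
    return w / w.sum()


# Semi-dynamic loop: re-estimate influence and re-mix every `period` steps,
# training with fixed weights in between (the low-overhead part of the scheme).
rng = np.random.default_rng(0)
num_domains, grad_dim, period, total_steps = 4, 8, 1000, 5000
weights = np.full(num_domains, 1.0 / num_domains)  # start from a uniform mixture

for step in range(0, total_steps, period):
    # Stand-ins for gradients that would be gathered from the model checkpoint.
    eval_grad = rng.normal(size=grad_dim)
    influences = np.array([
        group_influence(rng.normal(size=(16, grad_dim)), eval_grad)
        for _ in range(num_domains)
    ])
    weights = update_mixture(influences)
    print(f"step {step}: domain weights = {np.round(weights, 3)}")
```

Because the paper observes that domain influence rankings stay stable over long intervals, a large `period` is the design lever here: it is what lets the approach approximate fully online re-mixing at a fraction of its cost.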
Paper Type: Long
Research Area: Language Models
Research Area Keywords: Language Modeling
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 2853