TiKMiX: Efficient Semi-Dynamic Data Mixture via Data Influence for LLM Pre-training

ACL ARR 2026 January Submission 2853 Authors

03 Jan 2026 (modified: 20 Mar 2026)
License: CC BY 4.0
Keywords: LLM Pre-training; Data mixture
Abstract: The data mixture used in pre-training a language model is a cornerstone of its final performance. Static data mixing strategies in Large Language Model (LLM) pre-training are often suboptimal because they fail to adapt to the model's evolving learning state. Conversely, fully online dynamic updates, while adaptive, incur prohibitive computational costs. To bridge this gap, we propose TiKMiX, an efficient semi-dynamic data mixing framework. Our approach is grounded in a key observation of influence ranking invariance: the relative importance of data domains exhibits strong temporal stability over long training intervals. Leveraging this insight, we introduce Group Influence to quantify domain impact efficiently, formulating data mixing as a periodic, low-overhead influence maximization problem. Experiments show that TiKMiX matches the performance of dynamic baselines such as REGMIX while reducing computational overhead by 80\%. Furthermore, it achieves an average performance gain of 2\% across 9 downstream benchmarks, effectively mitigating the issue of data "under-digestion".
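To make the semi-dynamic scheme concrete, below is a minimal sketch of the loop the abstract describes: influence is re-estimated and the mixture is re-solved only at periodic checkpoints, with weights held fixed in between. This is not the paper's actual method; the `group_influence` first-order approximation, the softmax `update_mixture` rule, the `temperature` and `period` values, and the random stand-in gradients are all illustrative assumptions.

```python
import numpy as np


def group_influence(domain_grads: np.ndarray, eval_grad: np.ndarray) -> float:
    """Score a domain's group influence as the inner product between its mean
    training gradient and an evaluation gradient (a hypothetical first-order
    influence approximation, not the paper's exact estimator)."""
    return float(domain_grads.mean(axis=0) @ eval_grad)


def update_mixture(influences: np.ndarray, temperature: float = 1.0) -> np.ndarray:
    """Map per-domain influence scores to sampling weights with a softmax,
    so higher-influence domains are sampled more often (assumed mixing rule)."""
    z = influences / temperature
    z -= z.max()  # subtract max for numerical stability
    w = np.exp(z)
    return w / w.sum()


# Semi-dynamic loop: re-estimate influence and re-mix every `period` steps,
# training with fixed weights in between (the low-overhead part of the scheme).
rng = np.random.default_rng(0)
num_domains, grad_dim, period, total_steps = 4, 8, 1000, 5000
weights = np.full(num_domains, 1.0 / num_domains)  # start from a uniform mixture

for step in range(0, total_steps, period):
    # Stand-ins for gradients that would be gathered from the model checkpoint.
    eval_grad = rng.normal(size=grad_dim)
    influences = np.array([
        group_influence(rng.normal(size=(16, grad_dim)), eval_grad)
        for _ in range(num_domains)
    ])
    weights = update_mixture(influences)
    print(f"step {step}: domain weights = {np.round(weights, 3)}")
```

Because the paper observes that domain influence rankings stay stable over long intervals, a large `period` is the design lever here: it is what lets the approach approximate fully online re-mixing at a fraction of its cost.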
Paper Type: Long
Research Area: Language Models
Research Area Keywords: Language Modeling
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 2853