Abstract: The growing need for domain-specific large language models (LLMs) underscores the importance of Domain Adaptive Pre-training (DAP) for enhancing downstream task performance. While existing research has established scaling laws for corpus mixture optimization, the scaling laws governing factual knowledge injection remain unexplored. This paper bridges this gap with a case study on Arabic domain-specific factual knowledge injection via DAP. Unlike traditional scaling laws, which rely on token counts and cross-entropy loss, our approach introduces two key innovations: (1) scaling training data by domain knowledge volume rather than corpus size, and (2) using a knowledge-oriented evaluation method. We developed a scalable data synthesis pipeline that extracts factual knowledge triples from Arabic Wikipedia, generates diverse templates, and populates them to create training data. Experiments on pre-trained models of varying sizes yielded a log-linear scaling trend that incorporates model size, knowledge volume, and exposure frequency, indicating potential practical value in guiding knowledge injection training.
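The abstract describes a triple-to-text synthesis pipeline (extract knowledge triples, generate templates, populate them) with exposure frequency as a scaling knob. Below is a minimal, hypothetical Python sketch of the template-population step only; the triples, template bank, field names, and the `exposures_per_fact` parameter are illustrative assumptions, not the paper's actual pipeline or data.

```python
import itertools
import random

# Hypothetical knowledge triples in (subject, relation, object) form,
# standing in for triples extracted from Arabic Wikipedia.
TRIPLES = [
    ("Ibn Sina", "birth_year", "980"),
    ("Cairo", "country", "Egypt"),
]

# Toy bank of surface templates per relation; the paper generates a
# diverse set of these, reduced here to two forms each.
TEMPLATES = {
    "birth_year": [
        "{subj} was born in the year {obj}.",
        "The birth year of {subj} is {obj}.",
    ],
    "country": [
        "{subj} is located in {obj}.",
        "{obj} is the country where {subj} lies.",
    ],
}

def populate(triples, templates, exposures_per_fact=4, seed=0):
    """Populate templates with triples to synthesize training sentences.

    `exposures_per_fact` mimics the exposure-frequency knob in the
    scaling study: each fact is rendered this many times, cycling
    through (and repeating) its available surface forms.
    """
    rng = random.Random(seed)
    examples = []
    for subj, rel, obj in triples:
        bank = templates[rel]
        # Cycle through the template bank so every exposure of a fact
        # gets a surface form, repeating forms once the bank runs out.
        chosen = list(itertools.islice(itertools.cycle(bank), exposures_per_fact))
        rng.shuffle(chosen)
        examples.extend(t.format(subj=subj, obj=obj) for t in chosen)
    rng.shuffle(examples)
    return examples

if __name__ == "__main__":
    for line in populate(TRIPLES, TEMPLATES):
        print(line)
```

Varying `exposures_per_fact` and the number of triples in such a pipeline is one plausible way to sweep the exposure-frequency and knowledge-volume axes that the reported log-linear trend relates to model size.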
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: Large Language Model, scaling law, Domain Adaptive Pre-training
Contribution Types: Model analysis & interpretability
Languages Studied: Arabic
Submission Number: 8018