Abstract: The growing need for domain-specific large language models (LLMs) underscores the importance of Domain Adaptive Pre-training (DAP) for enhancing downstream task performance. While existing research has established scaling laws for corpus mixture optimization, the scaling laws governing factual knowledge injection remain unexplored. This paper bridges this gap with a case study on Arabic domain-specific factual knowledge injection via DAP. Unlike traditional scaling laws, which rely on token counts and cross-entropy loss, our approach introduces two key innovations: (1) scaling training data by domain knowledge volume rather than corpus size, and (2) using a knowledge-oriented evaluation method. We developed a scalable data synthesis pipeline that extracts factual knowledge triples from Arabic Wikipedia, generates diverse templates, and populates them to create training data. Experiments on pre-trained models of varying sizes yielded a log-linear scaling trend that incorporates model size, knowledge volume, and exposure frequency, indicating potential practical value in guiding knowledge injection training.
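The abstract describes a triple-to-text synthesis pipeline (extract knowledge triples, generate templates, populate them) with exposure frequency as a scaling knob. Below is a minimal, hypothetical Python sketch of the template-population step only; the triples, template bank, field names, and the `exposures_per_fact` parameter are illustrative assumptions, not the paper's actual pipeline or data.

```python
import itertools
import random

# Hypothetical knowledge triples in (subject, relation, object) form,
# standing in for triples extracted from Arabic Wikipedia.
TRIPLES = [
    ("Ibn Sina", "birth_year", "980"),
    ("Cairo", "country", "Egypt"),
]

# Toy bank of surface templates per relation; the paper generates a
# diverse set of these, reduced here to two forms each.
TEMPLATES = {
    "birth_year": [
        "{subj} was born in the year {obj}.",
        "The birth year of {subj} is {obj}.",
    ],
    "country": [
        "{subj} is located in {obj}.",
        "{obj} is the country where {subj} lies.",
    ],
}

def populate(triples, templates, exposures_per_fact=4, seed=0):
    """Populate templates with triples to synthesize training sentences.

    `exposures_per_fact` mimics the exposure-frequency knob in the
    scaling study: each fact is rendered this many times, cycling
    through (and repeating) its available surface forms.
    """
    rng = random.Random(seed)
    examples = []
    for subj, rel, obj in triples:
        bank = templates[rel]
        # Cycle through the template bank so every exposure of a fact
        # gets a surface form, repeating forms once the bank runs out.
        chosen = list(itertools.islice(itertools.cycle(bank), exposures_per_fact))
        rng.shuffle(chosen)
        examples.extend(t.format(subj=subj, obj=obj) for t in chosen)
    rng.shuffle(examples)
    return examples

if __name__ == "__main__":
    for line in populate(TRIPLES, TEMPLATES):
        print(line)
```

Varying `exposures_per_fact` and the number of triples in such a pipeline is one plausible way to sweep the exposure-frequency and knowledge-volume axes that the reported log-linear trend relates to model size.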
Paper Type: Long
Research Area: Interpretability and Analysis of Models for NLP
Research Area Keywords: Large Language Model, scaling law, Domain Adaptive Pre-training
Contribution Types: Model analysis & interpretability
Languages Studied: Arabic
Submission Number: 8018