From Amateur to Master: Infusing Domain Knowledge into LLMs via Automated Curriculum Learning

ACL ARR 2026 January Submission 2489 Authors

03 Jan 2026 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: Knowledge Infusion, Curriculum Learning, Synthetic Data Generation
Abstract: Large Language Models (LLMs) excel at general tasks but underperform in specialized domains like economics and psychology, which require deep, principled understanding. To address this, we introduce ACER (Automated Curriculum-Enhanced Regimen), a framework for targeted domain knowledge infusion that combines structured synthetic corpus generation with curriculum-aligned continual pretraining. ACER synthesizes a textbook-style curriculum with complementary question–answer pairs guided by Bloom’s taxonomy, enabling systematic coverage and progressive cognitive difficulty. The resulting synthetic corpus then drives curriculum-aligned continual pretraining, rather than relying on unstructured or naively mixed data. Experiments on Llama 3.2 (3B and 1B) show consistent improvements on five challenging MMLU subdomains, with gains of up to 5 percentage points in particularly difficult areas such as microeconomics and a macro-average improvement of about 3 points across target domains. Importantly, ACER preserves performance on non-target domains and often yields modest positive transfer. Beyond MMLU, ACER improves performance on knowledge-intensive benchmarks such as ARC and GPQA by over 2 absolute points, while maintaining stable performance on general reasoning tasks. Overall, ACER provides a scalable approach for infusing principled domain expertise into general-purpose LLMs without sacrificing their breadth.
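
To make the curriculum-alignment idea concrete, the sketch below orders a synthetic corpus by Bloom's taxonomy level before feeding it to continual pretraining. It is a minimal illustration under assumptions: the names `Example`, `curriculum_order`, and `BLOOM_LEVELS` are hypothetical, since the abstract does not describe the actual implementation.

```python
# Hypothetical sketch: order a synthetic corpus by Bloom's taxonomy level
# so continual pretraining sees progressively harder material.
# All names here are illustrative, not taken from the ACER paper.

from dataclasses import dataclass
from typing import Iterator, List

# Bloom's taxonomy, from lowest to highest cognitive demand.
BLOOM_LEVELS = ["remember", "understand", "apply", "analyze", "evaluate", "create"]
LEVEL_RANK = {name: i for i, name in enumerate(BLOOM_LEVELS)}

@dataclass
class Example:
    domain: str        # e.g. "microeconomics"
    bloom_level: str   # one of BLOOM_LEVELS
    text: str          # textbook-style passage or QA pair as plain text

def curriculum_order(corpus: List[Example]) -> Iterator[Example]:
    """Yield examples sorted by cognitive difficulty, then by domain,
    so each training phase covers all target domains at one level
    before advancing to the next."""
    yield from sorted(corpus, key=lambda ex: (LEVEL_RANK[ex.bloom_level], ex.domain))

if __name__ == "__main__":
    corpus = [
        Example("microeconomics", "apply", "Q: ... A: ..."),
        Example("psychology", "remember", "Classical conditioning is ..."),
        Example("microeconomics", "remember", "Marginal cost is ..."),
    ]
    for ex in curriculum_order(corpus):
        print(ex.bloom_level, ex.domain)
```

One plausible design choice this sketch reflects: sorting by taxonomy level first (rather than by domain first) keeps all target domains progressing together, which is consistent with the abstract's emphasis on systematic coverage alongside progressive difficulty.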
Paper Type: Long
Research Area: Language Models
Research Area Keywords: pre-training, continual learning, transfer
Contribution Types: NLP engineering experiment
Languages Studied: English
Submission Number: 2489