Keywords: knowledge injection, structured knowledge, large language models
Abstract: This paper introduces a new methodology, termed StructTuning, to efficiently transform foundation Large Language Models (LLMs) into domain specialists. It reduces the required training corpus to only 0.3% of that used by conventional knowledge injection while retaining 50% of its performance. Our method is inspired by the educational process of human students, particularly how structured domain knowledge from textbooks is assimilated and subsequently applied to real-world challenges through targeted exercises. Building on this insight, we propose a novel two-stage strategy for knowledge injection and alignment: Structure-aware Continual Pre-Training (SCPT) and Structure-aware Supervised Fine-Tuning (SSFT). In the SCPT stage, we automatically extract the domain knowledge taxonomy and reorganize the training corpora, enabling LLMs to effectively link textual segments to targeted knowledge points within the taxonomy. In the SSFT stage, we explicitly prompt models to elucidate the underlying knowledge structure in their outputs, leveraging this structured domain insight to address practical problems. Our method has been extensively evaluated across model architectures and scales on closed-book question-answering tasks from the LongBench and MMedBench datasets. Furthermore, we investigate the scalability of structure-aware knowledge injection across varying training-corpus sizes, laying a foundation for scaling StructTuning toward stronger domain-specific LLMs with more comprehensive data utilization. Code is available at this anonymous URL: https://anonymous.4open.science/r/StructTuning/.
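For illustration only, the sketch below shows one way the two stages described in the abstract could shape training examples: an SCPT-style example that pairs a corpus segment with its taxonomy path, and an SSFT-style example whose target response first states the relevant knowledge structure before answering. The data format, field names, and prompt wording are assumptions for exposition, not the authors' implementation.

```python
# Illustrative sketch only: hypothetical data preparation for SCPT and SSFT.
# Taxonomy format, field names, and prompt wording are assumed, not taken from the paper.
from dataclasses import dataclass


@dataclass
class KnowledgePoint:
    path: list[str]   # e.g. ["Cardiology", "Arrhythmia", "Atrial fibrillation"]
    segment: str      # corpus text linked to this node of the taxonomy


def build_scpt_example(point: KnowledgePoint) -> str:
    """Structure-aware continual pre-training: prepend the taxonomy path so the
    model learns to associate the text segment with its knowledge point."""
    header = " > ".join(point.path)
    return f"[Knowledge point] {header}\n{point.segment}"


def build_ssft_example(question: str, point: KnowledgePoint, answer: str) -> dict:
    """Structure-aware supervised fine-tuning: the target response first surfaces
    the relevant knowledge structure, then applies it to answer the question."""
    structured_answer = (
        f"Relevant knowledge: {' > '.join(point.path)}\n"
        f"Applying this knowledge: {answer}"
    )
    return {"prompt": question, "response": structured_answer}


if __name__ == "__main__":
    kp = KnowledgePoint(
        path=["Cardiology", "Arrhythmia", "Atrial fibrillation"],
        segment="Atrial fibrillation is characterized by irregular atrial activity...",
    )
    print(build_scpt_example(kp))
    print(build_ssft_example(
        "How is atrial fibrillation diagnosed?", kp,
        "Diagnosis typically relies on ECG findings...",
    ))
```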
Primary Area: foundation or frontier models, including LLMs
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2025/AuthorGuide.
Reciprocal Reviewing: I understand the reciprocal reviewing requirement as described on https://iclr.cc/Conferences/2025/CallForPapers. If none of the authors are registered as a reviewer, it may result in a desk rejection at the discretion of the program chairs. To request an exception, please complete this form at https://forms.gle/Huojr6VjkFxiQsUp6.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors’ identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 1586