AuditLLM: A Compact and Domain-Specialized Large Language Model Family for Intelligent Auditing via Two-Stage Continual Pre-Training
Abstract: Domain-specific large language models (LLMs) have attracted increasing research attention in recent years, and various LLMs have been proposed for specific domains such as finance, healthcare, and law. However, LLMs currently adopted in auditing face critical challenges: cloud-API restrictions under data-privacy compliance, hardware limits that preclude deploying trillion-parameter models, and the deficiencies in factual accuracy and logical rigor that general-purpose LLMs exhibit in auditing contexts. This paper addresses training LLMs for auditing and proposes a two-stage framework for developing compact, audit-specialized LLMs tailored to Chinese auditing workflows. First, Qwen2.5 is selected as the base model through systematic comparison of sub-5B-parameter architectures. Then, domain-adaptive continual pre-training with a carefully designed data-sampling strategy is performed on a curated corpus of Chinese audit texts to inject domain expertise. Finally, multi-task instruction tuning aligns the model with practical audit requirements. Extensive experiments demonstrate that the proposed framework significantly improves the performance of domain-specific LLMs on audit tasks, enhancing their accuracy and practicality in real-world applications. This study underscores the importance of domain-adaptive pre-training. The source code, models, and audit-domain dataset are publicly available at https://anonymous.4open.science/r/AuditLLM-E004
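To make the pipeline concrete, the following is a minimal sketch (not the authors' released code) of the first stage: domain-adaptive continual pre-training of a sub-5B Qwen2.5 checkpoint with a mixed-domain sampling strategy. The dataset file names, the 0.7/0.3 mixing ratio, the specific checkpoint, and all hyperparameters are illustrative assumptions, not values reported in the paper.

```python
# Sketch of stage 1 (continual pre-training) with a domain/general sampling mix.
# Assumes Hugging Face `transformers` and `datasets`; all paths and ratios are placeholders.
from datasets import load_dataset, interleave_datasets
from transformers import (AutoModelForCausalLM, AutoTokenizer,
                          DataCollatorForLanguageModeling, Trainer, TrainingArguments)

model_name = "Qwen/Qwen2.5-3B"  # assumed sub-5B base model
tokenizer = AutoTokenizer.from_pretrained(model_name)
model = AutoModelForCausalLM.from_pretrained(model_name)

# Curated Chinese audit corpus plus a general corpus to limit catastrophic forgetting.
audit = load_dataset("json", data_files="audit_corpus.jsonl", split="train")
general = load_dataset("json", data_files="general_corpus.jsonl", split="train")

# Data-sampling strategy: draw roughly 70% audit / 30% general text (assumed ratio).
mixed = interleave_datasets([audit, general], probabilities=[0.7, 0.3], seed=42)

def tokenize(batch):
    return tokenizer(batch["text"], truncation=True, max_length=2048)

tokenized = mixed.map(tokenize, batched=True, remove_columns=mixed.column_names)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="auditllm-cpt",
        per_device_train_batch_size=1,
        gradient_accumulation_steps=16,
        num_train_epochs=1,
        learning_rate=1e-5,
        bf16=True,
        logging_steps=50,
    ),
    train_dataset=tokenized,
    # Causal-LM objective (mlm=False) for next-token prediction.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()  # Stage 2 (multi-task instruction tuning) would follow on the resulting checkpoint.
```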
Paper Type: Long
Research Area: Language Modeling
Research Area Keywords: Dialogue and Interactive Systems, Efficient/Low-Resource Methods for NLP, Interpretability and Analysis of Models for NLP
Contribution Types: NLP engineering experiment, Approaches to low-resource settings, Publicly available software and/or pre-trained models, Data resources
Languages Studied: Chinese, English
Submission Number: 7664