Building Domain-Specific Small Language Models on a Shoestring via Guided Data Generation

Published: 09 Jun 2025 · Last Modified: 08 Jul 2025 · KDD 2025 Workshop SciSocLLM · CC BY 4.0
Keywords: Small language models, domain-specific models, parameter-efficient training, guided data generation
Abstract: Large Language Models (LLMs) have shown remarkable success in supporting a wide range of knowledge-intensive tasks. In specialized domains, there is growing interest in leveraging LLMs to assist subject matter experts with domain-specific challenges. However, deploying LLMs as SaaS solutions raises data privacy concerns, while many open-source models demand significant computational resources for effective domain adaptation and deployment. A promising alternative is to develop smaller, domain-specialized LLMs, though this approach is often constrained by the lack of high-quality domain-specific training data. In this work, we address these limitations by presenting a cost-efficient and scalable training pipeline that combines guided synthetic data generation from a small seed corpus with bottom-up domain data curation. Our pipeline integrates Domain-Adaptive Pre-training (DAPT), Domain-specific Supervised Fine-Tuning (DSFT), and Direct Preference Optimization (DPO) to train effective small-scale models for specialized use cases. We demonstrate this approach through DiagnosticSLM, a 3B-parameter language model tailored for fault diagnosis, root cause analysis, and repair recommendation in industrial settings. To evaluate model performance, we introduce four domain-specific benchmarks: multiple-choice questions (DiagnosticMCQ), question answering (DiagnosticQA), sentence completion (DiagnosticComp), and summarization (DiagnosticSum). DiagnosticSLM achieves a 13-25\% accuracy improvement over open-source models of comparable or larger size (2B-9B) on the MCQ task, while also outperforming or matching them in other tasks, demonstrating strong domain-specific reasoning and generalization capabilities.
Submission Number: 3
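
To make the three-stage recipe described in the abstract concrete, the sketch below illustrates the first stage, Domain-Adaptive Pre-training, as continued causal-language-modeling training of a small base model on a curated domain corpus using Hugging Face transformers. This is a minimal illustration under stated assumptions, not the paper's actual pipeline: the model name (gpt2 as a tiny, openly downloadable stand-in for a ~3B base checkpoint), the toy corpus, and all hyperparameters are placeholders. The subsequent DSFT and DPO stages would typically be run on the resulting checkpoint with a library such as TRL.

```python
# Minimal DAPT sketch: continued causal-LM pretraining on a domain corpus.
# All names, texts, and hyperparameters are illustrative placeholders.
from datasets import Dataset
from transformers import (
    AutoModelForCausalLM,
    AutoTokenizer,
    DataCollatorForLanguageModeling,
    Trainer,
    TrainingArguments,
)

BASE_MODEL = "gpt2"  # stand-in for a ~3B-parameter base checkpoint
BLOCK_SIZE = 1024

# Toy stand-in for a curated domain corpus (e.g., maintenance logs, manuals).
domain_texts = [
    "Pump P-101 exhibited cavitation; root cause traced to a clogged strainer.",
    "Replace the worn bearing and re-torque the coupling to specification.",
]

tokenizer = AutoTokenizer.from_pretrained(BASE_MODEL)
if tokenizer.pad_token is None:
    tokenizer.pad_token = tokenizer.eos_token

def tokenize(batch):
    # Standard causal-LM preprocessing: tokenize and truncate to fixed blocks.
    return tokenizer(batch["text"], truncation=True, max_length=BLOCK_SIZE)

dataset = Dataset.from_dict({"text": domain_texts}).map(
    tokenize, batched=True, remove_columns=["text"]
)

model = AutoModelForCausalLM.from_pretrained(BASE_MODEL)

trainer = Trainer(
    model=model,
    args=TrainingArguments(
        output_dir="dapt-checkpoint",
        per_device_train_batch_size=1,
        num_train_epochs=1,
        learning_rate=2e-5,
    ),
    train_dataset=dataset,
    # mlm=False => next-token (causal) objective, i.e., continued pretraining.
    data_collator=DataCollatorForLanguageModeling(tokenizer, mlm=False),
)
trainer.train()
# The DAPT checkpoint would then feed the DSFT stage (instruction tuning on
# synthetic, guided-generation Q&A pairs) and finally DPO on preference pairs.
```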