Abstract: Domain-specific large language models (LLMs) demonstrate strong domain expertise by training on large-scale, domain-aligned instruction data. However, manually constructing such datasets is resource-intensive because it requires expert annotators. A promising alternative is to use LLMs to synthesize training data. While existing frameworks effectively generate general instruction datasets, generating domain-specific instruction datasets presents three main challenges: the data must (1) be strongly aligned with the target domain, (2) exhibit high in-domain diversity, and (3) be factually grounded in domain-specific knowledge. In this paper, we present DomAINS, a three-stage framework that generates instruction datasets for any target domain using only a domain name and a brief description. DomAINS constructs a tree of domain-relevant keywords to increase in-domain diversity, retrieves factually grounded domain articles from Bing, and prompts an LLM to generate domain-aligned instruction data based on the retrieved articles. Our evaluation across nine domains shows that models tuned on DomAINS-generated datasets achieve 60–95% win rates over models trained on datasets from existing general-domain synthetic data frameworks, demonstrating the effectiveness of our approach.
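The abstract outlines a three-stage pipeline (keyword-tree expansion, article retrieval, grounded generation). The sketch below illustrates that flow under stated assumptions; `call_llm` and `search_web` are placeholder interfaces for an LLM API and a Bing search client, and the prompts and function names are illustrative, not the paper's actual implementation.

```python
# Hypothetical sketch of the three-stage DomAINS pipeline; all helpers,
# prompts, and parameters here are assumptions made for illustration.
from typing import Callable

def build_keyword_tree(domain: str, description: str,
                       call_llm: Callable[[str], str],
                       depth: int = 2, branching: int = 5) -> list[str]:
    """Stage 1: expand the domain into a tree of in-domain keywords."""
    frontier, keywords = [domain], []
    for _ in range(depth):
        children = []
        for node in frontier:
            reply = call_llm(
                f"List {branching} sub-topics of '{node}' within the domain "
                f"'{domain}' ({description}), one per line."
            )
            children.extend(line.strip() for line in reply.splitlines() if line.strip())
        keywords.extend(children)
        frontier = children
    return keywords

def generate_instruction_data(keywords: list[str],
                              search_web: Callable[[str], list[str]],
                              call_llm: Callable[[str], str],
                              articles_per_keyword: int = 3) -> list[str]:
    """Stage 2: retrieve grounding articles; Stage 3: prompt for instruction data."""
    dataset = []
    for kw in keywords:
        for article in search_web(kw)[:articles_per_keyword]:
            dataset.append(call_llm(
                "Write one instruction and a factually grounded response "
                f"about '{kw}', using only the following article:\n{article}"
            ))
    return dataset
```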
Paper Type: Long
Research Area: Generation
Research Area Keywords: Language Modeling, Question Answering, Resources and Evaluation, Dialogue and Interactive Systems, Ethics, Bias, and Fairness
Contribution Types: Approaches to low-resource settings, Data resources
Languages Studied: English
Submission Number: 5713