Abstract: Domain-specific large language models (LLMs) demonstrate strong domain expertise by training on large-scale, domain-aligned instruction data. However, manually constructing such datasets is resource-intensive because it requires expert annotators. A promising alternative is to use LLMs to synthesize training data. While existing frameworks effectively generate general instruction datasets, generating domain-specific instruction datasets presents three main challenges: the data must (1) be strongly aligned with the target domain, (2) exhibit high in-domain diversity, and (3) be factually grounded in domain-specific knowledge. In this paper, we present DomAINS, a three-stage framework that generates instruction datasets for any target domain using only a domain name and a brief description. DomAINS constructs a tree of domain-relevant keywords to increase in-domain diversity, retrieves factually grounded domain articles from Bing, and prompts an LLM to generate domain-aligned instruction data based on the retrieved articles. Our evaluation across nine domains shows that models tuned on DomAINS-generated datasets achieve 60–95% win rates over models trained on datasets from existing general-domain synthetic data frameworks, demonstrating the effectiveness of our approach.
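The abstract outlines a three-stage pipeline (keyword-tree expansion, article retrieval, grounded generation). The sketch below illustrates that flow under stated assumptions; `call_llm` and `search_web` are placeholder interfaces for an LLM API and a Bing search client, and the prompts and function names are illustrative, not the paper's actual implementation.

```python
# Hypothetical sketch of the three-stage DomAINS pipeline; all helpers,
# prompts, and parameters here are assumptions made for illustration.
from typing import Callable

def build_keyword_tree(domain: str, description: str,
                       call_llm: Callable[[str], str],
                       depth: int = 2, branching: int = 5) -> list[str]:
    """Stage 1: expand the domain into a tree of in-domain keywords."""
    frontier, keywords = [domain], []
    for _ in range(depth):
        children = []
        for node in frontier:
            reply = call_llm(
                f"List {branching} sub-topics of '{node}' within the domain "
                f"'{domain}' ({description}), one per line."
            )
            children.extend(line.strip() for line in reply.splitlines() if line.strip())
        keywords.extend(children)
        frontier = children
    return keywords

def generate_instruction_data(keywords: list[str],
                              search_web: Callable[[str], list[str]],
                              call_llm: Callable[[str], str],
                              articles_per_keyword: int = 3) -> list[str]:
    """Stage 2: retrieve grounding articles; Stage 3: prompt for instruction data."""
    dataset = []
    for kw in keywords:
        for article in search_web(kw)[:articles_per_keyword]:
            dataset.append(call_llm(
                "Write one instruction and a factually grounded response "
                f"about '{kw}', using only the following article:\n{article}"
            ))
    return dataset
```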
Paper Type: Long
Research Area: Generation
Research Area Keywords: Language Modeling, Question Answering, Resources and Evaluation, Dialogue and Interactive Systems, Ethics, Bias, and Fairness
Contribution Types: Approaches to low-resource settings, Data resources
Languages Studied: English
Submission Number: 5713