## Domain-Specific Synthetic Data

Source code for the paper "Domain-Specific Data Synthesis for LLMs through Minimal Sufficient Representation Learning".

### Abstract 
Large Language Models have demonstrated remarkable progress in general-purpose capabilities and can achieve strong performance in specific domains through fine-tuning on domain-specific data. However, acquiring high-quality data for target domains remains a significant challenge. Existing data synthesis approaches heavily rely on explicit domain descriptions expressed in natural language and careful prompt engineering, limiting their applicability in real-world scenarios where domains are difficult to describe or formally articulate. In this work, we tackle the underexplored problem of domain-specific data synthesis under implicit supervision, where the target domain is defined only through a small set of reference examples. We propose a novel framework, DOMINO, that learns a minimal sufficient domain representation from reference samples and leverages it to guide the generation of domain-aligned synthetic data. DOMINO integrates prompt tuning with a contrastive disentanglement objective to separate domain-level patterns from sample-specific noise, mitigating overfitting while preserving core domain characteristics. Theoretically, we prove that DOMINO expands the support of the synthetic data distribution, ensuring greater diversity. Empirically, we demonstrate its effectiveness and robustness across domains lacking explicit textual definitions. This work establishes a new paradigm for domain-specific data synthesis, enabling practical and scalable domain adaptation without the need for manual prompt design or natural language domain specifications.


### Code
**Note that the code is still being organized and refactored and will be fully open-sourced.**

The source code consists of two parts:
1. The first part represents the domain with only the shared ("public") domain-level soft token, and then uses that soft token to guide the LLM in synthesizing domain data.

2. The second part represents the domain with the minimal sufficient representation, trained with a contrastive loss. It learns both the domain-level soft token and the sample-level soft tokens, but only the domain-level soft token is used to guide the LLM when synthesizing domain data.
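
To make the difference between the two parts concrete, the following is a minimal sketch of how the soft tokens might be combined with an input sequence. It is not the repository's implementation: the names (`init_soft_tokens`, `build_prompt`), the sizes, the concatenation order, and the use of plain Python lists in place of tensors are all assumptions made for illustration, and the contrastive loss itself is not shown.

```python
import random

# All sizes below are hypothetical and chosen only to keep the example small.
EMBED_DIM = 4           # width of each embedding vector
NUM_DOMAIN_TOKENS = 2   # shared (domain-level) soft tokens
NUM_SAMPLE_TOKENS = 1   # soft tokens owned by each reference example


def init_soft_tokens(count, dim, rng):
    """Return `count` vectors of width `dim` with small random values."""
    return [[rng.gauss(0.0, 0.02) for _ in range(dim)] for _ in range(count)]


def build_prompt(domain_tokens, input_embeds, sample_tokens=None):
    """Prepend soft tokens to an embedded input sequence.

    Assumed order: domain-level tokens, then optional sample-level tokens,
    then the input embeddings.
    """
    prefix = list(domain_tokens)
    if sample_tokens is not None:
        prefix += list(sample_tokens)
    return prefix + list(input_embeds)


rng = random.Random(0)
domain_tokens = init_soft_tokens(NUM_DOMAIN_TOKENS, EMBED_DIM, rng)
# One set of sample-level tokens per reference example (3 examples here).
sample_tokens = [
    init_soft_tokens(NUM_SAMPLE_TOKENS, EMBED_DIM, rng) for _ in range(3)
]
# Stand-in for the LLM's embedding of a 5-token reference example.
input_embeds = init_soft_tokens(5, EMBED_DIM, rng)

# Part 1: only the shared domain-level tokens precede the input.
part1_prompt = build_prompt(domain_tokens, input_embeds)
# Part 2 (training): the example's own sample-level tokens are added too.
part2_train_prompt = build_prompt(domain_tokens, input_embeds, sample_tokens[0])
# Synthesis (both parts): only the domain-level tokens guide generation.
synthesis_prompt = build_prompt(domain_tokens, [])

print(len(part1_prompt), len(part2_train_prompt), len(synthesis_prompt))
```

In this sketch the sample-level tokens appear only in the training prompt for part 2. At synthesis time they are dropped, matching the description above that only the domain-level soft token guides generation.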
