TreeSynth: Synthesizing Diverse Data from Scratch via Tree-Guided Subspace Partitioning

Sheng Wang; Pengan CHEN; Jingqi Zhou; Qintong Li; Jingwei Dong; Jiahui Gao; Boyang XUE; Jiyue Jiang; Lingpeng Kong; Chuan Wu

TreeSynth: Synthesizing Diverse Data from Scratch via Tree-Guided Subspace Partitioning

Sheng Wang, Pengan CHEN, Jingqi Zhou, Qintong Li, Jingwei Dong, Jiahui Gao, Boyang XUE, Jiyue Jiang, Lingpeng Kong, Chuan Wu

Published: 18 Sept 2025, Last Modified: 29 Oct 2025NeurIPS 2025 spotlightEveryoneRevisionsBibTeXCC BY 4.0

Keywords: Data Synthesis, Tree-Guided Space Partitioning, Data Diversity Enhancement

TL;DR: We propose TreeSynth, a tree-guided subspace-based data synthesis approach, achieving superior data diversity, model performance, robust scalability, and data balance efficacy.

Abstract: Model customization necessitates high-quality and diverse datasets, but acquiring such data remains time-consuming and labor-intensive. Despite the great potential of large language models (LLMs) for data synthesis, current approaches are constrained by limited seed data, model biases and low-variation prompts, resulting in limited diversity and biased distribution with the increase of data scales. To tackle this challenge, we introduce TreeSynth, a tree-guided subspace-based data synthesis approach inspired by decision trees. It constructs a spatial partitioning tree to recursively divide a task-specific full data space (i.e., root node) into numerous atomic subspaces (i.e., leaf nodes) with mutually exclusive and exhaustive attributes to ensure both distinctiveness and comprehensiveness, before synthesizing samples within each atomic subspace. This globally divide-and-synthesize method finally collects subspace samples into a comprehensive dataset, effectively circumventing repetition and space collapse to ensure the diversity of large-scale data synthesis. Furthermore, the spatial partitioning tree enables sample allocation into atomic subspaces, allowing the re-balancing of existing datasets for more balanced and comprehensive distributions. Empirically, extensive experiments across diverse benchmarks consistently validates the superior data diversity, model performance, and robust scalability of TreeSynth compared to both human-crafted datasets and peer data synthesis methods, with the average performance gain reaching 10%. Besides, the consistent improvements of TreeSynth-balanced datasets highlight its efficacious application to redistribute existing datasets for more comprehensive coverage and the induced performance enhancement. The code is available at https://github.com/cpa2001/TreeSynth.

Primary Area: Applications (e.g., vision, language, speech and audio, Creative AI)

Submission Number: 10446

Loading