Keywords: Large Language Models, Synthetic Dataset Generation, Text Generation, Soft Labels, Prompt Engineering
Abstract: The scarcity of high-quality labeled data remains a critical bottleneck in natural language processing, and existing synthetic data generation approaches using Large Language Models (LLMs) rely on rigid categorical conditioning that produces polarized, unrealistic text. We introduce Probabilistic Prompting, a framework that achieves fine-grained control over LLM text generation by conditioning on continuous probability vectors rather than discrete categorical instructions. To realize this framework, we propose SoftGen, a zero-shot method implementing a three-stage pipeline: (1) sampling probability vectors from tailored distributions, (2) generating text conditioned on probabilistic prompts, and (3) self-verifying the outputs to ensure high fidelity. Through comprehensive evaluation on five text classification benchmarks, we demonstrate three key contributions. First, we provide a rigorous analysis of generation fidelity, showing that LLMs can faithfully follow probabilistic instructions and uncovering systematic relationships between label entropy and generation quality that vary with task dimensionality. Second, we show substantial downstream utility: models trained on our synthetic data achieve improved accuracy and calibration compared to traditional categorical approaches. Third, we establish theoretical foundations grounded in the Maximum Entropy Principle, including formal definitions of Generator Calibration and mathematical proofs connecting prompt entropy to output diversity. Our work demonstrates that preserving continuous probability structure in synthetic data generation provides richer supervisory signals and enables more realistic, diverse datasets that better reflect the continuous nature of semantic properties in natural language.
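The abstract's three-stage pipeline can be illustrated with a minimal sketch. The snippet below assumes a Dirichlet prior over label probabilities, a three-way sentiment label set, and hypothetical `generate` and `classifier` callables standing in for an LLM API and a verifier model; the paper's actual SoftGen prompts, sampling distributions, and verification procedure are not reproduced here.

```python
import numpy as np

LABELS = ["negative", "neutral", "positive"]  # hypothetical label set for illustration

def sample_probability_vector(alpha=0.8, num_labels=3, rng=None):
    """Stage 1: sample a soft-label vector from a Dirichlet prior (an assumed choice of distribution)."""
    rng = rng or np.random.default_rng()
    return rng.dirichlet([alpha] * num_labels)

def build_probabilistic_prompt(probs, labels=LABELS):
    """Stage 2: condition generation on a continuous probability vector rather than a single category."""
    mix = ", ".join(f"{label}: {p:.2f}" for label, p in zip(labels, probs))
    return (
        "Write a short product review whose sentiment matches the following "
        f"label distribution ({mix}). The text should plausibly receive these "
        "probabilities from human annotators."
    )

def self_verify(text, target_probs, classifier, tolerance=0.15):
    """Stage 3: keep a sample only if the verifier's predicted distribution is close to the target."""
    predicted = classifier(text)  # hypothetical callable returning a probability vector over LABELS
    return np.abs(np.asarray(predicted) - np.asarray(target_probs)).max() <= tolerance

# Usage (generate and classifier are hypothetical stand-ins, not part of the paper's code):
# probs = sample_probability_vector()
# text = generate(build_probabilistic_prompt(probs))
# if self_verify(text, probs, classifier):
#     dataset.append((text, probs))  # store the soft label, not a hard argmax label
```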
Primary Area: foundation or frontier models, including LLMs
Submission Number: 17584