LLMSynthor: Macro-Aligned Micro-Records Synthesis with Large Language Models

10 Sept 2025 (modified: 11 Feb 2026) | Submitted to ICLR 2026 | CC BY 4.0
Keywords: Data Synthesis, Urban Studies, Social Simulation, Large Language Model
Abstract: Macro-aligned micro-records are essential for simulations in social science and urban studies. For instance, epidemic models of urban disease spread are only credible when micro-level records reproduce realistic individual mobility and contact patterns, while macro-level aggregates match real-world statistics such as case counts or travel flows. Still, large-scale collection of such fine-grained data is impractical, leaving researchers with only macro-statistics (e.g., travel surveys or case counts). Large Language Models (LLMs), leveraging rich real-world priors learned from vast corpora, excel at generating realistic micro-records, but standard record-by-record sampling is inefficient and fails to enforce alignment with target macro-statistics. Given this, we propose LLMSynthor, a framework capable of synthesizing realistic micro-records that are statistically aligned with target macro-statistics. LLMSynthor transforms a pre-trained LLM into a macro-aware simulator that incrementally builds a synthetic dataset through an iterative process. At each iteration, a batch of micro-records is generated to reduce the discrepancy between synthetic and target macro-statistics. By treating the LLM as a nonparametric copula for inferring joint dependencies over variable combinations, the iterative process ensures the synthetic data are macro-statistically aligned with the target marginals and joints. To address sampling inefficiency, we introduce LLM Proposal Sampling, where the LLM, guided by discrepancies, generates a plan of proposals, each defining specific values or ranges for all variables and specifying the number of records to generate. This enables the framework to minimize discrepancies efficiently while preserving the realism grounded in the LLM’s priors. 
Evaluations on synthetic and real-world datasets (mobility, e-commerce, population) encompassing diverse formats and settings show that LLMSynthor achieves high record realism, statistical fidelity, and practical utility, making it broadly applicable across economics, social science, urban studies, and beyond.
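The iterative, discrepancy-guided loop described in the abstract can be illustrated with a minimal Python sketch. This is not the authors' implementation: the function names (`marginal_counts`, `discrepancy`, `propose`, `synthesize`), the record format, and the proposal format are all assumptions made for illustration. In particular, `propose` is a hypothetical greedy stand-in for the LLM proposal step, and the sketch aligns only one-dimensional marginals, whereas the abstract also describes alignment of joint statistics.

```python
from collections import Counter

def marginal_counts(records, variable):
    """Count how often each value of `variable` occurs in the records."""
    return Counter(r[variable] for r in records)

def discrepancy(records, targets):
    """Per-variable, per-value gap: target count minus current synthetic count."""
    gaps = {}
    for var, target in targets.items():
        have = marginal_counts(records, var)
        gaps[var] = {val: n - have.get(val, 0) for val, n in target.items()}
    return gaps

def propose(gaps, batch_size):
    """Hypothetical stand-in for the LLM proposal step (an assumption, not the paper's method).

    Returns a plan: a list of proposals, each fixing a value for every
    variable and stating how many records to generate with those values.
    Here the plan is a single greedy proposal targeting the largest deficits.
    """
    values = {var: max(g, key=g.get) for var, g in gaps.items()}
    # Never generate more records than the smallest chosen deficit allows.
    count = min(batch_size, min(gaps[var][val] for var, val in values.items()))
    return [{"values": values, "count": max(count, 0)}]

def synthesize(targets, batch_size=10, max_iters=1000):
    """Build a synthetic dataset batch by batch until the marginals match."""
    records = []
    for _ in range(max_iters):
        gaps = discrepancy(records, targets)
        if all(v == 0 for g in gaps.values() for v in g.values()):
            break  # synthetic marginals already equal the targets
        added = 0
        for proposal in propose(gaps, batch_size):
            records.extend(dict(proposal["values"]) for _ in range(proposal["count"]))
            added += proposal["count"]
        if added == 0:
            break  # no progress possible (e.g. inconsistent target totals)
    return records

# Illustrative target macro-statistics (made-up numbers, not from the paper).
targets = {
    "age_group": {"young": 30, "old": 20},
    "mode": {"car": 25, "bus": 25},
}
synthetic = synthesize(targets)
```

The greedy rule converges here only because the target totals agree across variables. It says nothing about which value combinations are realistic, which is the role the abstract assigns to the LLM's priors.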
Supplementary Material: zip
Primary Area: other topics in machine learning (i.e., none of the above)
Submission Number: 3816