Drawing Reliable Conclusions with Imperfect Synthetic Data

Published: 29 Sept 2025, Last Modified: 12 Oct 2025
Venue: NeurIPS 2025 - Reliable ML Workshop
Readers: Everyone
License: CC BY 4.0
Keywords: Human-AI Collaboration, Statistical Inference
TL;DR: We introduce a principled approach for reliably incorporating LLM-generated synthetic data into downstream statistical analyses.
Abstract: Predictions and generations from large language models are increasingly being explored as an aid in limited-data regimes, such as computational social science and human subjects research. While prior technical work has mainly explored how to use model-predicted labels for unlabeled data in a principled manner, there is increasing interest in using large language models to generate entirely new synthetic samples (e.g., synthetic simulations), such as responses to surveys. However, it remains unclear how practitioners can combine synthetic data with real data without invalidating downstream statistical conclusions. In this paper, we introduce a new estimator based on the generalized method of moments, providing a hyperparameter-free solution with strong theoretical guarantees to address this challenge. We find that interactions between the moment residuals of synthetic data and those of real data (i.e., when they are predictive of each other) can substantially improve estimates of the target parameter. To the best of our knowledge, our framework provides the first theoretically sound approach for incorporating fully synthetic samples in downstream statistical analyses.
Submission Number: 216
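
Illustrative sketch (not the paper's estimator): the abstract states that the estimator is built on the generalized method of moments and gains accuracy when synthetic and real moment residuals are predictive of each other. The toy example below shows that idea for the simplest possible target, a population mean. Everything in it is an assumption made for illustration: the setup (each real response `y` has a paired synthetic response `s_paired`, plus a larger pool `s_extra` of unpaired synthetic responses), the three moment conditions, and the names `gmm_mean` and `demo`. Because the moments are linear in the parameters, the optimally weighted GMM solution reduces to a generalized least squares solve.

```python
import numpy as np


def gmm_mean(y, s_paired, s_extra):
    """Toy GMM estimate of E[y] (hypothetical helper, not from the paper).

    Parameters are (theta, mu) with three moment conditions:
      E[y - theta] = 0        real responses
      E[s_paired - mu] = 0    synthetic responses paired with real ones
      E[s_extra - mu] = 0     additional unpaired synthetic responses
    """
    y, s_paired, s_extra = (np.asarray(a, dtype=float) for a in (y, s_paired, s_extra))
    n, m = len(y), len(s_extra)

    # Sample means that enter the three moment conditions.
    a = np.array([y.mean(), s_paired.mean(), s_extra.mean()])

    # Estimated covariance of those sample means. The off-diagonal term
    # captures how predictive synthetic residuals are of real residuals.
    omega = np.zeros((3, 3))
    omega[:2, :2] = np.cov(np.vstack([y, s_paired])) / n
    omega[2, 2] = s_extra.var(ddof=1) / m

    # Optimal GMM weighting matrix and the linear map from (theta, mu)
    # to the expected values of the three sample means.
    W = np.linalg.inv(omega)
    B = np.array([[1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])

    # Minimize (a - B p)' W (a - B p) over p = (theta, mu).
    params = np.linalg.solve(B.T @ W @ B, B.T @ W @ a)
    return params[0]


def demo(seed=0, n=50, m=2000, reps=200, theta=2.0):
    """Compare mean squared error of the real-data mean and the toy GMM estimate."""
    rng = np.random.default_rng(seed)
    naive_err, gmm_err = [], []
    for _ in range(reps):
        u = rng.normal(size=n)
        y = theta + u
        # Synthetic responses are biased (mean 1.5, not 2.0) but correlated with y.
        s_paired = 1.5 + 0.9 * u + 0.3 * rng.normal(size=n)
        s_extra = 1.5 + 0.9 * rng.normal(size=m) + 0.3 * rng.normal(size=m)
        naive_err.append((y.mean() - theta) ** 2)
        gmm_err.append((gmm_mean(y, s_paired, s_extra) - theta) ** 2)
    return float(np.mean(naive_err)), float(np.mean(gmm_err))
```

In this toy setup the synthetic responses are deliberately biased, yet the estimate of `theta` stays anchored to the real data: the synthetic moments only contribute through their covariance with the real residuals. When that covariance is zero in the sample, `gmm_mean` returns exactly the real-data mean.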