Keywords: Human-AI Collaboration, Reliable Statistical Inference, LLMs for Social Science
TL;DR: We introduce a principled approach for reliably incorporating fully synthetic samples from LLMs into downstream statistical analyses.
Abstract: There is increasing interest in using large language models to generate entirely new synthetic samples to support social science and human subject research, for example as survey responses or in simulations of human behavior. However, it is not immediately clear how practitioners can incorporate such data and still draw reliable insights and conclusions from them.
In this work, we introduce a principled framework for reliably incorporating fully synthetic samples from text-based foundation models into downstream statistical analyses. Our estimator offers a hyperparameter-free solution with strong theoretical guarantees, allowing practitioners to retain key statistical properties even when incorporating imperfect, biased synthetic data. We empirically validate the finite-sample performance of our estimator, which improves statistical efficiency, across different regression tasks in social science applications. To the best of our knowledge, our framework provides the first theoretically sound approach for safely incorporating synthetic samples from foundation models for reliable statistical inference.
Submission Number: 151