Accuracy, Diversity, and Reflection: Purpose-driven Evaluation for Social Simulation
Keywords: Social Simulation, Large Language Models, Simulation Evaluation
TL;DR: LLM-based social simulations cannot be validated by predictive accuracy alone; they require a purpose-driven evaluation framework spanning Accuracy, Diversity, and Reflection to establish operational validity.
Abstract: With the rapid advancement of Large Language Models (LLMs), social simulations are evolving to capture richer interactions. However, current validation methodologies remain predominantly focused on predictive accuracy, that is, on how closely simulation outputs match ground-truth data. In this position paper, we argue that this accuracy-centric view is insufficient, because simulations serve diverse purposes beyond prediction, such as exploring plausible futures and understanding underlying mechanisms. We propose a purpose-driven evaluation framework comprising three dimensions: Accuracy, Diversity, and Reflection. We define Accuracy and Diversity at the micro, meso, and macro levels, reflecting that social simulations should be valid not only in individual interactions but also in emergent group dynamics and aggregate outcomes, and that the appropriate evaluation methods may differ at each scale. Furthermore, we introduce Reflection as a critical dimension that evaluates whether the simulation helps users debug assumptions (process) and interpret outcomes (outcome). We conclude with guidelines for applying these complementary dimensions to establish the operational validity of LLM-based social simulations.
Submission Number: 3