Reproducibility Study: Understanding multi-agent LLM cooperation in the GovSim framework

TMLR Paper 4296 Authors

21 Feb 2025 (modified: 19 Aug 2025) · Decision pending for TMLR · CC BY 4.0
Abstract: The Governance of the Commons Simulation (GovSim) is a multi-agent Large Language Model (LLM) framework designed to study cooperation and sustainability among LLM agents in resource-sharing environments (Piatti et al., 2024). Understanding the cooperation capabilities of LLMs is vital to their real-world applicability. This study reproduces and extends the original GovSim experiments using recent small-scale open-source LLMs, including newly released instruction-tuned models such as Phi-4 and DeepSeek-R1 distill variants. We evaluate three core claims from the original paper: (1) GovSim enables the study and benchmarking of emergent sustainable behavior, (2) only the largest and most powerful LLM agents achieve a sustainable equilibrium, while smaller models fail, and (3) agents using universalization-based reasoning significantly improve sustainability. Our findings support the first claim, demonstrating that GovSim remains a valid platform for studying social reasoning in multi-agent LLM systems. However, our results challenge the second claim: recent smaller LLMs, particularly DeepSeek-R1-Distill-Qwen-14B, achieve a sustainable equilibrium, indicating that advances in model design and instruction tuning have narrowed the performance gap with larger models. Our results confirm the third claim that universalization-based reasoning improves performance in the GovSim environment; however, further analysis suggests that the improvement stems primarily from the numerical instructions provided to agents rather than from the principle of universalization itself. To generalize these findings, we extended the framework with a broader set of social reasoning frameworks. We find that reasoning strategies incorporating explicit numerical guidance consistently outperform abstract ethical prompts, highlighting the critical role of prompt specificity in shaping agent behavior.
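For concreteness, the following minimal Python sketch illustrates the kind of contrast we evaluate between an abstract universalization hint and one carrying explicit numerical guidance. The prompts, the `build_reasoning_prompt` helper, and the sustainable-share rule are hypothetical illustrations, not the original GovSim templates.

```python
# Hypothetical illustration (not the original GovSim prompts): contrasting an
# abstract universalization hint with one that adds explicit numerical guidance.

def build_reasoning_prompt(style: str, pool: int, num_agents: int) -> str:
    """Compose the reasoning hint appended to an agent's harvest-decision prompt."""
    base = (
        f"There are {pool} tons of fish in the lake and {num_agents} fishers. "
        "How many tons should you harvest this month?"
    )
    if style == "universalization":
        # Abstract ethical framing: appeal to the universalization principle only.
        hint = ("Before deciding, ask yourself: what would happen if every fisher "
                "harvested the same amount as you?")
    elif style == "numerical":
        # Explicit numerical guidance: spell out a per-agent share.
        # Illustrative rule only: split half of the pool evenly among the agents.
        sustainable_share = pool // (2 * num_agents)
        hint = (
            f"Before deciding, note that if all {num_agents} fishers harvest more than "
            f"{sustainable_share} tons each, the pool cannot regenerate."
        )
    else:
        hint = ""
    return f"{base}\n{hint}"


if __name__ == "__main__":
    for style in ("universalization", "numerical"):
        print(f"--- {style} ---")
        print(build_reasoning_prompt(style, pool=100, num_agents=5))
```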
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We would once again like to thank the reviewers and action editor for their feedback as we finalize our submission. Following the suggested revisions from the action editor, we have made the following changes in our updated manuscript:

### Subskill universalization ablation
To further examine the impact of universalization on model performance, we conducted additional experiments applying the subskill tests to models using universalization-based reasoning. The results of these experiments have been incorporated into the updated Figure 2. We believe the revised figure provides greater clarity on how models using the universalization reasoning framework structurally outperform those that do not.

### Claim on “small model success”
Upon reflection, we acknowledge that claims about the success of smaller models are not sufficiently justified without a supporting mechanistic analysis. Accordingly, we have revised the discussion to present these remarks in a more conservative and nuanced manner.

### Claim on “human-machine interaction”
We agree that the current manuscript lacks sufficient empirical evidence to support mentioning “human-machine interaction” in the introduction. In line with the suggested revision, we have removed this reference from the introduction.
Code: https://github.com/Mathijsvs03/Re-GovSim
Assigned Action Editor: ~Yaodong_Yang1
Submission Number: 4296