Reproducibility Study: Understanding multi-agent LLM cooperation in the GovSim framework

TMLR Paper4296 Authors

21 Feb 2025 (modified: 08 Jan 2026) · Decision pending for TMLR · CC BY 4.0
Abstract: Governance of the Commons Simulation (GovSim) is a Large Language Model (LLM) multi-agent framework designed to study cooperation and sustainability among LLM agents in resource-sharing environments (Piatti et al., 2024). Understanding the cooperation capabilities of LLMs is vital to the real-world applicability of these models. This study reproduces and extends the original GovSim experiments using recent small-scale open-source LLMs, including newly released instruction-tuned models such as Phi-4 and DeepSeek-R1 distill variants. We evaluate three core claims from the original paper: (1) GovSim enables the study and benchmarking of emergent sustainable behavior, (2) only the largest and most powerful LLM agents achieve a sustainable equilibrium, while smaller models fail, and (3) agents using universalization-based reasoning significantly improve sustainability. Our findings support the first claim, demonstrating that GovSim remains a valid platform for studying social reasoning in multi-agent LLM systems. However, our results challenge the second claim: recent smaller LLMs, particularly DeepSeek-R1-Distill-Qwen-14B, achieve a sustainable equilibrium, indicating that advances in model design and instruction tuning have narrowed the performance gap with larger models. Our results confirm the third claim: universalization-based reasoning improves performance in the GovSim environment. However, further analysis suggests that this improvement stems primarily from the numerical instructions provided to agents rather than from the principle of universalization itself. To generalize these findings, we extended the framework to include a broader set of social reasoning frameworks. We find that reasoning strategies incorporating explicit numerical guidance consistently outperform abstract ethical prompts, highlighting the critical role of prompt specificity in shaping agent behavior.
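To make the contrast between abstract ethical prompting and numerically explicit prompting concrete, the minimal Python sketch below illustrates the two styles in a GovSim-like fishery setting. The wording of the prompts, the function names, and the assumed regrowth rule (the pool recovers if at most half of it is harvested) are our own illustrative assumptions for exposition, not the exact prompts or mechanics used in GovSim or in our experiments.

```python
# Illustrative sketch of two prompt styles in a shared-fishery scenario.
# All wording and the sustainability rule below are hypothetical examples,
# not the verbatim GovSim prompts.

def universalization_prompt(pool: int) -> str:
    """Abstract ethical framing: ask the agent to universalize its choice."""
    return (
        f"The shared lake currently holds {pool} tons of fish.\n"
        "Before choosing how much to catch, consider: if every fisher "
        "caught the same amount as you, would the lake remain sustainable?"
    )

def numerical_guidance_prompt(pool: int, num_agents: int) -> str:
    """Numerically explicit framing: state the per-agent sustainable limit."""
    # Assumed regrowth rule: the pool recovers as long as at least half of
    # it remains, so the collective catch should not exceed pool / 2.
    collective_limit = pool // 2
    per_agent_limit = collective_limit // num_agents
    return (
        f"The shared lake currently holds {pool} tons of fish.\n"
        f"To keep the resource sustainable, the {num_agents} fishers "
        f"together should catch at most {collective_limit} tons, "
        f"i.e. at most {per_agent_limit} tons each."
    )

if __name__ == "__main__":
    print(universalization_prompt(pool=100))
    print()
    print(numerical_guidance_prompt(pool=100, num_agents=5))
```

In this framing, the second prompt removes the arithmetic burden from the agent by stating the sustainable threshold outright, which is the kind of prompt specificity our extended experiments found to drive most of the observed improvement.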
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We appreciate the continued guidance from the reviewers and the action editor as we complete the final version of our manuscript. Since the previous submission, we have made the following final revisions in response to the action editor's comments:
- Page length: The main text of the manuscript has been revised to ensure it remains within the 12-page limit.
- Appendix formatting: The spacing and layout in the appendix have been improved to reduce unnecessary whitespace.

Aside from these formatting-related adjustments, no additional substantive changes were made to the contents of the paper.
Code: https://github.com/Mathijsvs03/Re-GovSim
Assigned Action Editor: ~Yaodong_Yang1
Submission Number: 4296