Abstract: Governance of the Commons Simulation (GovSim) is a Large Language Model (LLM) multi-agent framework designed to study cooperation and sustainability between LLM agents in resource-sharing environments (Piatti et al., 2024). Understanding the cooperation capabilities of LLMs is vital to the real-world applicability of these models. This study reproduces and extends the original GovSim experiments using recent small-scale open-source LLMs, including newly released instruction-tuned models such as Phi-4 and DeepSeek-R1 distill variants. We evaluate three core claims from the original paper: (1) GovSim enables the study and benchmarking of emergent sustainable behavior, (2) only the largest and most powerful LLM agents achieve a sustainable equilibrium, while smaller models fail, and (3) agents using universalization-based reasoning significantly improve sustainability. Our findings support the first claim, demonstrating that GovSim remains a valid platform for studying social reasoning in multi-agent LLM systems. However, our results challenge the second claim: recent smaller LLMs, particularly DeepSeek-R1-Distill-Qwen-14B, achieve a sustainable equilibrium, indicating that advances in model design and instruction tuning have narrowed the performance gap with larger models. Our results confirm the third claim, that universalization-based reasoning improves performance in the GovSim environment; however, further analysis suggests that the improvement stems primarily from the numerical instructions provided to agents rather than from the principle of universalization itself. To further generalize these findings, we extended the framework to include a broader set of social reasoning frameworks. We find that reasoning strategies incorporating explicit numerical guidance consistently outperform abstract ethical prompts, highlighting the critical role of prompt specificity in influencing agent behavior.
Submission Length: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: We would like to thank the reviewers again for their feedback, which we have taken on board to make the following changes, all of which can be found in our updated manuscript:
## _Testing Larger LLMs:_
As some reviewers raised concerns about our results being limited by our analysis of smaller models only, our updated manuscript now includes results for Llama-2-70B, alongside new results for Qwen-14B and Phi-4. We believe that this broader set of models better demonstrates the differences in multi-agent reasoning performance across model sizes and types.
Section 3.2 and Table 2 have been expanded to include descriptions of the three additional models.
## _In-depth analysis of subskills:_
The subskills section of the paper is meant to help explain a model's performance, or lack thereof, in the different scenarios. This is akin to behavioural psychology research, where participants' understanding of the scenario in which they are tested is assessed before their results are seriously considered. To give more insight into the subskills, we have gone beyond reproducing the results from the original paper and now analyse how the different subskills may be better or worse predictors of performance depending on the scenario in which the model is tested, e.g. "pollution" vs. "pasture".
Relatedly, Section 3.1.2 (Subskills) was rewritten to better motivate the idea behind the subskills and to clarify precisely what each subskill measures.
While we did float the idea of demonstrating the effect of social reasoning frameworks on the subskills to one of the reviewers, we ultimately found that this was not a particularly informative test. We apologise for being too zealous in this regard, but hope that the new social reasoning experiments we performed (detailed below) provide new insights instead.
## _New social reasoning experiments_
One reviewer raised concerns that "prompt directiveness" might explain the improved performance. To disentangle the directiveness of the instructions from the provision of numerical answers, we have included new experiments in the paper that compare each social reasoning framework under three increasingly directive prompts, in order to understand precisely where the performance gains come from.
We show that while some social reasoning frameworks improve model performance through clear instructions alone, others only become useful when the agent is effectively told the correct answer, which undermines their value as a method for improving model performance.
Section 3.1.3 (Social reasoning) has been rewritten to explain how the new social-reasoning-levels experiment was conducted.
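To make the setup concrete, the sketch below illustrates what three increasingly directive prompt levels could look like for a universalization-style framework. This is a minimal, hypothetical Python example; the level names, wording, and the `build_prompt` helper are illustrative assumptions, not the exact prompts or code used in the paper.

```python
# Illustrative sketch only: three increasingly directive variants of a
# universalization-style instruction, appended to a base GovSim-like
# fishing-scenario prompt. Level names and wording are hypothetical.

DIRECTIVENESS_LEVELS = {
    # Level 1: the abstract ethical principle alone.
    "abstract": (
        "Before deciding how much to harvest, consider what would happen "
        "if every agent reasoned exactly as you do."
    ),
    # Level 2: adds explicit instructions on how to compute a sustainable share.
    "instructive": (
        "Before deciding, estimate the largest total harvest that keeps the "
        "resource sustainable, then divide it by the number of agents to "
        "obtain your own sustainable share."
    ),
    # Level 3: additionally states the numerical answer outright.
    "numerical": (
        "With 100 tons of fish and 5 agents, the sustainable total harvest "
        "is 50 tons, so your own sustainable share is 10 tons. Do not "
        "harvest more than 10 tons."
    ),
}


def build_prompt(base_prompt: str, level: str) -> str:
    """Append the chosen directiveness level to the base scenario prompt."""
    return f"{base_prompt}\n\n{DIRECTIVENESS_LEVELS[level]}"
```

Comparing agent behaviour across such levels is what allows us to separate gains that come from clearer instructions (level 2) from gains that come from simply being handed the numerical answer (level 3).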
## _Moved discussion of human-agent interaction_
We have moved the section on human-agent interaction (3.1.5) to the discussion section so that it doesn't disrupt the flow of the main paper.
## _Miscellaneous other changes_
Based on our previous major changes, we have also made the following smaller ones:
1. Section 4 (Results): This section has been entirely rewritten based on our new experiments and to improve the overall storyline of the results. Every plot in this section has been changed, either by adding new data points or by replacing it with a new plot.
2. Section 5 (Discussion): Updated to reflect the new results and interpretations.
3. Section 5.2 (Future work): Extended with a second paragraph discussing directions related to hybrid systems of human and AI agents.
We would like to emphasise that we considered each suggestion made by the reviewers and would like to thank them all once again for the time and effort they put into their reviews. We strongly believe that the paper is now much clearer and more informative thanks to the prompt and constructive reviews provided.
Assigned Action Editor: ~Yaodong_Yang1
Submission Number: 4296