# PyGAAMAS

Python Generative Autonomous Agents and Multi-Agent Systems aims to evaluate 
the social behaviors of LLM-based agents.

This prototype explores the potential of *homo silicus* for social
simulation. We examine the behaviour exhibited by intelligent
machines, particularly how generative agents deviate from
the principles of rationality. To assess their responses to simple human-like
strategies, we employ a series of tightly controlled and theoretically
well-understood games. Through behavioral game theory, we evaluate the ability
of <tt>GPT-4.5</tt>, <tt>Llama3</tt>, <tt>Mistral-Small</tt>, and
<tt>DeepSeek-R1</tt> to make coherent one-shot
decisions, generate algorithmic strategies based on explicit preferences, adhere
to first- and second-order rationality principles, and refine their beliefs in
response to other agents’ behaviours.

# Table of Contents

1. [Introduction](#introduction)
2. [Economic Rationality](#economic-rationality)
3. [Preferences](#preferences)
   - [Preference Elicitation](#preference-elicitation)
   - [Preference Alignment](#preference-alignment)
4. [Social Preference](#social-preference)
5. [Strategic Rationality](#strategic-rationality)
   - [First-Order Rationality](#first-order-rationality)
   - [Second-Order Rationality](#second-order-rationality)
6. [Beliefs - MP](#beliefs---mp)
7. [Beliefs - RPS](#beliefs---rps)
   - [Refine Beliefs](#refine-beliefs)
   - [Assess Beliefs](#assess-beliefs)
8. [Rational vs Credible](#rational-vs-credible)
9. [Coordination](#coordination)
   - [Agent-Human Coordination](#agent-human-coordination)
   - [Agent-Agent Coordination](#agent-agent-coordination)
10. [Synthesis](#synthesis)
11. [License](#license)

## Economic Rationality

To evaluate the economic rationality of various LLMs, we introduce an investment game 
designed to test whether these models follow stable decision-making patterns or react 
erratically to changes in the game’s parameters.

In this game, an investor allocates a basket $x_t=(x^A_t, x^B_t)$ of $100$ points between 
two assets: Asset A and Asset B. The value of these points depends on random prices $p_t=(p_{t}^A, p_t^B)$, 
which determine the monetary return per allocated point. For example, if $p_t^A = 0.8$ and $p_t^B = 0.5$, 
each point assigned to Asset A is worth $\$0.8$, while each point allocated to Asset B yields $\$0.5$. 
The game is played $25$ times to assess the consistency of the investor’s decisions.

To evaluate the rationality of the decisions, we use Afriat's
critical cost efficiency index (CCEI), a widely used measure in
experimental economics. The CCEI assesses whether choices adhere to the
generalized axiom of revealed preference (GARP), a fundamental principle of
rational decision-making. If an individual violates rational choice consistency,
the CCEI determines the minimal budget adjustment required to make their
decisions align with rationality. Mathematically, the budget for each basket is
calculated as: $I_t = p_t^A \times x^A_t + p_t^B \times x^B_t$. The CCEI is
derived from observed decisions by solving an optimization problem that
finds the largest $\lambda$, where $0 \leq \lambda \leq 1$, such that the
choices contain no GARP violation once every budget is scaled by $\lambda$:
basket $x_t$ counts as revealed preferred to basket $x_s$ only if
$p_t \cdot x_s \leq \lambda I_t$. This means that if we slightly reduce the budget,
multiplying it by $\lambda$, the choices will become consistent with rational
decision-making. A CCEI close to 1 indicates high rationality and consistency
with economic theory. A low CCEI suggests irrational or inconsistent
decision-making. In their 2007 study on portfolio choices, Choi et al. found 
that participants exhibited a high degree of rationality, with average CCEI values 
around 0.95:
Choi, S., Fisman, R., Gale, D., & Kariv, S. (2007). 
*Consistency and heterogeneity of individual behavior under uncertainty*. American Economic Review, 97(5), 1921–1938.
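
As an illustration of how such an index can be computed, the sketch below searches by bisection for the largest efficiency level at which the observed choices contain no GARP violation. It is a minimal pure-Python sketch under stated assumptions, not the code used in this repository: the function names `satisfies_garp` and `ccei`, the tolerance `tol`, and the two-observation data set are all hypothetical and chosen only for the example.

```python
def satisfies_garp(prices, baskets, e=1.0):
    """Return True if the choices satisfy GARP at efficiency level e."""
    n = len(baskets)
    # cost[t][s] = p_t . x_s, so cost[t][t] is the budget I_t of observation t.
    cost = [[sum(p * x for p, x in zip(prices[t], baskets[s])) for s in range(n)]
            for t in range(n)]
    # x_t is directly (strictly) revealed preferred to x_s if x_s was affordable
    # (strictly cheaper) under the scaled budget e * I_t.
    weak = [[e * cost[t][t] >= cost[t][s] for s in range(n)] for t in range(n)]
    strict = [[e * cost[t][t] > cost[t][s] for s in range(n)] for t in range(n)]
    # Transitive closure of the weak relation (Warshall's algorithm).
    closure = [row[:] for row in weak]
    for k in range(n):
        for t in range(n):
            if closure[t][k]:
                for s in range(n):
                    if closure[k][s]:
                        closure[t][s] = True
    # Violation: x_t revealed preferred to x_s while x_s is strictly
    # directly revealed preferred to x_t.
    return not any(closure[t][s] and strict[s][t]
                   for t in range(n) for s in range(n))


def ccei(prices, baskets, tol=1e-6):
    """Approximate the largest e in [0, 1] for which GARP holds."""
    if satisfies_garp(prices, baskets, 1.0):
        return 1.0
    lo, hi = 0.0, 1.0
    while hi - lo > tol:
        mid = (lo + hi) / 2
        if satisfies_garp(prices, baskets, mid):
            lo = mid
        else:
            hi = mid
    return lo


# Hypothetical data: each basket was affordable when the other one was chosen.
prices = [(1.0, 2.0), (2.0, 1.0)]
baskets = [(0.0, 10.0), (10.0, 0.0)]
print(round(ccei(prices, baskets), 3))  # 0.5
```

Bisection is appropriate here because lowering the efficiency level can only remove revealed-preference relations, so consistency at one level implies consistency at every lower level.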

To ensure response consistency, each model undergoes $30$ iterations of the game
with a fixed temperature of $0.0$. The results shown in
the figure below highlight significant differences in decision-making
consistency among the evaluated models. <tt>GPT-4.5</tt>, <tt>Llama3.3:latest</tt> 
and <tt>DeepSeek-R1:7b</tt> stand out with a
perfect CCEI score of 1.0, indicating flawless rationality in decision-making.
<tt>Qwen3</tt>, <tt>Mistral-Small</tt> and <tt>Mixtral:8x7b</tt> demonstrate the next highest level of rationality. 
<tt>Llama3</tt> performs moderately well, with CCEI values ranging between 0.2 and 0.74. 
<tt>DeepSeek-R1</tt> exhibits
inconsistent behavior, with CCEI scores varying widely between 0.15 and 0.83.

![CCEI Distribution per model](figures/investment/investment_violin.svg)

## Preferences
To analyse the behaviour of generative agents based on their preferences, we
rely on the dictator game. This variant of the ultimatum game features a single
player, the dictator, who decides how to distribute an endowment (e.g., a sum of
money) between themselves and a second player, the recipient. The dictator has
complete freedom in this allocation, while the recipient, having no influence
over the outcome, takes on a passive role.

First, we evaluate the choices made by LLMs when playing the role of the
dictator, considering these decisions as a reflection of their intrinsic
preferences. Then, we subject them to specific instructions incorporating
preferences to assess their ability to consider them in their decisions.

### Preference Elicitation

Here, we consider that the choice of an LLM as a dictator reflects its intrinsic
preferences. Each LLM is asked to directly produce a one-shot action in the
dictator game. Additionally, we also asked the models to generate a strategy in
the form of an algorithm implemented in the <tt>Python</tt> language. In all our
experiments, one-shot actions are repeated 30 times, and the models' temperature
is set to $0.7$.

The figure below presents a violin plot illustrating the share of the
total amount (\$100) that the dictator allocates to themselves for each model.
Notably, human participants under similar conditions typically keep around \$80 on average:
Forsythe, R., Horowitz, J. L., Savin, N. E., & Sefton, M. (1994). 
*Fairness in simple bargaining experiments*. **Games and Economic Behavior, 6**(3), 347–369. 
[https://doi.org/10.1006/game.1994.1021](https://doi.org/10.1006/game.1994.1021)

The median share taken by <tt>GPT-4.5</tt>, <tt>Llama3</tt>,
<tt>Mistral-Small</tt>, <tt>DeepSeek-R1</tt> and <tt>Qwen3</tt> through one-shot decisions is
\$50, likely due to corpus-based biases such as term frequency. 
The median share taken by <tt>Mixtral:8x7b</tt> and <tt>Llama3.3:latest</tt>
is \$60. When we ask the
models to generate a strategy rather than a one-shot action, all models
distribute the amount equally, except <tt>GPT-4.5</tt>, which retains about
$70\%$ of the total amount. By comparison, as noted above, humans typically
keep about \$80 under these standard conditions. When the role
assigned to the model is that of a human rather than an assistant agent, only
<tt>Llama3</tt> deviates with a median share of \$60. Unlike the deterministic strategies
generated by LLMs, the intra-model variability in generated actions can be used
to simulate the diversity of human behaviours based on their experiences,
preferences, or contexts.

![Violin Plot of My Share for Each Model](figures/dictator/dictator_violin.svg)

Our sensitivity analysis of the temperature parameter reveals that the portion
retained by the dictator remains stable. However, the decisions become more
deterministic at low temperatures, whereas allocation diversity increases at
high temperatures, reflecting a more random exploration of available options.

![My Share vs Temperature with Confidence Interval](figures/dictator/dictator_temperature.svg)

### Preference Alignment

We define four preferences for the dictator, each corresponding to a distinct form of social welfare:

1. **Egoism** maximizes the dictator’s income.
2. **Altruism** maximizes the recipient’s income.
3. **Utilitarianism** maximizes total income.
4. **Egalitarianism** maximizes the minimum income between the players.

We consider four allocation options where part of the money is lost in the division process, 
each corresponding to one of the four preferences:

- The dictator keeps \$500, the recipient receives \$100, and \$400 is lost (**egoistic**).
- The dictator keeps \$100, the recipient receives \$500, and \$400 is lost (**altruistic**).
- The dictator keeps \$400, the recipient receives \$300, and \$300 is lost (**utilitarian**).
- The dictator keeps \$325, the recipient receives \$325, and \$350 is lost (**egalitarian**).
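
To make the mapping between preferences and options explicit, the sketch below encodes each preference as a welfare function and picks the allocation that maximizes it. The names `OPTIONS`, `WELFARE`, and `best_option` are hypothetical and introduced only for this illustration; this is not the prompt or the code produced by the models.

```python
# Each option is (dictator income, recipient income); the remainder is lost.
OPTIONS = [(500, 100), (100, 500), (400, 300), (325, 325)]

# One welfare function per preference, as defined in the list above.
WELFARE = {
    "egoism": lambda dictator, recipient: dictator,
    "altruism": lambda dictator, recipient: recipient,
    "utilitarianism": lambda dictator, recipient: dictator + recipient,
    "egalitarianism": lambda dictator, recipient: min(dictator, recipient),
}


def best_option(preference):
    """Return the allocation that maximizes the welfare of the given preference."""
    welfare = WELFARE[preference]
    return max(OPTIONS, key=lambda option: welfare(*option))


for preference in WELFARE:
    print(preference, best_option(preference))
```

Each preference selects a different option, so a model's choice can be scored as aligned or not without ambiguity.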

The table below evaluates the ability of the models to align with different preferences.

- When generating **strategies**, the models align perfectly with preferences, except for:
  - <tt>DeepSeek-R1</tt> and <tt>Mixtral:8x7b</tt>, which do not generate valid code;
  - <tt>Qwen3</tt>, which fails to adopt egoistic or altruistic strategies but adheres 
    to utilitarian and egalitarian preferences.
- When generating **actions**:
  - <tt>GPT-4.5</tt> aligns well with preferences but struggles with **utilitarianism**.
  - <tt>Llama3</tt> aligns well with **egoistic** and **altruistic** preferences but shows lower adherence to **utilitarian** and **egalitarian** choices.
  - <tt>Mistral-Small</tt> aligns better with **altruistic** preferences and performs moderately on **utilitarianism** but struggles with **egoistic** and **egalitarian** preferences.
  - <tt>DeepSeek-R1</tt> primarily aligns with **utilitarianism** but has low accuracy in other preferences.
  - <tt>Qwen3</tt> strongly aligns with utilitarian preferences and moderately with altruistic ones (0.80), 
    but fails to exhibit egoistic behavior and shows weak alignment with egalitarianism.

While a larger LLM typically aligns better with preferences, a model like <tt>Mixtral-8x7B</tt> may occasionally 
underperform compared to its smaller counterpart, <tt>Mistral-Small</tt>, possibly because of its architectural complexity. 
Mixture-of-Experts (MoE) models, like Mixtral, dynamically activate only a subset of their parameters. 
If the routing mechanism isn’t well-tuned, it might select less optimal experts, leading to degraded performance.


| **Model**                    | **Generation** | **Egoistic** | **Altruistic** | **Utilitarian** | **Egalitarian** |
|------------------------------|----------------|--------------|----------------|-----------------|-----------------|
| **<tt>GPT-4.5</tt>**         | **Strategy**   | 1.00         | 1.00           | 1.00            | 1.00            |
| **<tt>Llama3.3:latest</tt>** | **Strategy**   | 1.00         | 1.00           | 1.00            | 1.00            |
| **<tt>Llama3</tt>**          | **Strategy**   | 1.00         | 1.00           | 1.00            | 1.00            |
| **<tt>Mixtral:8x7b</tt>**    | **Strategy**   | -            | -              | -               | -               |
| **<tt>Mistral-Small</tt>**   | **Strategy**   | 1.00         | 1.00           | 1.00            | 1.00            |
| **<tt>DeepSeek-R1:7b</tt>**  | **Strategy**   | 1.00         | 1.00           | 1.00            | 1.00            |
| **<tt>DeepSeek-R1</tt>**     | **Strategy**   | -            | -              | -               | -               |
| **<tt>Qwen3</tt>**           | **Strategy**   | 0.00         | 0.00           | 1.00            | 1.00            |
| **<tt>GPT-4.5</tt>**         | **Actions**    | 1.00         | 1.00           | 0.50            | 1.00            |
| **<tt>Llama3.3:latest</tt>** | **Actions**    | 1.00         | 1.00           | 0.43            | 0.96            |
| **<tt>Llama3</tt>**          | **Actions**    | 1.00         | 0.90           | 0.40            | 0.73            |
| **<tt>Mixtral:8x7b</tt>**    | **Actions**    | 0.00         | 0.00           | 0.30            | 1.00            |
| **<tt>Mistral-Small</tt>**   | **Actions**    | 0.40         | 0.94           | 0.76            | 0.16            |
| **<tt>DeepSeek-R1:7b</tt>**  | **Actions**    | 0.46         | 0.56           | 0.66            | 0.90            |
| **<tt>DeepSeek-R1</tt>**     | **Actions**    | 0.06         | 0.20           | 0.76            | 0.03            |
| **<tt>Qwen3</tt>**           | **Actions**    | 0.00         | 0.80           | 0.93            | 0.36            |

Errors in action selection may stem from either arithmetic miscalculations  
(e.g., the model incorrectly assumes that $500 + 100 > 400 + 300$) or  
misinterpretations of preferences. For example, the model `DeepSeek-R1`,  
adopting utilitarian preferences, justifies its choice by stating, "I think  
fairness is key here".

In summary, our results indicate that the models `GPT-4.5`,  
`Llama3`, and `Mistral-Small` generally align well with  
preferences but have more difficulty generating individual actions than  
algorithmic strategies. In contrast, `DeepSeek-R1` does not generate  
valid strategies and performs poorly when generating specific actions.

## Social Preference

To analyze the behavior of generative agents based on their preferences under strategic interaction, we rely on the 
ultimatum game. In this game, the proposer (analogous to the dictator) is tasked with deciding how to divide an 
endowment (e.g., a sum of money) between themselves and a second player, the responder. However, 
unlike in the dictator game, the responder plays an active role: they can either accept or reject 
the proposed allocation. If the offer is rejected, both players receive nothing.

Firstly, we evaluate the choices made by LLMs when playing the role of the proposer, interpreting these decisions as a 
reflection of their implicit social norms or strategic preferences, especially when anticipating potential 
rejection by the responder. Oosterbeek et al. find that on average the proposer offers 40% of the pie to the responder.
Oosterbeek, H., Sloof, R., & Van De Kuilen, G. (2004). 
*Cultural differences in ultimatum game experiments: Evidence from a meta-analysis*. Experimental Economics, 
7, 171–188. [https://doi.org/10.1023/B:EXEC.0000026978.14316.74](https://doi.org/10.1023/B:EXEC.0000026978.14316.74)

The figure below presents a violin plot illustrating the share of the total amount (\$100) 
that the proposer allocates to themselves for each model. The share selected by strategies 
generated by <tt>Llama3</tt>, <tt>Mistral-Small</tt>, and <tt>Qwen3</tt> aligns with the median 
share chosen by actions generated by the models <tt>Mistral-Small</tt>, <tt>Mixtral:8x7B</tt>, and 
<tt>DeepSeek-R1:7B</tt>, around \$50, likely reflecting corpus-based biases such as term frequency.
The share selected by strategies generated by <tt>Llama3.3</tt> and <tt>DeepSeek-R1:7B</tt> 
resembles the median share in the actions generated by <tt>GPT-4.5</tt> and <tt>Llama3</tt>, 
around \$60, which is consistent with what human participants typically choose under similar conditions.
While the shares selected by strategies from <tt>GPT-4.5</tt> and <tt>Mixtral:8x7B</tt> are respectively 
overestimated and underestimated, the actions generated by <tt>DeepSeek-R1:7B</tt> and <tt>Qwen3</tt> 
can be considered irrational.

![Violin Plot of My Share for Each Model](figures/ultimatum/proposer_violin.svg)

Secondly, we analyze the behavior of LLMs when assuming the role of the responder, 
focusing on whether their acceptance or rejection of offers reveals a human-like sensitivity to unfairness. 
The meta-analysis by Oosterbeek et al. (2004) reports that human participants reject 16% of offers on average, 
even though the average offer amounts to 40% of the total stake. This finding suggests that factors 
beyond purely economic self-interest, such as fairness concerns or the desire to punish perceived 
injustice, significantly influence decision-making.

The figure below presents a violin plot illustrating the acceptance rate of the responder for each 
model when offered \$40 out of \$100. While <tt>GPT-4.5</tt>, <tt>Llama3</tt>, <tt>Llama3.3</tt>, <tt>Mixtral:8x7B</tt>,
<tt>Deepseek-R1:7B</tt>, and <tt>Qwen3</tt> exhibit a rational median acceptance rate of 1.0, 
<tt>Mistral-Small</tt> and <tt>Deepseek-R1</tt> display an irrational median acceptance rate of 0.0.

It is worth noting that these results are not necessarily consistent with the strategies generated by the models. 
For instance, <tt>GPT-4.5</tt> accepts offers as low as 20%, interpreting them as minimally fair, 
while <tt>Mistral-Small</tt> employs a tiered strategy that only consistently accepts offers of 50% or more, 
and randomly accepts those between 25% and 49%. Models like <tt>Llama3</tt>, <tt>Deepseek-R1</tt>, and 
<tt>Qwen3</tt> exhibit rigid fairness thresholds, rejecting any offer below 50%. 
<tt>Llama3.3</tt> uses a slightly more permissive threshold of 30%, leading to greater acceptance 
at lower offers. These results suggest that most LLMs do not capture the influence of perceived injustice 
that shapes human decision-making in the ultimatum game.

![Violin Plot of Acceptance Rate for Each Model](figures/ultimatum/responder_violin.svg)
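
The responder strategies reported above can be written as simple decision rules. The sketch below is an illustration under stated assumptions rather than the code generated by the models: the helper names `threshold_responder` and `tiered_responder` are hypothetical, and the acceptance probability of 0.5 in the intermediate tier is an assumption, since the text only says that such offers are accepted randomly.

```python
import random


def threshold_responder(min_share):
    """Accept any offer whose share of the pie is at least min_share."""
    def accept(offer_share):
        return offer_share >= min_share
    return accept


def tiered_responder(sure_share=0.50, maybe_share=0.25, accept_prob=0.5, rng=random):
    """Always accept from sure_share, accept at random from maybe_share, else reject.

    accept_prob is an assumed value, not one reported in the text.
    """
    def accept(offer_share):
        if offer_share >= sure_share:
            return True
        if offer_share >= maybe_share:
            return rng.random() < accept_prob
        return False
    return accept


# Thresholds taken from the description of the generated strategies.
responders = {
    "GPT-4.5": threshold_responder(0.20),
    "Llama3.3": threshold_responder(0.30),
    "Llama3": threshold_responder(0.50),
    "DeepSeek-R1": threshold_responder(0.50),
    "Qwen3": threshold_responder(0.50),
}

# Decision of each strategy when offered 40% of the pie.
decisions = {name: accept(0.40) for name, accept in responders.items()}
print(decisions)
```

Under these rules, the strategies of <tt>Llama3</tt> and <tt>Qwen3</tt> reject a 40% offer, whereas the one-shot actions of the same models accept it with a median rate of 1.0.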

## Strategic Rationality

An autonomous agent acts strategically, considering not only its own preferences
but also the potential actions and preferences of others. It is strategically
rational if it chooses the optimal action based on its beliefs. This agent
satisfies second-order rationality if it is rational and believes that other
agents are rational. In other words, a second-order rational agent does not only
consider the best choice for itself but also anticipates how others make their
decisions. Experimental game theory studies show that 93% of human subjects are
rational, while 71% exhibit second-order rationality.

Forsythe, R., Horowitz, J. L., Savin, N. E., & Sefton, M. (1994). 
*Fairness in simple bargaining experiments*. **Games and Economic Behavior, 6**(3), 347–369. 
[https://doi.org/10.1006/game.1994.1021](https://doi.org/10.1006/game.1994.1021)

To evaluate the first- and second-order rationality of generative autonomous
agents, we consider a simplified version of the ring-network game,
which involves two players seeking to maximize their own payoff. Each player has
two available actions, and the payoff matrix is presented below.

| Player 1 \ Player 2 | Strategy A | Strategy B |
|---------------------|------------|-----------|
| **Strategy X**     | (15,10)    | (5,5)     |
| **Strategy Y**     | (0,5)      | (10,0)    |

If Player 2 is rational, they must choose A because B is strictly dominated. If
Player 1 is rational, they may choose either X or Y: X is the best response if
Player 1 believes that Player 2 will choose A, while Y is the best response if
Player 1 believes that Player 2 will choose B. If Player 1 satisfies
second-order rationality, they must play X. To neutralize biases in large
language models (LLMs) related to the naming of actions, we reverse the action
names in half of the experiments.
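
The reasoning above can be checked mechanically. The following sketch encodes the payoff matrix and derives the strictly dominated action of Player 2 and the best responses of Player 1. The names `PAYOFFS`, `strictly_dominated_for_p2`, and `best_response_for_p1` are hypothetical and serve only this example.

```python
# PAYOFFS[(player 1 action, player 2 action)] = (payoff of player 1, payoff of player 2)
PAYOFFS = {
    ("X", "A"): (15, 10), ("X", "B"): (5, 5),
    ("Y", "A"): (0, 5),   ("Y", "B"): (10, 0),
}
P1_ACTIONS = ("X", "Y")
P2_ACTIONS = ("A", "B")


def strictly_dominated_for_p2(action):
    """True if another action of Player 2 pays strictly more against every action of Player 1."""
    return any(
        all(PAYOFFS[(a1, other)][1] > PAYOFFS[(a1, action)][1] for a1 in P1_ACTIONS)
        for other in P2_ACTIONS
        if other != action
    )


def best_response_for_p1(p2_action):
    """Action of Player 1 that maximizes their payoff given Player 2's action."""
    return max(P1_ACTIONS, key=lambda a1: PAYOFFS[(a1, p2_action)][0])


# First-order rationality: Player 2 discards strictly dominated actions.
rational_p2 = [a for a in P2_ACTIONS if not strictly_dominated_for_p2(a)]
# Second-order rationality: Player 1 best-responds to a rational Player 2.
second_order_choice = best_response_for_p1(rational_p2[0])
print(rational_p2, second_order_choice)
```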

We consider three types of beliefs:

- an *implicit belief*, where the optimal action must be deduced from  
  the natural language description of the payoff matrix;
- an *explicit belief*, based on the analysis of Player 2's actions, meaning that 
  the fact that B is strictly dominated by A is provided in the prompt;
- a *given belief*, where the optimal action for Player 1 is explicitly given in the prompt.

We first evaluate the rationality of the agents and then their second-order rationality.


### First-Order Rationality

The table below evaluates the models’ ability to generate rational
behaviour for Player 2.

| **Model**                | **Generation** | **Given** | **Explicit** | **Implicit** |
|--------------------------|--------------|-----------|--------------|--------------|
| <tt>GPT-4.5</tt>         | strategy     | 1.00      | 1.00         | 1.00         |
| <tt>Mixtral:8x7b</tt>    | strategy     | 1.00      | 1.00         | 1.00         |
| <tt>Mistral-Small</tt>   | strategy     | 1.00      | 1.00         | 1.00         |
| <tt>Llama3.3:latest</tt> | strategy     | 1.00      | 1.00         | 0.50         |
| <tt>Llama3</tt>          | strategy     | 0.50      | 0.50         | 0.50         |
| <tt>Deepseek-R1:7b</tt>  | strategy     | -         | -            | -            |
| <tt>Deepseek-R1</tt>     | strategy     | -         | -            | -            |
| <tt>Qwen3</tt>           | strategy     | 0.00      | 0.00         | 0.00         |
| **—**                    | **—**        | **—**     | **—**        | **—**        |
| <tt>GPT-4.5</tt>         | actions      | 1.00      | 1.00         | 1.00         |
| <tt>Mixtral:8x7b</tt>    | actions      | 1.00      | 1.00         | 1.00         |
| <tt>Mistral-Small</tt>   | actions      | 1.00      | 1.00         | 0.87         |
| <tt>Llama3.3:latest</tt> | actions      | 1.00      | 1.00         | 1.00         |
| <tt>Llama3</tt>          | actions      | 1.00      | 0.90         | 0.17         |
| <tt>Deepseek-R1:7b</tt>  | actions      | 1.00      | 1.00         | 1.00         |
| <tt>Deepseek-R1</tt>     | actions      | 0.83      | 0.57         | 0.60         |
| <tt>Qwen3</tt>           | actions      | 1.00      | 0.93         | 0.50         |


When generating strategies, <tt>GPT-4.5</tt>, <tt>Mixtral-8x7B</tt>, and <tt>Mistral-Small</tt>
exhibit rational behavior, whereas <tt>Llama3</tt> chooses at random (0.50) and <tt>Qwen3</tt> is irrational.
<tt>Llama3.3:latest</tt> shows the same random behaviour, but only with implicit beliefs. 
<tt>Deepseek-R1:7b</tt> and <tt>DeepSeek-R1</tt> fail to generate valid strategies. 
When generating actions, <tt>GPT-4.5</tt>, <tt>Mixtral-8x7B</tt>, <tt>DeepSeek-R1:7b</tt>, 
and <tt>Llama3.3:latest</tt> demonstrate strong rational decision-making, even with implicit beliefs. 
<tt>Mistral-Small</tt> and <tt>Qwen3</tt> perform well but lag in handling implicit reasoning. 
<tt>Llama3</tt> struggles with implicit reasoning, while <tt>DeepSeek-R1</tt> 
shows inconsistent performance. 
Overall, <tt>GPT-4.5</tt> and <tt>Mixtral-8x7B</tt> are the most reliable models for generating rational behavior.


### Second-Order Rationality

To adjust the difficulty of optimal decision-making, we define four variants of
the payoff matrix for Player 1 in the table below: (a) the
original configuration, (b) the reduction of the gap between the gains, (c) the
decrease in the gain for the good choice X, and (d) the increase in the gain for
the bad choice Y.

| **Version**       | **a**          |          | **b**          |          | **c**          |          | **d**          |          |
|------------------|---------------|----------|---------------|----------|---------------|----------|---------------|----------|
| **Player 1 \ Player 2 (version)** | **A**   | **B**   | **A**   | **B**   | **A**   | **B**   | **A**   | **B**   |
| **X**           | 15       | 5        | 8        | 7        | 6        | 5        | 15       | 5        |
| **Y**           | 0        | 10       | 7        | 8        | 0        | 10       | 0        | 40       |
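
To see why the variants differ in difficulty, the sketch below computes, for each variant in the table above, Player 1's best response to a rational Player 2 (who plays A), the margin by which that response beats the alternative, and the action a naive "chase the largest payoff" heuristic would select. The names `P1_PAYOFFS`, `best_response`, `margin_against_a`, and `largest_payoff_actions` are hypothetical, and the naive heuristic is an assumption introduced for illustration only.

```python
# P1_PAYOFFS[variant][player 1 action][player 2 action] = payoff of Player 1
P1_PAYOFFS = {
    "a": {"X": {"A": 15, "B": 5}, "Y": {"A": 0, "B": 10}},
    "b": {"X": {"A": 8, "B": 7}, "Y": {"A": 7, "B": 8}},
    "c": {"X": {"A": 6, "B": 5}, "Y": {"A": 0, "B": 10}},
    "d": {"X": {"A": 15, "B": 5}, "Y": {"A": 0, "B": 40}},
}


def best_response(variant, p2_action="A"):
    """Action of Player 1 with the highest payoff against p2_action."""
    table = P1_PAYOFFS[variant]
    return max(table, key=lambda action: table[action][p2_action])


def margin_against_a(variant):
    """Payoff advantage of X over Y when Player 2 plays A."""
    table = P1_PAYOFFS[variant]
    return table["X"]["A"] - table["Y"]["A"]


def largest_payoff_actions(variant):
    """Actions containing the largest single payoff, ignoring Player 2's incentives."""
    table = P1_PAYOFFS[variant]
    top = max(max(row.values()) for row in table.values())
    return [action for action, row in table.items() if max(row.values()) == top]


for variant in P1_PAYOFFS:
    print(variant, best_response(variant), margin_against_a(variant),
          largest_payoff_actions(variant))
```

X remains the second-order rational choice in every variant, but the margin shrinks to 1 in (b), and the largest single payoff sits on Y in (c) and (d), which is where a payoff-chasing shortcut goes wrong.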

We introduce a prompt engineering method that incorporates Conditional Reasoning (CR), prompting the model to evaluate 
an opponent’s optimal response to each of its own possible actions to encourage strategic foresight and 
informed decision-making.

The table below evaluates the models' ability to generate second-order rational behaviour for Player 1. The configurations 
where CR improves second-order rationality are in bold, and those where CR degrades this rationality are in italics.

When generating strategies, <tt>GPT-4.5</tt> consistently exhibits second-order rational behavior in all configurations 
except (b), where it fails to distinguish the optimal action from a nearly optimal one. <tt>Llama3</tt> makes decisions randomly, 
showing no strong pattern of rational behavior. In contrast, <tt>Mistral-Small</tt> and <tt>Mixtral-8x7B</tt> 
demonstrate strong capabilities across all conditions, consistently generating second-order rational behavior. 
<tt>Llama3.3:latest</tt> performs well with given and explicit beliefs but struggles with implicit beliefs.
<tt>Qwen3</tt> generates irrational strategies. <tt>DeepSeek-R1:7b</tt> and <tt>DeepSeek-R1</tt> do not produce valid responses in strategy generation.

When generating actions, <tt>Llama3.3:latest</tt> adapts well to different types of beliefs and adjustments in the payoff matrix 
  but struggles with implicit beliefs, particularly in configuration (d). <tt>GPT-4.5</tt> performs well in the initial 
configuration (a) but encounters significant difficulties when the payoff structure changes in (b), (c), and (d), 
especially with implicit beliefs. <tt>Mixtral-8x7B</tt> generally performs well but shows reduced accuracy for implicit beliefs 
in configurations (b) and (d). <tt>Mistral-Small</tt> performs well with given or explicit beliefs but struggles with 
implicit beliefs, particularly in configuration (d). <tt>DeepSeek-R1:7b</tt>, in contrast to its smallest version, 
performs well across most belief structures but exhibits a slight decline in implicit beliefs, especially in (d). 
Meanwhile, DeepSeek-R1 struggles with lower accuracy overall, particularly for implicit beliefs.
<tt>Qwen3</tt> performs robustly across most belief types, especially in configurations (a) and (b), maintaining 
strong scores on both explicit and implicit conditions. However, like other models, it experiences a noticeable 
drop in accuracy under implicit beliefs in configuration (d), suggesting sensitivity to deeper inferential reasoning. 

It is worth noting that CR is not universally beneficial: while it notably improves reasoning in smaller models 
(like <tt>Mistral-Small</tt>, <tt>Deepseek-R1</tt> and <tt>Qwen3</tt>), especially under implicit and explicit conditions, 
it often harms performance in larger models (e.g., <tt>GPT-4.5</tt>, <tt>Llama3.3</tt> or <tt>Mixtral:8x7b</tt>), 
where CR can introduce unnecessary complexity. Most gains from CR occur in ambiguous, implicit scenarios, suggesting 
its strength lies in helping models infer missing or indirect information. Thus, CR should be applied selectively, 
particularly in less confident or under-specified contexts.


| **Version**         |                | **a**     |              |              | **b**     |              |              | **c**     |              |              | **d**     |              |              |
|---------------------|----------------|-----------|--------------|--------------|-----------|--------------|--------------|-----------|--------------|--------------|-----------|--------------|--------------|
| **Model**           | **Generation** | **Given** | **Explicit** | **Implicit** | **Given** | **Explicit** | **Implicit** | **Given** | **Explicit** | **Implicit** | **Given** | **Explicit** | **Implicit** |
| **GPT-4.5**         | strategy       | 1.00      | 1.00         | 1.00         | 0.00      | 0.00         | 0.00         | 1.00      | 1.00         | 1.00         | 1.00      | 1.00         | 1.00         |
| **Llama3.3:latest** | strategy       | 1.00      | 1.00         | 0.50         | 1.00      | 1.00         | 0.50         | 1.00      | 1.00         | 0.50         | 1.00      | 1.00         | 0.50         |
| **Llama3**          | strategy       | 0.50      | 0.50         | 0.50         | 0.50      | 0.50         | 0.50         | 0.50      | 0.50         | 0.50         | 0.50      | 0.50         | 0.50         |
| **Mixtral:8x7b**    | strategy       | 1.00      | 1.00         | 1.00         | 1.00      | 1.00         | 1.00         | 1.00      | 1.00         | 1.00         | 1.00      | 1.00         | 1.00         |
| **Mistral-Small**   | strategy       | 1.00      | 1.00         | 1.00         | 1.00      | 1.00         | 1.00         | 1.00      | 1.00         | 1.00         | 1.00      | 1.00         | 1.00         |
| **Deepseek-R1:7b**  | strategy       | -         | -            | -            | -         | -            | -            | -         | -            | -            | -         | -            | -            |
| **Deepseek-R1**     | strategy       | -         | -            | -            | -         | -            | -            | -         | -            | -            | -         | -            | -            |
| **Qwen3**           | strategy       | 0.00      | 0.00         | 0.00         | 0.00      | 0.00         | 0.00         | 0.00      | 0.00         | 0.00         | 0.00      | 0.00         | 0.00         |
| **GPT-4.5**         | actions        | 1.00      | 1.00         | 1.00         | 1.00      | 0.67         | 0.00         | 0.86      | 0.83         | 0.00         | 0.50      | 0.90         | 0.00         |
|                     | actions + CR   | 1.00      | 1.00         | 1.00         | *0.10*    | *0.20*       | **0.66**     | *0.23*    | **0.96**     | **0.86**     | *0.03*    | *0.00*       | **0.16**     |
| **Llama3.3:latest** | actions        | 1.00      | 1.00         | 1.00         | 1.00      | 1.00         | 0.50         | 1.00      | 1.00         | 0.20         | 1.00      | 1.00         | 0.00         |
|                     | actions + CR   | 1.00      | 1.00         | *0.96*       | *0.96*    | 1.00         | **0.96**     | 1.00      | 1.00         | **0.80**     | 1.00      | 1.00         | **0.90**     |
| **Llama3**          | actions        | 0.97      | 1.00         | 1.00         | 0.77      | 0.80         | 0.60         | 0.97      | 0.90         | 0.93         | 0.83      | 0.90         | 0.60         |
|                     | actions + CR   | *0.90*    | *0.90*       | *0.86*       | *0.50*    | *0.50*       | *0.50*       | *0.76*    | **0.96**     | *0.70*       | *0.67*    | *0.83*       | **0.67**     |
| **Mixtral:8x7b**    | actions        | 1.00      | 1.00         | 1.00         | 1.00      | 1.00         | 0.50         | 1.00      | 1.00         | 1.00         | 1.00      | 1.00         | 0.73         |
|                     | actions + CR   | 1.00      | *0.96*       | 1.00         | 1.00      | 1.00         | **1.00**     | 1.00      | 1.00         | 1.00         | 1.00      | 1.00         | *0.28*       |
| **Mistral-Small**   | actions        | 0.93      | 0.97         | 1.00         | 0.87      | 0.77         | 0.60         | 0.77      | 0.60         | 0.70         | 0.73      | 0.57         | 0.37         |
|                     | actions + CR   | **1.00**  | *0.93*       | 1.00         | **0.95**  | **0.96**     | **0.90**     | **0.90**  | **0.76**     | *0.43*       | *0.67*    | *0.40*       | 0.37         |
| **Deepseek-R1:7b**  | actions        | 1.00      | 0.96         | 1.00         | 1.00      | 1.00         | 0.93         | 0.96      | 1.00         | 0.92         | 0.96      | 1.00         | 0.79         |
|                     | actions + CR   | 1.00      | **1.00**     | 1.00         | 1.00      | 1.00         | **1.00**     | *0.90*    | 1.00         | **1.00**     | **1.00**  | 1.00         | **1.00**     |
| **Deepseek-R1**     | actions        | 0.80      | 0.53         | 0.56         | 0.67      | 0.60         | 0.53         | 0.67      | 0.63         | 0.47         | 0.70      | 0.50         | 0.57         |
|                     | actions + CR   | 0.80      | **0.63**     | **0.60**     | 0.67      | **0.63**     | **0.70**     | 0.67      | **0.70**     | **0.50**     | *0.63*    | **0.76**     | **0.70**     |
| **Qwen3**           | actions        | 1.00      | 1.00         | 1.00         | 0.90      | 0.96         | 1.00         | 1.00      | 0.96         | 0.70         | 1.00      | 0.96         | 0.46         |
|                     | actions + CR   | 1.00      | 1.00         | 1.00         | **1.00**  | **1.00**     | 1.00         | *0.96*    | **1.00**     | **1.00**     | *0.96*    | 0.96         | **0.83**     |

Irrational decisions are explained by inference errors based on the natural
language description of the payoff matrix. For example, in variant (d), the
Mistral-Small model with given beliefs justifies its poor decision as
follows: "Since player 2 is rational and A strictly dominates B, player 2 will
choose A. Given this, if I choose X, I get fewer points (15) than if I choose Y
(40). Therefore, choosing Y maximizes my gain."

In summary, <tt>Mixtral-8x7B</tt> and <tt>GPT-4.5</tt> demonstrate the strongest performance in both first- and 
second-order rationality, though <tt>GPT-4.5</tt> struggles with near-optimal decisions and <tt>Mixtral-8x7B</tt> has 
reduced accuracy with implicit beliefs. <tt>Mistral-Small</tt> also performs well but faces difficulties with 
implicit beliefs, particularly in second-order reasoning. <tt>Llama3.3:latest</tt> succeeds with explicit or 
given beliefs but struggles significantly with implicit beliefs, limiting its effectiveness in more complex
decision-making. <tt>DeepSeek-R1:7b</tt> shows strong first-order rationality but its performance declines with 
implicit beliefs, especially in second-order rationality tasks. In contrast, <tt>DeepSeek-R1</tt> and Llama3 exhibit 
inconsistent and often irrational decision-making, failing to generate valid strategies in many cases. 
Qwen3 struggles to generate valid strategies, reflecting limited high-level planning. However, it shows strong 
first-order rationality when producing actions, especially under explicit or guided conditions, 
and benefits from conditional reasoning. Its performance declines with implicit beliefs, highlighting limitations 
in deeper inference.

## Beliefs - MP

Beliefs are central to autonomous decision-making because they help agents anticipate the actions of others. 
To assess the agents’ ability to refine their beliefs about an opponent's next move and to integrate these predictions 
into their decision-making, we use the Matching Pennies (MP) game.

MP is a model of strict competition, where coordination benefits one player at the expense of the other. 
The game is played between two players, Even and Odd. Each player secretly selects either `Heads` or `Tails`, 
then both reveal their choices simultaneously.

If the pennies match (both `Heads` or both `Tails`), Even wins. Otherwise, Odd wins. The winner keeps both pennies.

### Table: Payoff Matrix for the MP Game

| Even \ Odd | `Heads`   | `Tails`   |
|------------|-----------|-----------|
| `Heads`    | (1, -1)    | (-1, 1)   |
| `Tails`    | (-1, 1)    | (1, -1)   |

The game has no pure strategy Nash equilibrium, as there is no single best move for either player. 
Instead, the unique Nash equilibrium is in mixed strategies: each player chooses `Heads` or `Tails` 
with equal probability. 
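The indifference argument behind this mixed equilibrium can be checked numerically. The following is a minimal sketch in Python based on the payoff matrix above; the names `PAYOFF_EVEN` and `expected_payoff_even` are illustrative and not part of PyGAAMAS.

```python
from fractions import Fraction

# Payoffs to Even, indexed as PAYOFF_EVEN[even_move][odd_move].
# The game is zero-sum, so Odd's payoff is the negative of Even's.
PAYOFF_EVEN = {
    "Heads": {"Heads": 1, "Tails": -1},
    "Tails": {"Heads": -1, "Tails": 1},
}


def expected_payoff_even(even_move, p_odd_heads):
    """Expected payoff to Even for a pure move when Odd plays Heads
    with probability p_odd_heads."""
    row = PAYOFF_EVEN[even_move]
    return p_odd_heads * row["Heads"] + (1 - p_odd_heads) * row["Tails"]


half = Fraction(1, 2)
# Against a 50/50 opponent, both pure moves give the same expected payoff (0),
# so Even has no profitable deviation.
indifferent = (
    expected_payoff_even("Heads", half)
    == expected_payoff_even("Tails", half)
    == 0
)

# Against a biased opponent, one pure move is strictly better,
# which is why any predictable bias can be exploited.
biased = Fraction(7, 10)
exploitable = (
    expected_payoff_even("Heads", biased) > expected_payoff_even("Tails", biased)
)
```

By symmetry, the same computation from Odd's side shows that Even must also mix with equal probability.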

In repeated Matching Pennies games, human players often search for patterns in their opponent’s actions, 
leading to cyclical behaviors where they attempt to exploit perceived regularities, 
even though the optimal strategy would be to randomize [Goeree et al., 2003].

J. K. Goeree, C. A. Holt, and T. R. Palfrey, “Risk averse
behavior in generalized matching pennies games,”
Games and Economic Behavior, vol. 45, no. 1, pp. 97–
113, 2003, first World Congress of the Game Theory
Society.

We consider a setup where Odd uses a hidden fixed strategy, and the generative agent playing Even must 
predict their move (`Heads` or `Tails`). Correct predictions earn 1 point; incorrect ones earn 0. 
The game lasts 10 rounds, with accuracy measured each round. Opponents use simple strategies: either constant 
actions (`Heads` or `Tails`) or two-step alternating patterns (`Heads-Tails` or `Tails-Heads`).
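The prediction task can be sketched as a small harness. This is an illustrative Python sketch under stated assumptions, not the PyGAAMAS implementation: the `OPPONENTS` encoding, the function names, and the `repeat_last` baseline predictor are all hypothetical.

```python
from itertools import cycle, islice

# Hypothetical encodings of the four hidden opponent strategies.
OPPONENTS = {
    "constant-Heads": ["Heads"],
    "constant-Tails": ["Tails"],
    "Heads-Tails": ["Heads", "Tails"],
    "Tails-Heads": ["Tails", "Heads"],
}


def opponent_moves(pattern, rounds=10):
    """Repeat the pattern until the requested number of rounds is reached."""
    return list(islice(cycle(pattern), rounds))


def prediction_scores(predict, pattern, rounds=10):
    """Per-round scores: 1 for a correct prediction, 0 otherwise.

    `predict` receives the opponent's past moves and returns a guess
    for the next one.
    """
    moves = opponent_moves(pattern, rounds)
    return [int(predict(moves[:t]) == moves[t]) for t in range(rounds)]


def repeat_last(history):
    """Baseline (an assumption, not a model from the study): expect the
    opponent to repeat its last move, and guess Heads in the first round."""
    return history[-1] if history else "Heads"


scores_constant = prediction_scores(repeat_last, OPPONENTS["constant-Tails"])
scores_alternating = prediction_scores(repeat_last, OPPONENTS["Heads-Tails"])
```

A baseline of this kind does well against constant opponents but is systematically wrong against alternating ones, which is why the two opponent families are reported separately.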

The models exhibit varied approaches to strategy generation in the Matching Pennies (MP) game. 
`GPT-4.5` follows a fixed alternating pattern, switching between `Heads` and `Tails` each turn, 
under the assumption that the opponent behaves similarly. `Mistral-Small` adopts a reactive heuristic, 
analyzing the frequency of the opponent’s past moves and selecting the less common one. 
By contrast, `Qwen3` relies on randomness, choosing moves unpredictably while presuming the opponent will mirror its choice. 
`Llama3` does not implement a functioning strategy. 

Overall, these model-generated strategies are simplistic and heuristic-based, often lacking the credibility and 
adaptability needed for effective play in adversarial settings like MP.

**Figure 1** (prediction accuracy) and **Figure 2** (average points earned) illustrate the 
per-round performance of each model against a constant opponent strategy, with 95% confidence intervals. 
These results reflect the models' approaches to action generation when facing a simple, predictable opponent. 
Although most LLMs—except for `Mistral-Small`—can accurately predict the opponent’s next move, 
only `GPT-4.5` and `Qwen3` effectively incorporate this information into their decision-making to choose the winning move.

![Prediction Accuracy per Round by Actions Against Constant Behaviour (with 95% Confidence Interval)](figures/mp/mp_prediction_ConstHT.svg)
![Points Earned per Round by Actions Against Constant Behaviour (with 95% Confidence Interval)](figures/mp/mp_payoff_ConstHT.svg)

**Figure 3** (prediction accuracy) and **Figure 4** (average points earned) illustrate the per-round performance 
of each model against an alternating opponent strategy.  We observe that the performance of all LLMs in 
action generation declines when facing a nontrivial, alternating strategy. While some models can detect the pattern, 
most fail to adapt their behavior accordingly, highlighting limitations in integrating learned beliefs into 
effective decision-making.

![Prediction Accuracy per Round by Actions Against Alternate Behaviour (with 95% Confidence Interval)](figures/mp/mp_prediction_Altern.svg)
![Points Earned per Round by Actions Against Alternate Behaviour (with 95% Confidence Interval)](figures/mp/mp_payoff_Altern.svg)

## Beliefs - RPS

Beliefs — whether implicit, explicit, or
given — are crucial for an autonomous agent's decision-making process. They
allow for anticipating the actions of other agents.

### Refine Beliefs

To assess the agents' ability to refine their beliefs in predicting their
interlocutor's next action, we consider a simplified version of the
Rock-Paper-Scissors (RPS) game where:
- the opponent follows a hidden strategy, i.e., a repetition model;
- the player must predict the opponent's next move (Rock, Paper, or Scissors);
- a correct prediction earns 1 point, while an incorrect one earns 0 points;
- the game can be played for $N=10$ rounds, and the player's accuracy is evaluated at each round.

For our experiments, we consider three simple models for the opponent where:
- the actions remain constant in the form of R, S, or P, respectively;
- the opponent's actions follow a two-step loop model (R-P, P-S, S-R);
- the opponent's actions follow a three-step loop model (R-P-S).

We evaluate the models' ability to identify these behavioural patterns by
calculating the average number of points earned per round.
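The three opponent models and the per-round average can be sketched as follows. This is a minimal Python sketch under stated assumptions: the `OPPONENT_MODELS` encoding, the function names, and the trivial `always_rock` predictor are illustrative and do not come from the PyGAAMAS code base.

```python
from itertools import cycle, islice
from statistics import mean

# Hypothetical encoding of the three opponent models described above.
OPPONENT_MODELS = {
    "constant": [["R"], ["S"], ["P"]],
    "2-loop": [["R", "P"], ["P", "S"], ["S", "R"]],
    "3-loop": [["R", "P", "S"]],
}


def play(predict, pattern, rounds=10):
    """Score one game: 1 point per correct prediction, 0 otherwise."""
    moves = list(islice(cycle(pattern), rounds))
    return [int(predict(moves[:t]) == moves[t]) for t in range(rounds)]


def average_points_per_round(predict, model, rounds=10):
    """Average the per-round points over every pattern of a given model."""
    games = [play(predict, pattern, rounds) for pattern in OPPONENT_MODELS[model]]
    return [mean(per_round) for per_round in zip(*games)]


def always_rock(history):
    """Trivial placeholder predictor, used only to exercise the harness."""
    return "R"


constant_curve = average_points_per_round(always_rock, "constant")
```

Any predictor with the same signature, for example one wrapping an LLM call, could be passed in place of `always_rock`.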

The figures below present the average points earned per round, with the
95% confidence interval, for each LLM against the three opponent behavior
models in the simplified version of the Rock-Paper-Scissors (RPS) game,
whether the LLM generates a strategy or one-shot actions.

None of <tt>Llama3</tt>, <tt>DeepSeek-R1</tt>, or <tt>Qwen3</tt> was able to generate a valid strategy.
<tt>DeepSeek-R1:7b</tt> was unable to generate either a valid strategy
or consistently valid actions. The strategies generated by the <tt>GPT-4.5</tt>
and <tt>Mistral-Small</tt> models attempt to predict the opponent’s next move based
on previous rounds by identifying the most frequently played move.
While these strategies are effective against an opponent with constant behavior,
they fail to predict the opponent's next move when the latter adopts a more complex model.
In action generation, the performance of most LLMs is barely better than that of a <tt>random</tt> strategy;
the exceptions are <tt>Llama3.3:latest</tt>, <tt>Mixtral:8x7b</tt>, <tt>Mistral-Small</tt>, and <tt>Qwen3</tt>
when facing a constant strategy.
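To illustrate why a "most frequently played move" heuristic works against constant opponents but not against loops, here is a minimal Python sketch. It is a reconstruction of the idea described above, not the code actually produced by the models; the function names and the default first guess are assumptions.

```python
from collections import Counter
from itertools import cycle, islice


def most_frequent_move(history, default="R"):
    """Predict the move the opponent has played most often so far.

    With an empty history, fall back to `default` (an arbitrary choice).
    Ties go to the move seen first, following Counter's ordering.
    """
    if not history:
        return default
    return Counter(history).most_common(1)[0][0]


def total_points(pattern, rounds=10):
    """Points earned over a game against a repeating opponent pattern."""
    moves = list(islice(cycle(pattern), rounds))
    return sum(most_frequent_move(moves[:t]) == moves[t] for t in range(rounds))


points_constant = total_points(["P"])
points_two_loop = total_points(["P", "S"])
points_three_loop = total_points(["R", "P", "S"])
```

Against a constant opponent the heuristic locks onto the right move after the first observation. Against a loop, the move counts stay balanced, so the prediction remains stuck on one move while the opponent keeps cycling.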


![Average Points Earned per Round By Strategies Against Constant Behaviour (with 95% Confidence Interval)](figures/guess/guess_constant_strategies.svg)
![Average Points Earned per Round By Actions Against Constant Behaviour (with 95% Confidence Interval)](figures/guess/guess_constant_models.svg)

![Average Points Earned per Round by Strategies Against 2-Loop Behaviour (with 95% Confidence Interval)](figures/guess/guess_2loop_strategies.svg)
![Average Points Earned per Round by Actions Against 2-Loop Behaviour (with 95% Confidence Interval)](figures/guess/guess_2loop_models.svg)

![Average Points Earned per Round by Strategies Against 3-Loop Behaviour (with 95% Confidence Interval)](figures/guess/guess_3loop_strategies.svg)
![Average Points Earned per Round by Actions Against 3-Loop Behaviour (with 95% Confidence Interval)](figures/guess/guess_3loop_models.svg)

### Assess Beliefs

To assess the agents’ ability to factor the prediction of their opponent’s next
move into their decision-making, we analyse the performance of each generative
agent in the RPS game. In this setup, a victory awards 2 points, a draw 1 point,
and a loss 0 points.
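The scoring rule, and the gap between predicting a move and acting on that prediction, can be sketched in a few lines of Python. The names `BEATS`, `rps_points`, and `best_response` are illustrative and not taken from PyGAAMAS.

```python
# Each key beats its value under the standard RPS rules.
BEATS = {"Rock": "Scissors", "Paper": "Rock", "Scissors": "Paper"}


def rps_points(player_move, opponent_move):
    """Points for the player: 2 for a victory, 1 for a draw, 0 for a loss."""
    if player_move == opponent_move:
        return 1
    return 2 if BEATS[player_move] == opponent_move else 0


def best_response(predicted_opponent_move):
    """Return the move that beats the predicted opponent move."""
    return next(
        move for move, loser in BEATS.items() if loser == predicted_opponent_move
    )


# An agent that correctly predicts "Rock" but simply plays its prediction
# only draws; it has to play the best response to win.
points_if_copying = rps_points("Rock", "Rock")
points_if_responding = rps_points(best_response("Rock"), "Rock")
```

This distinction matters for the results below: a correct belief about the opponent only pays off once it is converted into the corresponding best response.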

The figure below illustrates the average points earned per round, along with
the 95% confidence interval, for each LLM when facing constant strategies
and generating one-shot actions.
Even though <tt>Mixtral:8x7b</tt>, <tt>Mistral-Small</tt>, and <tt>Qwen3</tt> accurately predict their
opponent’s move, they fail to integrate this belief into
their decision-making process. Only <tt>Llama3.3:latest</tt> is capable of inferring
the opponent’s behavior to choose the winning move.

In summary, generative autonomous agents struggle to anticipate or effectively
incorporate other agents’ actions into their decision-making.

![Average Points Earned per Round Against Constant Behaviour (with 95% Confidence Interval)](figures/rps/rps_constant.svg)

## Rational vs Credible

To assess whether a generative agent can adopt either individually rational behavior
or credible behavior that simulates human-like decision-making, we consider the Prisoner’s Dilemma game.

The Prisoner’s Dilemma is the canonical model of a social dilemma — 
a situation where individual rationality leads to collective irrationality.  This game models two 
accomplices who are arrested and interrogated separately. Each has two options: 
Cooperate with the other by remaining silent, or Defect by betraying the other. The possible outcomes are:
- If both remain silent (cooperate), they get light sentences.
- If one defects while the other stays silent, the defector goes free while the cooperator receives a heavy sentence.
- If both defect, they both receive moderately heavy sentences.

The Prisoner’s Dilemma is characterized by the following ordering of payoffs:

|              | Cooperate | Defect  |
|--------------|-----------|---------|
| **Cooperate**| (R, R)    | (S, T)  |
| **Defect**   | (T, S)    | (P, P)  |

with T > R > P > S where:
- T (Temptation): The payoff for defecting while the other cooperates — the best personal outcome.
- R (Reward): The payoff for mutual cooperation — good for both, but not individually tempting.
- P (Punishment): The outcome when both defect — worse than cooperation but better than being exploited.
- S (Sucker’s Payoff): The worst outcome, received when one cooperates and the other defects.

While defection is the dominant strategy for a rational agent, humans frequently cooperate. 
Cooperation rates in lab experiments often range from 30–60% in one-shot Prisoner’s Dilemmas.

Sally, D.: Conversation and Cooperation in Social Dilemmas: A Meta-Analysis of Experiments from 1958 to 1992. 
Rationality and Society 7(1), 58–92 (1995), https://doi.org/10.1177/1043463195007001004

To assess the ability of generative agents to adopt a role, we compared three prompts:
- Rational: "You are a rational agent which aims at maximizing its own utility".
- Human: "You are an agent simulating a human with social preferences such as fairness, reciprocity, and aversion to inequity".
- Neutral: No specific instruction provided.

To adjust the difficulty of decision-making, we define four variants of the payoff matrix in the table below: 
- Classic: the original configuration with standard payoffs.
- High: increased reward for defecting, widening the gap between Temptation (T) and Reward (R).
- Mild: softened outcomes with smaller differences between payoffs.
- Cooperation Loss: increased penalty for cooperating when betrayed, with a large negative payoff for the sucker’s outcome (S).

| **Version**              | **Classic** |        | **High** |         | **Mild**    |        | **Coop. Loss** |         |
|--------------------------|-------------|--------|----------|---------|-------------|--------|----------------|---------|
| **Player 1 \ Player 2**  | **C**       | **D**  | **C**    | **D**   | **C**       | **D**  | **C**          | **D**   |
| **C**                    | (3, 3)      | (0, 5) | (6, 6)   | (1, 10) | (2.5, 2.5)  | (1, 3) | (6, 6)         | (-3, 8) |
| **D**                    | (5, 0)      | (1, 1) | (10, 1)  | (2, 2)  | (3, 1)      | (2, 2) | (8, -3)        | (2, 2)  |
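As a sanity check, the ordering T > R > P > S and the dominance of defection can be verified for each variant. The Python sketch below reads the (T, R, P, S) values off the table above; the `VARIANTS` dictionary and the function names are illustrative and not part of PyGAAMAS.

```python
# (T, R, P, S) for each variant, read from the payoff table.
VARIANTS = {
    "Classic": (5, 3, 1, 0),
    "High": (10, 6, 2, 1),
    "Mild": (3, 2.5, 2, 1),
    "Coop. Loss": (8, 6, 2, -3),
}


def is_prisoners_dilemma(t, r, p, s):
    """Check the defining payoff ordering T > R > P > S."""
    return t > r > p > s


def defect_strictly_dominates(t, r, p, s):
    """Defect is better whether the other player cooperates (T > R)
    or defects (P > S)."""
    return t > r and p > s


all_valid = all(
    is_prisoners_dilemma(*payoffs) and defect_strictly_dominates(*payoffs)
    for payoffs in VARIANTS.values()
)

# Gap between Temptation and Reward, which the High variant widens.
temptation_gap = {name: t - r for name, (t, r, p, s) in VARIANTS.items()}
```

All four variants therefore remain genuine Prisoner’s Dilemmas; they differ only in how strongly each payoff gap pushes toward defection.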

To minimize the influence of semantic bias in LLMs, we replace descriptive action labels **Cooperate** 
and **Defect** with neutral placeholders (**Foo** and **Bar**). This anonymized setup (marked as ano. in the table)
helps ensure that the agent’s choices reflect the underlying payoffs rather than social connotations tied 
to specific words.

The table below reports the cooperation rates of the various models in action generation.

`GPT-4.5` consistently defects under the `Rational` prompt across all payoff matrices, demonstrating correct alignment
with utility-maximizing behavior. Importantly, its decisions remain invariant under anonymization, indicating that it is
not relying on semantic cues such as "Cooperate" or "Defect" but is responding to the actual payoff structure. 
However, under the `Human` prompt, `GPT-4.5` always cooperates, regardless of the payoff configuration. This lack of 
variation reveals an overfitting to the social prompt — it simulates idealized prosocial behavior without adapting to 
different incentive environments, thus failing the test of payoff sensitivity expected from human-like reasoning.

`Mistral-Small`, on the other hand, shows more nuanced behavior. While it defects under the `Rational` prompt in 
high-risk or high-reward variants and cooperates more under `Human`, it also modulates cooperation rates in response 
to the payoffs, especially under the `Human` prompt. For example, cooperation drops slightly in the "Cooperation Loss" 
condition, suggesting some recognition of the increased risk of being exploited. Additionally, `Mistral-Small` 
is mostly robust to anonymization, showing consistent behavior whether standard or neutral action labels are used, 
particularly under the `Human` role.

In contrast, models like `Llama3.3` and `Mixtral` produce uniform cooperation across all conditions and prompts, 
suggesting a failure to internalize role differences or payoff structures. These models act as if they have a 
fixed bias toward cooperation, likely driven by training data priors, rather than context-sensitive reasoning. 
`Qwen3` exhibits the opposite failure mode: it is overly rigid, rarely cooperating even under `Human` prompts, and 
shows erratic drops in cooperation under anonymization, indicating semantic overreliance and poor role alignment.

It is worth noting that most LLMs are unable to generate strategies for this game, and the strategies they do 
generate are insensitive to the role being played.


| **Version**        |                | **Classic**  |             |           | **High**     |             |           | **Mild**     |             |           | **Coop. Loss**  |             |           |
|--------------------|----------------|--------------|-------------|-----------|--------------|-------------|-----------|--------------|-------------|-----------|-----------------|-------------|-----------|
| **Model**          | **Generation** | **Rational** | **Neutral** | **Human** | **Rational** | **Neutral** | **Human** | **Rational** | **Neutral** | **Human** | **Rational**    | **Neutral** | **Human** |
| **GPT-4.5**        | actions        | 0.00         | 0.00        | 1.00      | 0.00         | 0.00        | 1.00      | 0.00         | 0.00        | 1.00      | 0.00            | 0.00        | 1.00      |
|                    | actions + ano  | 0.00         | 0.00        | 1.00      | 0.00         | 0.00        | 1.00      | 0.00         | 0.00        | 1.00      | 0.00            | 0.00        | 1.00      |
| **Llama3.3:70b**   | actions        | 0.67         | 1.00        | 1.00      | 0.50         | 1.00        | 1.00      | 0.40         | 1.00        | 1.00      | 0.73            | 1.00        | 1.00      |
|                    | actions + ano  | 0.20         | 0.93        | 1.00      | 0.60         | 1.00        | 1.00      | 0.27         | 0.93        | 1.00      | 0.10            | 0.90        | 1.00      |
| **Llama3:latest**  | actions        | 0.60         | 1.00        | 1.00      | 0.73         | 1.00        | 1.00      | 0.67         | 1.00        | 1.00      | 0.73            | 0.97        | 0.97      |
|                    | actions + ano  | 0.43         | 0.40        | 0.80      | 0.50         | 0.73        | 0.90      | 0.40         | 0.53        | 0.96      | 0.63            | 0.37        | 0.83      |
| **Mixtral:8x7b**   | actions        | 0.90         | 1.00        | 1.00      | 0.23         | 1.00        | 1.00      | 0.80         | 1.00        | 1.00      | 0.90            | 1.00        | 1.00      |
|                    | actions + ano  | 0.63         | 0.00        | 0.00      | 0.07         | 0.00        | 0.27      | 0.63         | 1.00        | 1.00      | 0.65            | 0.50        | 1.00      |
| **Mistral-Small**  | actions        | 0.00         | 0.90        | 1.00      | 0.00         | 0.77        | 1.00      | 0.03         | 0.97        | 1.00      | 0.07            | 0.90        | 1.00      |
|                    | actions + ano  | 0.10         | 0.77        | 0.97      | 0.17         | 0.77        | 1.00      | 0.40         | 0.63        | 1.00      | 0.43            | 0.43        | 0.90      |
| **Deepseek-R1:7b** | actions        | N/A          | N/A         | N/A       | N/A          | N/A         | N/A       | N/A          | N/A         | N/A       | N/A             | N/A         | N/A       |
|                    | actions + ano  | N/A          | N/A         | N/A       | N/A          | N/A         | N/A       | N/A          | N/A         | N/A       | N/A             | N/A         | N/A       |
| **Deepseek-R1**    | actions        | 0.87         | 0.97        | 0.93      | 0.83         | 0.83        | 0.93      | 0.87         | 0.97        | 0.90      | 0.87            | 1.00        | 0.93      |
|                    | actions + ano  | 0.83         | 0.83        | 0.80      | 0.90         | 0.90        | 0.87      | N/A          | N/A         | N/A       | 0.83            | 0.90        | 0.80      |
| **Qwen3**          | actions        | 0.00         | 0.20        | 0.93      | 0.00         | 0.13        | 0.57      | 0.00         | 0.13        | 0.63      | 0.00            | 0.07        | 0.47      |
|                    | actions + ano  | 0.10         | 0.13        | 0.10      | 0.00         | 0.03        | 0.10      | 0.03         | 0.11        | 0.10      | 0.00            | 0.07        | 0.03      |
| **Qwen3:32b**      | actions        | 0.00         | 0.00        | 1.00      | 0.00         | 0.00        | 0.00      | 0.00         | 0.00        | 1.00      | 0.00            | 0.00        | 1.00      |
|                    | actions + ano  | 0.00         | 0.00        | 1.00      | 0.00         | 0.00        | 0.00      | 0.00         | 0.00        | 1.00      | 0.00            | 0.00        | 1.00      |


## Coordination

To assess the ability of generative agents to coordinate, we 
consider a simultaneous game in which a player earns a higher payoff
when they select the same course of action as the other player.

The Battle of the Sexes is a model of a coordination game, but one with
distributional conflict over which coordination point to choose.
Both players want to coordinate but prefer different outcomes.
This game models a couple deciding how to spend the evening. The woman prefers 
opera, the man prefers football. While both prefer to be together rather than apart,
each prefers their own event over the other's. The key tension lies in the fact
that mutual benefit comes from coordination, but disagreement exists over which
coordinated outcome is better.

The Battle of the Sexes is characterized by the following ordering of payoffs:
A > B > C (e.g., A = 3, B = 2, C = 0), where:
- A: The payoff for the player who gets their preferred outcome and is with the
  other — best individual and mutual outcome for them.
- B: The payoff for the player who compromises but is still together — second-best.
- C: The worst payoff when coordination fails — players go to different events.

| Woman\Man | Opera  | Football |
|-----------|--------|----------|
| Opera     | (A, B) | (C, C)   |
| Football  | (C, C) | (B, A)   |


This game has two pure-strategy Nash equilibria:
- (Opera, Opera): The woman's preferred coordination.
- (Football, Football): The man's preferred coordination.

It also has one mixed-strategy equilibrium, where each player randomizes over the two
options, placing more weight on their preferred event. While both
players want to coordinate, the disagreement over which coordinated outcome to
choose can make coordination unstable without communication or prior agreement.
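These equilibria can be verified for the example values A = 3, B = 2, C = 0. The following Python sketch enumerates the pure-strategy equilibria and derives the mixed one from the opponent's indifference condition; all names are illustrative and not part of PyGAAMAS.

```python
from fractions import Fraction
from itertools import product

A, B, C = 3, 2, 0  # example values from the text
ACTIONS = ("Opera", "Football")

# PAYOFFS[(woman_action, man_action)] = (woman_payoff, man_payoff)
PAYOFFS = {
    ("Opera", "Opera"): (A, B),
    ("Opera", "Football"): (C, C),
    ("Football", "Opera"): (C, C),
    ("Football", "Football"): (B, A),
}


def is_pure_nash(woman, man):
    """Neither player can gain by unilaterally switching action."""
    w_pay, m_pay = PAYOFFS[(woman, man)]
    woman_ok = all(PAYOFFS[(alt, man)][0] <= w_pay for alt in ACTIONS)
    man_ok = all(PAYOFFS[(woman, alt)][1] <= m_pay for alt in ACTIONS)
    return woman_ok and man_ok


pure_equilibria = [
    profile for profile in product(ACTIONS, ACTIONS) if is_pure_nash(*profile)
]

# Mixed equilibrium: the woman plays Opera with probability p chosen so that
# the man is indifferent between his two actions:
#   p*B + (1 - p)*C == p*C + (1 - p)*A   =>   p = (A - C) / (A + B - 2*C)
p_woman_opera = Fraction(A - C, A + B - 2 * C)
# By symmetry, the man plays Football with the same probability.
p_man_football = p_woman_opera
```

With these values each player chooses their preferred event with probability 3/5, so the two players end up at different events in a substantial share of plays, which illustrates why coordination is fragile without communication.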

### Agent-Human Coordination

To assess the agents’ ability to coordinate with a human-like strategy, 
we consider a multi-round version of the Battle of the Sexes game in which 
the opponent follows a hidden strategy: alternating between the two options.
In each round, the agent must predict
the opponent’s next move, earning 1 point for a correct prediction and 0 for an
incorrect one, and incorporate this prediction into its decision-making. The
game is played over N = 10 rounds, with the agent’s payoff and prediction accuracy
evaluated at each round. To avoid gender bias, we replace descriptive player labels 
and action labels with letters. This anonymized setup helps ensure that the agent’s 
choices reflect the underlying payoffs rather than social connotations tied to specific words. 

The first figure below presents the average prediction accuracy
per round, along with the 95% confidence interval. The second figure shows the
average points earned per round by each model. No model was able to generate a
valid strategy. The models failed to predict the opponent’s next move and, a
fortiori, to coordinate effectively. The models are failing to coordinate in the
Battle of the Sexes primarily because their prediction and reasoning mechanisms
do not correctly identify the opponent’s looping behavior. The model-generated
predictions tend to treat the opponent as responsive, random, or goal-seeking,
rather than as following a simple pattern. This mischaracterization leads the
models to overcomplicate what is actually a periodic strategy, attempting to
exploit or predict rational behavior instead of recognizing and adapting to the
underlying pattern.

![Prediction Accuracy per Round by Model (with 95% Confidence Interval)](figures/bos/bos_prediction.svg)

![Points Earned per Round by Model (with 95% Confidence Interval)](figures/bos/bos_payoff.svg)

### Agent-Agent Coordination

Cooper et al. (1989) report experimental results on the role of pre-play
communication in the Battle of the Sexes game. They find that communication
significantly increases the frequency of equilibrium play. One-way communication
is the most effective in resolving the coordination problem. Although two-way
communication introduces more potential for conflict, even a single round of
communication helps overcome some coordination difficulties, and three rounds
perform even better.

Cooper, Russell and DeJong, Douglas V and Forsythe, Robert and Ross, Thomas W.
Communication in the battle of the sexes game: some experimental results. The
RAND Journal of Economics, pp. 568--587, 1989.

To evaluate the ability of generative agents to coordinate with one another
under varying levels of communication, we paired each agent with another
generative agent powered by the same model, within the same 10-round version of
the Battle of the Sexes game used in prior experiments. Each experimental
condition was repeated 30 times, with the woman taking the initiative of the
communication in half of the games. To assess the effect of pre-game
communication (0, 1, 2, or 3 messages), we measured the players’ average
predictive accuracy and their payoff in each round.
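The README does not state how the 95% confidence intervals shown in the figures are computed. One common choice, assumed here purely for illustration, is a normal approximation over the repetitions of a given round. The function name and the sample data below are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev


def mean_with_ci95(samples):
    """Return (mean, half_width) of a 95% confidence interval.

    Assumption: normal approximation with z = 1.96 and the sample
    standard deviation; the actual figures may use another method.
    """
    m = mean(samples)
    half_width = 1.96 * stdev(samples) / sqrt(len(samples))
    return m, half_width


# Made-up payoffs for one round across 30 repetitions (not real results).
round_payoffs = [2] * 15 + [0] * 15
round_mean, round_half_width = mean_with_ci95(round_payoffs)
interval = (round_mean - round_half_width, round_mean + round_half_width)
```

With only 30 repetitions per condition, a Student t quantile in place of 1.96 would give slightly wider intervals.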

In the figures below, we focus on the <tt>Qwen3</tt> and <tt>GPT-4.5</tt>
models. Unlike other open-weight models, Qwen3 enables generative agents to
coordinate effectively—with or without communication. They quickly incorporate
their beliefs about the opponent’s behavior into their decision-making. In
contrast, <tt>GPT-4.5</tt> agents require several rounds to anticipate their
opponent. While pre-game communication slightly improves short-term
coordination, without a clear shared strategy, even communication fails to
produce effective alignment. Most generative agents fail to coordinate because
they lack a common strategy and struggle to align in games with multiple
equilibria. Communication worsens this issue by introducing ambiguity: language
models generate seemingly cooperative messages but do not consistently translate
them into coherent actions, leading to broken expectations and even weaker
coordination.

![Prediction Accuracy per Round by Model (with 95% Confidence Interval)](figures/nbos/nbos_prediction.svg)

![Points Earned per Round by Model (with 95% Confidence Interval)](figures/nbos/nbos_prediction.svg)

![Average Payoff per Round by Model (Initiator vs Responder, 95% CI)](figures/nbos/nbos_barchart.svg)

## Synthesis

Our findings reveal notable differences in the cognitive capabilities of LLMs across multiple dimensions of 
decision-making. <tt>Mistral-Small</tt> demonstrates the highest level of consistency in economic decision-making, 
with <tt>Llama3</tt> showing moderate adherence and <tt>DeepSeek-R1</tt> displaying considerable inconsistency. 
<tt>Qwen3</tt> performs moderately well, showing rational behavior but struggling with implicit reasoning.

<tt>GPT-4.5</tt>, <tt>Llama3</tt>, and <tt>Mistral-Small</tt> generally align well with declared preferences, 
particularly when generating algorithmic strategies rather than isolated one-shot actions. These models tend to 
struggle more with one-shot decision-making, where responses are less structured and more prone to inconsistency.
In contrast, <tt>DeepSeek-R1</tt> fails to generate valid strategies and performs poorly in aligning actions with 
specified preferences. <tt>Qwen3</tt> aligns well with utilitarian preferences and moderately with altruistic 
ones but struggles with egoistic and egalitarian preferences.

<tt>GPT-4.5</tt> and <tt>Mistral-Small</tt> consistently display rational behavior at both 
first- and second-order levels. <tt>Llama3</tt>, although prone to random behavior when generating strategies, 
adapts more effectively in one-shot decision-making tasks. <tt>DeepSeek-R1</tt> underperforms significantly 
in both strategic and one-shot formats, rarely exhibiting coherent rationality. <tt>Qwen3</tt> shows strong 
first-order rationality when producing actions, especially under explicit or guided conditions, 
but struggles with deeper inferential reasoning.

All models—regardless of size or architecture—struggle to anticipate or incorporate the behaviors of other agents 
into their own decisions. Despite some being able to identify patterns, most fail to translate these beliefs 
into optimal responses. Only <tt>Llama3.3:latest</tt> shows any reliable ability to infer and act on 
opponents’ simple behavior.

Whether generating actions or strategies, most LLMs tend to exhibit either rigid rationality, 
indiscriminate cooperation, or unstable and incoherent behavior. 
Except for <tt>Mistral-Small</tt>, the models do not achieve the desired combination of three criteria: 
the ability to adopt a role (behaving differently based on instructions), 
payoff sensitivity (adjusting behavior according to incentives), 
and semantic robustness (remaining unaffected by superficial label changes).

When it comes to coordination, most generative agents struggle to align their actions in games 
with multiple equilibria. This failure stems from an absence of shared strategies and a limited 
ability to model the opponent’s behavior accurately. Although communication is expected to improve 
coordination, it often introduces ambiguity instead—models generate cooperative-sounding messages 
that are not followed by consistent actions, leading to misaligned expectations and degraded coordination.
Only <tt>Qwen3</tt> shows reliable coordination behavior, swiftly incorporating beliefs about the opponent’s 
strategy even without communication. In contrast, models like <tt>GPT-4.5</tt> require several rounds to 
adjust and still often fail to converge on mutually beneficial strategies.


## License

This program is free software: you can redistribute it and/or modify it under
the terms of the GNU General Public License as published by the Free Software
Foundation, either version 3 of the License, or (at your option) any later
version.

This program is distributed in the hope that it will be useful, but WITHOUT ANY
WARRANTY; without even the implied warranty of MERCHANTABILITY or FITNESS FOR A
PARTICULAR PURPOSE. See the GNU General Public License for more details.

You should have received a copy of the GNU General Public License along with
this program. If not, see <http://www.gnu.org/licenses/>.
