Trust-Aware Reinforcement Learning Agents in the Iterated Prisoners’ Dilemma: Integrating MCTS and UCT for Optimal Cooperation

Kevin Babashov; Maria Gini

Published: 03 Jun 2026, Last Modified: 03 Jun 2026, ALA 2026, CC BY 4.0

Keywords: Trust, Iterated Prisoner’s Dilemma, Cooperation, TRAVOS, MCTS, GNN

TL;DR: We study whether explicit computational trust improves learning and decision-making in the Iterated Prisoner’s Dilemma (IPD) under heterogeneous and deceptive opponents.

Abstract: We study whether explicit computational trust improves learning and decision-making in the Iterated Prisoner’s Dilemma (IPD) under heterogeneous and deceptive opponents. We implement five trust mechanisms: Personal (direct experience), TRAVOS-like (direct plus discounted witness reports), Hearsay (witness-only), Bayesian belief-based (latent-type inference), and Adversarial (malicious reporting). These models connect to action selection through a trust-conditioned control interface, in which trust estimates are used both as state features and as a bounded value-shaping term. Optionally, a Graph Neural Network (GNN) propagates indirect trust over an interaction graph, and Monte Carlo Tree Search (MCTS) with UCT provides look-ahead action values. We evaluate these variants against 47 established opponent strategies spanning deterministic, stochastic, probing, evolutionary, group-aware, and deceptive behaviors. Each pairing is played for 25 rounds and averaged over 5 independent seeds. We report cumulative wealth, stability (wealth variance across opponents), and resilience on deceptive and probing subsets. Seed-averaged results show that TRAVOS-like and Hearsay achieve the highest mean wealth (63.319), followed by Personal Trust (61.387), Bayesian Type (53.557), and Adversarial (45.209). In planned Welch unequal-variance 𝑡-tests with Holm-Bonferroni correction on representative comparisons, three comparisons remain significant after correction at 𝛼 = 0.05: TRAVOS-like vs. Adversarial, Hearsay vs. Adversarial, and Personal Trust vs. Adversarial.
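
The Holm-Bonferroni correction named in the abstract can be illustrated with a minimal sketch. It implements the standard step-down procedure only; the function name `holm_bonferroni` and the p-values in the usage line are hypothetical and are not taken from the paper's results or code.

```python
def holm_bonferroni(p_values, alpha=0.05):
    """Standard Holm-Bonferroni step-down procedure.

    Returns a list of booleans aligned with p_values: True where the
    corresponding null hypothesis is rejected at family-wise level alpha.
    """
    m = len(p_values)
    # Indices of the p-values, sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: p_values[i])
    reject = [False] * m
    for rank, idx in enumerate(order):
        # The k-th smallest p-value (rank = k - 1) is compared to alpha / (m - k + 1).
        threshold = alpha / (m - rank)
        if p_values[idx] <= threshold:
            reject[idx] = True
        else:
            # Once one test fails, all larger p-values are retained as well.
            break
    return reject


# Hypothetical p-values for four planned comparisons (illustration only).
example = holm_bonferroni([0.001, 0.04, 0.03, 0.005])
print(example)
```

Because the procedure stops at the first p-value that exceeds its threshold, a comparison with a raw p-value below 𝛼 can still fail to reach corrected significance.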

Journal Edition Interest: Yes

Supplementary Material: pdf

Submission Number: 28
