Flag Game: Interpreting Decision Mechanisms of Bounded Social Agents

Published: 03 Jun 2026, Last Modified: 03 Jun 2026, Venue: AI4GOOD Workshop 2026 Regular, License: CC BY 4.0
Keywords: multi-agent systems, collective decision-making, pluralistic alignment, consensus, polarization, social learning, mechanistic interpretability
Abstract: Every bounded agent in a multi-agent system must balance its private evidence against social input from peers. We make this balance experimentally observable with the Flag Game, a grounded synthetic task in which a hidden country flag defines verifiable ground truth, each agent observes only a private crop, agents communicate under a specified protocol, and the system outputs a final country distribution. Despite its simplicity, the task reproduces nontrivial collective phenomena (non-monotonic population scaling, gains from social-awareness prompting and model diversity, and polarization) while remaining diagnosable at the mechanism level. We explain these phenomena with a unifying framework, building on Quantized Simplex Gossip (QSG), that traces them to two facets of the same private-social tension: how much private evidence the population collectively holds, and how each agent integrates social input. Methodologically, the Flag Game extends the toy-model strategy of mechanistic interpretability to multi-agent systems: a controlled synthetic task where rich phenomenology can be both discovered and mechanistically dissected, before moving to open-ended domains where truth is harder to verify and failure is harder to interpret.
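The private-versus-social balance described in the abstract can be illustrated with a small toy simulation. This is only a sketch under stated assumptions, not the authors' protocol and not QSG itself. There are no images, no language models, and no quantization. Each agent's crop is replaced by a hypothetical `evidence_strength` that puts extra probability mass on the true country, and social integration is modeled as a plain convex mix controlled by a hypothetical `social_weight`. All function and parameter names are illustrative.

```python
import random


def normalize(weights):
    """Scale non-negative weights so they sum to 1 (a point on the simplex)."""
    total = sum(weights)
    return [w / total for w in weights]


def private_belief(true_idx, n_countries, evidence_strength):
    """Assumed stand-in for a crop: uniform prior plus extra mass on the truth."""
    weights = [1.0] * n_countries
    weights[true_idx] += evidence_strength
    return normalize(weights)


def gossip_round(beliefs, privates, social_weight, rng):
    """Each agent mixes its private belief with one random peer's current belief."""
    n = len(beliefs)
    updated = []
    for i in range(n):
        peer = rng.choice([k for k in range(n) if k != i])
        mixed = [
            (1.0 - social_weight) * p + social_weight * q
            for p, q in zip(privates[i], beliefs[peer])
        ]
        updated.append(normalize(mixed))
    return updated


def run_flag_game(n_agents=5, n_countries=4, true_idx=2,
                  rounds=10, social_weight=0.5, seed=0):
    """Return the population-averaged country distribution after gossip.

    Requires n_agents >= 2 so that every agent has a peer to listen to.
    """
    rng = random.Random(seed)
    # Assumption: agents differ in how informative their private crop is.
    strengths = [rng.uniform(0.0, 2.0) for _ in range(n_agents)]
    privates = [private_belief(true_idx, n_countries, s) for s in strengths]
    beliefs = [list(p) for p in privates]
    for _ in range(rounds):
        beliefs = gossip_round(beliefs, privates, social_weight, rng)
    return [sum(b[c] for b in beliefs) / n_agents for c in range(n_countries)]


if __name__ == "__main__":
    print(run_flag_game())
```

Setting `social_weight` to 0 makes every agent ignore its peers, and values near 1 make agents mostly echo each other. Varying `n_agents` and the spread of `evidence_strength` loosely corresponds to the two facets the abstract names: how much private evidence the population holds, and how each agent integrates social input.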
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 504