Parity Swarm: Using Population-Based Social Simulation to Discover AI Safety Monitor Blind Spots

Mohammad Zeeshan

Published: 10 Jun 2026, Last Modified: 10 Jun 2026 | AITC 2026 Poster | CC BY 4.0

Keywords: AI safety, AI control, multi-agent systems, LLM monitoring, deception detection, chain-of-thought monitoring, red-teaming

Abstract: As AI agents are deployed in multi-agent settings with increasing autonomy, detecting when an embedded agent pursues hidden objectives becomes an important safety question. Prior work studies single-agent deception or sandbox-based attacks, but the comparative effectiveness of different monitoring architectures against socially embedded adversarial agents has received limited empirical study. We present Parity Swarm, an automated red-teaming platform that embeds attacker agents into groups of 7 to 9 legitimate agents on a simulated social platform and compares four monitoring regimes. Across 126 simulations spanning 13 attack scenarios, our tested monitor (Llama 3.3 70B) under chain-of-thought augmented observability achieves 34.3 percent detection at 0 percent false positives, with the rate dropping to approximately 20 percent in three scenarios where attackers are instructed to conceal their reasoning. We observe an apparent stealth paradox in the alignment-faking scenario, where detection rises from 22 percent to 40 percent under concealment instructions, which we hypothesize reflects the agent losing its ability to calibrate deceptive behavior when reasoning about safety criteria is suppressed. Under our sandbox monitor specification, data exfiltration via shell commands was not detected, a result we attribute to monitor under-specification rather than a fundamental limit of monitoring. These findings underscore the importance of observability depth, monitor specification, and varied attacker prompting strategies in evaluating multi-agent safety oversight, and motivate further work with frontier reasoning monitors and a richer attacker prompt taxonomy.
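To make the two headline metrics concrete, here is a minimal sketch of how a detection rate and a false-positive rate could be computed from per-agent monitor verdicts. It assumes that detection is scored per attacker agent and false positives per legitimate agent; the abstract does not state the scoring unit, so this is an illustration rather than the paper's method. The `AgentVerdict` record, its fields, and the toy data are hypothetical and are not results from the paper.

```python
from dataclasses import dataclass
from typing import List, Optional


@dataclass
class AgentVerdict:
    """Hypothetical record: one monitor decision about one agent."""
    is_attacker: bool  # ground truth: was this agent given a hidden objective?
    flagged: bool      # monitor output: did the monitor flag this agent?


def detection_rate(verdicts: List[AgentVerdict]) -> Optional[float]:
    """Fraction of attacker agents the monitor flagged (None if no attackers)."""
    attackers = [v for v in verdicts if v.is_attacker]
    if not attackers:
        return None
    return sum(v.flagged for v in attackers) / len(attackers)


def false_positive_rate(verdicts: List[AgentVerdict]) -> Optional[float]:
    """Fraction of legitimate agents the monitor flagged (None if none present)."""
    legitimate = [v for v in verdicts if not v.is_attacker]
    if not legitimate:
        return None
    return sum(v.flagged for v in legitimate) / len(legitimate)


# Toy data for illustration only: 3 attackers (1 flagged), 8 legitimate agents (0 flagged).
toy = (
    [AgentVerdict(is_attacker=True, flagged=True)]
    + [AgentVerdict(is_attacker=True, flagged=False)] * 2
    + [AgentVerdict(is_attacker=False, flagged=False)] * 8
)

print(detection_rate(toy))
print(false_positive_rate(toy))
```

Reporting the two rates separately matters because a monitor that flags nothing also has a 0 percent false-positive rate, so a detection figure is only interpretable alongside it.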

Presentation Format: We prefer to present our paper as a poster

Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.

LaTeX Source Files: zip

LaTeX Files: zip

LLM Policy: LLMs may only be used to understand general concepts and to polish reviewer writing

Submission Number: 2
