Paranoid Monitors: How Long Context Breaks LLM Agent Supervision

Published: 28 Feb 2026, Last Modified: 04 Apr 2026 · CAO Poster · CC BY 4.0
Keywords: AI safety, agent monitoring, long-context evaluation, factored cognition, AI control, LLM agents, safety benchmarks
TL;DR: LLM monitors become paranoid in long contexts, losing the ability to distinguish malicious from benign agent behavior; a two-stage ledger approach that decouples state tracking from judgment restores calibration at 50k+ tokens.
Abstract: Monitoring LLM agents in long-context environments remains an unsolved challenge. We demonstrate that current monitoring approaches rely on suspicion heuristics that degrade as context length increases. Through evaluation of synthetically augmented trajectories from the ShadeArena bank transfer task, we show that standard monitors suffer from paranoia, assigning higher suspicion scores to benign trajectories than malicious ones in contexts exceeding 30k tokens. We quantify this failure using the suspicion gap, defined as the difference in mean suspicion scores assigned to malicious versus benign trajectories. To address this, we propose ledger-based monitoring, a two-stage framework grounded in factored cognition that decouples fact extraction from judgment. Our method restores calibration across all tested models, improving the suspicion gap from approximately 0 to greater than 1.5 in 50k+ token contexts.
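The abstract's central metric, the suspicion gap, is the mean suspicion score a monitor assigns to malicious trajectories minus the mean it assigns to benign ones. A minimal sketch of that computation, with illustrative function and variable names not taken from the paper:

```python
# Hypothetical sketch of the "suspicion gap" metric from the abstract:
# mean suspicion on malicious trajectories minus mean on benign ones.
# Names here are illustrative assumptions, not the paper's code.
from statistics import mean

def suspicion_gap(malicious_scores, benign_scores):
    """A calibrated monitor yields a clearly positive gap; a 'paranoid'
    monitor that flags everything drives the gap toward zero or below."""
    return mean(malicious_scores) - mean(benign_scores)

# Toy illustration: a calibrated monitor vs. a paranoid one.
calibrated_gap = suspicion_gap([8, 9, 7], [2, 3, 1])  # large positive gap
paranoid_gap = suspicion_gap([4, 5, 4], [5, 6, 5])    # gap near zero or negative
```

Under this framing, the paper's reported improvement is the gap moving from roughly 0 (paranoid regime) to above 1.5 after ledger-based monitoring is applied in 50k+ token contexts.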
Submission Number: 21