Keywords: LLM, Agents, Monitoring, Alignment, AI Control, Jailbreaking
TL;DR: Our work shows that off-the-shelf LLM monitors struggle to reliably detect malicious intent hidden within seemingly benign subtasks, highlighting significant limitations and the need for more specialized monitoring techniques.
Abstract: Monitoring Large Language Model (LLM) agents is critical for detecting and mitigating catastrophic risk in real-world applications. Performing such monitoring is particularly difficult since the harm caused by the agent may be sequentially contextual. This means that monitoring individual instructions or actions executed by the agent is not enough to identify the harm. Instead, sequentially contextual harm can only be identified by analyzing the composition of multiple instructions or actions. In this work, we first demonstrate such a risk in agent settings by decomposing harmful tasks into individually (seemingly) benign subtasks --- the refusal rate goes down significantly (e.g., from 50% to 10% for GPT-4o) while maintaining a high task completion rate thus motivating the need for external monitors. We holistically evaluate off-the-shelf LLMs as monitors that aim to infer malicious intent from these seemingly benign subtasks. To facilitate our study, we curate 50 unique agent tasks, covering 8 categories, including disinformation, fraud, and harassment. Our experiments show that frontier models as monitors can predict binary intentions (malicious vs benign), achieving up to 86% accuracy, and also infer user intent in natural language. However, these off-the-shelf LLM monitors are not infallible. We find that: (1) there is a significant gap in monitor accuracy when judging seemingly benign subtasks versus directly judging the high-level harmful instructions; (2) unrelated benign subtasks can be injected into the sequence of subtasks to mask malicious intent further, resulting in drastically degraded monitoring accuracy; (3) basic prompt engineering techniques or employing an ensemble of LLM monitors does not reliably improve monitoring performance; and (4) more capable models do not naturally yield better monitoring ability. In summary, our work empirically shows the risk of sequentially contextual harm in LLM agent settings and discovers significant limitations when using frontier models as monitors. Based on these results, we call for specialized training approaches to develop more robust agent monitoring systems.
Submission Type: Long Paper (9 Pages)
Archival Option: This is a non-archival submission
Presentation Venue Preference: ICLR 2025
Submission Number: 44
Loading