Benchmarking Anomaly Detection for Large Language Model Alignment

Authors: ICLR 2026 Conference Submission 19945 Authors (anonymous)

Published: 19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · License: CC BY 4.0
Keywords: anomaly detection, AI safety, LLMs, AI alignment, OOD detection
Abstract: Many safety and alignment failures of large language models (LLMs) occur due to anomalous situations: unusual prompts or response patterns that are unforeseen by model developers. Anomaly detection is a promising tool to mitigate these failure modes caused by unknown unknowns; an anomaly detector monitoring a deployed LLM could shut it down or restrict user access in highly unusual situations. We introduce the first anomaly detection benchmark for LLM misalignment, MAAD (Mis-Alignment Anomaly Detection). Benchmarking detection of unforeseen alignment failures is difficult because LLMs are already trained on an extremely broad range of alignment data. Our key insight is that we can force certain known alignment failure modes to remain unseen by explicitly restricting the post-training data that anomaly detection methods can use within MAAD. For example, MAAD tests whether a detector can recognize deception about tool call results without any examples of such deception in the detector's post-training data. We use MAAD to evaluate a number of anomaly detection baselines, including prompting an LLM to ask if a conversation is unusual, measuring the perplexity of prompts and responses, and calculating the Mahalanobis distance of the internal representations of an LLM. We find that perplexity and Mahalanobis distance based detectors perform the best among these baselines, but no method performs at a high level across all failure modes. Our work motivates anomaly detection as an approach to LLM safety and provides a concrete benchmark to measure progress on this important problem. Code and data are available at https://anonymous.4open.science/r/reward-uncertainty-bench-4D66.
Primary Area: alignment, fairness, safety, privacy, and societal considerations
Submission Number: 19945