Track: Long Paper Track (up to 9 pages)
Keywords: scalable oversight, backdoor detection, mechanistic interpretability, outlier detection, anomaly detection
TL;DR: We test detectors for finding prompts that elicit anomalous behaviour in a collection of different models; our detectors are effective on some models but not all of them.
Abstract: As LLMs grow in capability, the task of supervising LLMs becomes more challenging. Supervision failures can occur if LLMs are sensitive to factors that supervisors are unaware of. We investigate __Mechanistic Anomaly Detection__ (MAD) as a technique to augment supervision of capable models; we use internal model features to identify anomalous training signals so they can be investigated or discarded. We train detectors to flag points from the test environment that differ substantially from the training environment, and experiment with a large variety of detector features and scoring rules to detect anomalies in a set of "quirky" language models. We find that detectors can achieve high discrimination on some tasks, but no detector is effective across all models and tasks. MAD techniques may be effective in low-stakes applications, but advances in both detection and evaluation are likely needed if they are to be used in high-stakes settings.
Submission Number: 99