Detecting Whether an LLM has been Backdoored

Anthony Hughes; Nicole Xing; Andy Kim; Collin Francel; Andrew Draganov

Published: 11 Jun 2026, Last Modified: 11 Jun 2026. Mech Interp Workshop ICML 2026 Virtual poster. License: CC BY 4.0

Keywords: Benchmarking Interpretability, Interpretability tooling and software, Concept Discovery (e.g., SAEs, dictionary learning)

TL;DR: We provide a suite of backdoored model organisms aimed at studying how defenders can elicit backdoors.

Abstract: As language models are deployed in high-stakes domains, adversaries may poison training data to implant *backdoors*: hidden triggers that covertly manipulate model behavior at inference time. In this work, we formalize the affordances available to a defender and, to evaluate whether defenders can identify backdoors under these affordances, construct a benchmark for backdoor-detection algorithms. This benchmark spans attack mechanisms and objectives, including an adversarial backdoor explicitly designed to evade detection. We use this benchmark to evaluate a suite of backdoor-elicitation hypotheses. We find that while some techniques can flag poisoned models, none reliably surface backdoors. Indeed, hunting for backdoors in poisoned models is likely to surface jailbreaks instead. Finally, we show that backdoor-related activation vectors are consistently different from the vectors that account for undesirable behaviors in the absence of triggers. We release our benchmark to motivate the interpretability community to develop stronger algorithms for eliciting backdoors.

Submission Number: 667
