CANARY: Zero-Label Detection of Fine-Tuning Contamination in Language Models
Keywords: agent safety, fine-tuning attacks, hidden-state monitoring, sparse autoencoders, contamination detection, mechanistic interpretability, supply-chain auditing, anomaly detection, representation engineering, zero-label detection
TL;DR: CANARY detects harmful fine-tuning at 1% contamination (AUROC = 1.000) using two forward passes and a Sparse Autoencoder, a contamination rate 7.5x lower than the 7.5% at which output-level methods first fire, with no labels required.
Abstract: Adversaries can implant latent harmful behavior by poisoning as few as 1% of
fine-tuning examples. The contamination is invisible to every output-level
defense: harmful behavior lies dormant in the model's hidden-state geometry
and does not appear in generated text until contamination exceeds 7.5%.
We introduce CANARY (Contamination Auditor via Neural Activation Representation
Yield), a zero-label checkpoint auditor that detects this hidden shift directly
from two forward passes over an unlabeled prompt set.
CANARY projects the hidden-state difference through a Sparse Autoencoder,
filtering style noise to isolate meaningful semantic drift.
It achieves AUROC = 1.000 at 1% contamination
(95% CI = [0.997, 1.000]; Cohen's d = 3.28)
across four model architectures and two training paradigms, at a contamination
rate 7.5x lower than the 7.5% at which output-level methods first fire, with
zero false positives on benign fine-tuning and full robustness to
style-matching and gradient-noise adaptive attacks.
The same SAE feature basis drives a complete governance pipeline:
SAE-filtered amplification surfaces latent harm at a 5x higher rate than
standard generation; score-ranked prompts yield a 4.2x red-teaming lift; and
suppressing a handful of contamination-specific features at inference time
reduces harm from 70% to 10% with no perplexity penalty.
CANARY is the first zero-label framework to detect, verify, prioritize, and
remediate supply-chain contamination from hidden states alone.
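
As an illustration of the scoring step described in the abstract, the following minimal sketch computes a drift score from paired hidden states and then measures how well that score separates contaminated from benign checkpoints. It is a sketch under stated assumptions, not the authors' implementation: the function names, the ReLU encoder form, the `style_features` mask (a hypothetical set of SAE feature indices treated as style noise), and the mean-of-norms aggregation are all assumptions made here, since the abstract does not specify them. Hidden states are passed in as plain lists; in practice they would come from one forward pass through the base model and one through the fine-tuned checkpoint.

```python
import math


def sae_encode(vec, w_enc, b_enc):
    """Sparse features of `vec`, assuming a ReLU encoder: ReLU(W_enc @ vec + b_enc)."""
    return [max(0.0, sum(w * x for w, x in zip(row, vec)) + b)
            for row, b in zip(w_enc, b_enc)]


def drift_score(base_hidden, tuned_hidden, w_enc, b_enc, style_features=frozenset()):
    """Mean L2 norm of the SAE-encoded hidden-state difference over all prompts.

    base_hidden / tuned_hidden: one hidden vector per prompt, from the two passes.
    style_features: hypothetical indices of features to drop as style noise.
    """
    norms = []
    for h_base, h_tuned in zip(base_hidden, tuned_hidden):
        delta = [t - b for t, b in zip(h_tuned, h_base)]
        feats = sae_encode(delta, w_enc, b_enc)
        kept = [f for i, f in enumerate(feats) if i not in style_features]
        norms.append(math.sqrt(sum(f * f for f in kept)))
    return sum(norms) / len(norms)


def auroc(pos, neg):
    """Probability that a contaminated score exceeds a benign one (ties count 0.5)."""
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


def cohens_d(a, b):
    """Standardized mean difference using the pooled sample standard deviation."""
    mean_a, mean_b = sum(a) / len(a), sum(b) / len(b)
    var_a = sum((x - mean_a) ** 2 for x in a) / (len(a) - 1)
    var_b = sum((x - mean_b) ** 2 for x in b) / (len(b) - 1)
    pooled = math.sqrt(((len(a) - 1) * var_a + (len(b) - 1) * var_b)
                       / (len(a) + len(b) - 2))
    return (mean_a - mean_b) / pooled
```

Given drift scores for a set of contaminated checkpoints and a set of benign ones, `auroc(contaminated, benign)` and `cohens_d(contaminated, benign)` give the two separation statistics quoted above. No labels enter `drift_score` itself; labels are needed only to evaluate the detector.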
Track: Regular Paper (9 pages)
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 166