CANARY: Zero-Label Detection of Fine-Tuning Contamination in Language Models

Swapnil Parekh

Published: 23 May 2026, Last Modified: 23 May 2026

Venue: ICML 2026 AIWILD

License: CC BY 4.0

Keywords: agent safety, fine-tuning attacks, hidden-state monitoring, sparse autoencoders, contamination detection, mechanistic interpretability, supply-chain auditing, anomaly detection, representation engineering, zero-label detection

TL;DR: CANARY detects harmful fine-tuning at <1% contamination (AUROC = 1.000) using two forward passes and a Sparse Autoencoder, 7.5x below the contamination level at which any output-level method fires, with no labels required.

Abstract: Adversaries can implant latent harmful behavior by poisoning as few as 1% of fine-tuning examples. The contamination is invisible to every output-level defense: harmful behavior lies dormant in the model's hidden-state geometry and does not appear in generated text until contamination exceeds 7.5%. We introduce CANARY (Contamination Auditor via Neural Activation Representation Yield), a zero-label checkpoint auditor that detects this hidden shift directly from two forward passes over an unlabeled prompt set. CANARY projects the hidden-state difference through a Sparse Autoencoder, filtering style noise to isolate meaningful semantic drift. It achieves AUROC = 1.000 at 1% contamination (95% CI = [0.997, 1.000]; Cohen's d = 3.28) across four model architectures and two training paradigms, 7.5x below where any output-level method fires, with zero false positives on benign fine-tuning and full robustness to style-matching and gradient-noise adaptive attacks. The same SAE feature basis drives a complete governance pipeline: SAE-filtered amplification surfaces latent harm at a 5x higher rate than standard generation; score-ranked prompts yield 4.2x red-teaming lift; and suppressing a handful of contamination-specific features at inference time reduces harm from 70% to 10% with no perplexity penalty. CANARY is the first zero-label framework to detect, verify, prioritize, and remediate supply-chain contamination from hidden states alone.
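The scoring step described in the abstract (two forward passes, a hidden-state difference, and a projection through a Sparse Autoencoder) can be illustrated with a minimal sketch. Everything specific here is an assumption made for illustration and is not taken from the paper: the one-layer ReLU encoder, the names `relu_encode` and `drift_score`, the choice of the mean L2 norm as the aggregate, and the toy weights and hidden states. In practice the hidden states would come from running the base and fine-tuned checkpoints on the same unlabeled prompts.

```python
import math

def relu_encode(x, W_enc, b_enc):
    """Hypothetical one-layer SAE encoder: f = max(0, W_enc @ x + b_enc)."""
    return [max(0.0, sum(w * xi for w, xi in zip(row, x)) + b)
            for row, b in zip(W_enc, b_enc)]

def drift_score(h_base, h_tuned, W_enc, b_enc):
    """Mean L2 norm of the SAE features of per-prompt hidden-state differences.

    h_base and h_tuned hold one hidden-state vector per prompt, taken from the
    base and fine-tuned checkpoints respectively (the two forward passes).
    The aggregation rule is an assumption, not the paper's stated statistic.
    """
    norms = []
    for hb, ht in zip(h_base, h_tuned):
        delta = [t - b for t, b in zip(ht, hb)]       # hidden-state difference
        feats = relu_encode(delta, W_enc, b_enc)      # project through the SAE
        norms.append(math.sqrt(sum(f * f for f in feats)))
    return sum(norms) / len(norms)

# Toy setup (all values invented): hidden size 3, two SAE features.
# The encoder reads dimensions 0 and 1 and ignores dimension 2, standing in
# for "style" variation that the SAE basis filters out.
W_enc = [[1.0, 0.0, 0.0],
         [0.0, 1.0, 0.0]]
b_enc = [-0.1, -0.1]

h_base     = [[0.2, 0.1, 0.5], [0.0, 0.3, 0.4]]
h_benign   = [[0.2, 0.1, 0.9], [0.0, 0.3, 0.1]]   # shifts only in dimension 2
h_poisoned = [[0.7, 0.1, 0.5], [0.0, 0.9, 0.4]]   # shifts in dimensions 0 and 1

benign_score = drift_score(h_base, h_benign, W_enc, b_enc)
poisoned_score = drift_score(h_base, h_poisoned, W_enc, b_enc)
print(benign_score, poisoned_score)
```

In this toy setup the benign checkpoint moves only along a direction the encoder ignores, so its score is zero, while the poisoned checkpoint activates SAE features and scores higher. A real audit would compare such scores across checkpoints rather than rely on a fixed threshold chosen from this example.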

Track: Regular Paper (9 pages)

Email Sharing: We authorize the sharing of all author emails with Program Chairs.

Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.

Submission Number: 166
