Keywords: AI Safety, Alignment, Auditing, LLM
TL;DR: We audit fine-tuned models for hidden behaviors by diffing them against their base models.
Abstract: Alignment auditing, the investigation of AI systems for hidden or unintended behaviors, is a key challenge for the safe deployment of frontier models. While recent work has explored comparing a fine-tuned model to its base, these methods fail to isolate the unusual behavioral differences that auditing seeks. We introduce two model diffing methods for auditing fine-tuned models: SVD rank truncation, a white-box method that isolates implanted behaviors by projecting weight-difference matrices onto their dominant singular direction, revealing that behavioral changes induced by fine-tuning are geometrically concentrated; and adversarial decoding, a black-box method that amplifies contrastive logit differences between a fine-tuned model and a reference, exposing behavior-relevant tokens suppressed below the sampling threshold in normal generation. We evaluate both methods on AuditBench, a benchmark of 56 language model organisms, spanning 14 implanted behaviors, that are trained to resist confession. SVD rank truncation achieves substantial improvements over previous state-of-the-art methods on models trained by synthetic document fine-tuning, but remains near baseline on transcript-distilled model organisms. Adversarial decoding matches this performance and generalizes to settings without base model access by using a safety-prompted reference, suggesting that fine-tuning suppresses safety-relevant tokens in a recoverable way. Together, these results suggest that model diffing is an effective technique for behavioral auditing.
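The two methods named in the abstract can be illustrated with a minimal NumPy sketch. This is not the authors' implementation: the function names, the `alpha` amplification weight, and the specific amplification formula (fine-tuned logits plus a scaled difference from the reference) are assumptions chosen for illustration, and the toy matrices stand in for real model weights and logits.

```python
import numpy as np


def rank_truncate_delta(w_base, w_ft, rank=1):
    """Keep only the top `rank` singular directions of the weight difference.

    With rank=1 this projects the difference onto its dominant singular
    direction, as the abstract describes for SVD rank truncation.
    """
    delta = w_ft - w_base
    u, s, vt = np.linalg.svd(delta, full_matrices=False)
    # Scale the leading left singular vectors by their singular values,
    # then recombine with the leading right singular vectors.
    return (u[:, :rank] * s[:rank]) @ vt[:rank, :]


def amplified_logits(logits_ft, logits_ref, alpha=2.0):
    """Amplify the contrast between fine-tuned and reference logits.

    Assumed formula: logits_ft + alpha * (logits_ft - logits_ref).
    `alpha` is a hypothetical hyperparameter, not taken from the source.
    """
    return logits_ft + alpha * (logits_ft - logits_ref)


# Toy weights: a rank-1 "implanted" update plus small dense noise.
rng = np.random.default_rng(0)
w_base = rng.normal(size=(8, 6))
implanted = np.outer(rng.normal(size=8), rng.normal(size=6))
noise = 0.01 * rng.normal(size=(8, 6))
w_ft = w_base + implanted + noise

delta_1 = rank_truncate_delta(w_base, w_ft, rank=1)
w_audit = w_base + delta_1  # base model plus only the dominant direction

# Toy logits over a 3-token vocabulary: token 2 is boosted by fine-tuning
# but still ranks last under the fine-tuned model alone.
logits_ft = np.array([2.0, 1.0, 0.5])
logits_ref = np.array([2.0, 1.0, -1.5])
logits_amp = amplified_logits(logits_ft, logits_ref, alpha=2.0)
top_ft = int(np.argmax(logits_ft))
top_amp = int(np.argmax(logits_amp))
```

In this toy setup, the rank-1 truncation recovers the implanted update up to the noise level, and the amplified logits promote a token that ordinary decoding from the fine-tuned model would rarely select. In a black-box setting, `logits_ref` could come from a safety-prompted reference rather than the base model, as the abstract suggests.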
Email Sharing: We authorize the sharing of all author emails with Program Chairs.
Data Release: We authorize the release of our submission and author names to the public in the event of acceptance.
Submission Number: 364