Model Diffing without Borders: Unlocking Cross-Architecture Model Diffing to Reveal Hidden Ideological Alignment in Llama and Qwen

Published: 30 Sept 2025, Last Modified: 30 Sept 2025 · Mech Interp Workshop (NeurIPS 2025) Spotlight · CC BY 4.0
Keywords: Sparse Autoencoders, AI Safety, Applications of interpretability
Other Keywords: Crosscoders, Sparse Autoencoders, Model Diffing
TL;DR: This paper demonstrates the first successful "diff" between architecturally different AI models (Llama-3.1-8B vs Qwen3-8B), uncovering hidden ideological alignment features in both models.
Abstract: As AI models proliferate with diverse architectures and training procedures, ensuring their safety requires understanding what changed between models: knowing which features were added or modified enables targeted safety audits rather than exhaustive analysis of every model from scratch. However, existing model diffing methods typically require identical architectures, limiting comparisons to base models and their fine-tunes. While crosscoders were introduced to bridge different architectures by learning a shared feature dictionary, their cross-architecture potential has remained undemonstrated. This paper works towards making cross-architecture model diffing practical for AI safety applications by demonstrating the first model diff between architecturally distinct models: Llama-3.1-8B-Instruct and Qwen3-8B. To achieve this, we introduce Dedicated Feature Crosscoders (DFCs), a simple architectural modification that encourages discovery of model-exclusive features by partitioning the feature dictionary. The resulting cross-architecture diff reveals ideological alignment features exclusive to each model that causally control censorship behaviors, alignment with Chinese state narratives, or promotion of American exceptionalism. These results show that cross-architecture crosscoder model diffing is not only possible but can uncover hidden behaviors that might otherwise remain undetected in standard evaluations, demonstrating its potential for identifying safety-relevant differences across the growing ecosystem of diverse AI models.
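The abstract describes DFCs as partitioning the crosscoder's feature dictionary so that some latents are shared and others are dedicated to a single model. Below is a minimal, hypothetical sketch of that idea; the class name, dimensions, masking scheme, and ReLU/linear choices are illustrative assumptions and not the authors' released implementation.

```python
# Hypothetical sketch of a dedicated-feature crosscoder over two models.
# Assumptions (not from the paper): layer choice, dimensions, naming,
# ReLU activation, and hard decoder masks for the dedicated blocks.
import torch
import torch.nn as nn

class DedicatedFeatureCrosscoder(nn.Module):
    """Crosscoder whose dictionary is split into shared latents plus latents
    dedicated to model A or model B. Dedicated latents decode only into their
    own model's activation space, nudging model-exclusive features there."""

    def __init__(self, d_a: int, d_b: int, n_shared: int, n_dedicated: int):
        super().__init__()
        self.n_shared, self.n_dedicated = n_shared, n_dedicated
        n_total = n_shared + 2 * n_dedicated
        # Joint encoder reads concatenated activations from both models.
        self.encoder = nn.Linear(d_a + d_b, n_total)
        # Separate decoders back into each model's activation space.
        self.decoder_a = nn.Linear(n_total, d_a)
        self.decoder_b = nn.Linear(n_total, d_b)

    def _masks(self, device: torch.device):
        s, d = self.n_shared, self.n_dedicated
        m_a = torch.ones(s + 2 * d, device=device)
        m_b = m_a.clone()
        m_a[s + d:] = 0.0      # B-dedicated latents never decode into model A
        m_b[s:s + d] = 0.0     # A-dedicated latents never decode into model B
        return m_a, m_b

    def forward(self, act_a: torch.Tensor, act_b: torch.Tensor):
        # Sparse latent code over the partitioned dictionary.
        z = torch.relu(self.encoder(torch.cat([act_a, act_b], dim=-1)))
        m_a, m_b = self._masks(z.device)
        recon_a = self.decoder_a(z * m_a)
        recon_b = self.decoder_b(z * m_b)
        return recon_a, recon_b, z
```

Under these assumptions, the crosscoder would be trained with the usual reconstruction-plus-sparsity objective; the hard masks simply make it impossible for a dedicated latent to explain the other model's activations, so features present in only one model have a natural place to land.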
Submission Number: 261