LoRAcles: Self-Supervised Weight-Space Interpretability at Scale

Celeste De Schamphelaere; Jan Philipp Bauer; Neel Nanda; Euan Ong

Published: 11 Jun 2026, Last Modified: 22 Jun 2026 · Mech Interp Workshop ICML 2026 Poster · CC BY 4.0

Keywords: Concept Discovery (e.g., SAEs, dictionary learning), Methods (probing, steering, causal interventions), Interpretability for AI Safety

Other Keywords: Scalable-End-To-End-Interpretability

TL;DR: We introduce LoRAcles: fine-tuned language models that take LoRA adapter weights as input and answer natural language questions about them.

Abstract: Fine-tuned LLMs can learn complex and subtle behaviors, but it can be difficult to ascertain which behaviors a model acquired during fine-tuning. We introduce LoRAcles: fine-tuned language models that take LoRA adapter weights as input and answer natural language questions about them. We train LoRAcles with a self-supervised pipeline: we first train LoRAs on small sets of pre-training documents, and then train LoRAcles on these LoRAs to answer questions about the documents. LoRAcles generalize far beyond their training data: they are state-of-the-art on AuditBench, a benchmark of models with sophisticated hidden behaviors; they can detect subtle changes in model organisms such as subliminal learning; and they are the first tool to achieve non-trivial performance at verbalizing semantic backdoor triggers. LoRAcles can also describe some learned behaviors from the weights of a full-parameter fine-tune. LoRAcles are highly scalable: we train LoRAcles for Qwen-3-14B and Llama-3.3-70B, and observe that performance scales smoothly with the size of our training dataset up to 100K LoRAs, though our method may need refinement to scale further. We note, however, that LoRAcles are prone to hallucinations and currently elicit only the most salient behavioral changes, limiting their utility beyond narrow fine-tunes.
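
As background on what "LoRA adapter weights" refers to, the sketch below shows the standard LoRA parameterization, in which a weight update is stored as two low-rank factors B and A and applied as (alpha / r) · B · A. It is a toy illustration in plain Python, not the authors' pipeline: the abstract does not say how LoRAcles encode adapter weights, so `flatten_adapter` is a hypothetical serialization, and the helper names and toy dimensions are assumptions made for illustration.

```python
# Illustrative sketch only: the standard LoRA parameterization.
# This is NOT the LoRAcle input encoding, which the abstract does not specify.
from typing import List

Matrix = List[List[float]]


def matmul(x: Matrix, y: Matrix) -> Matrix:
    """Plain-Python matrix product of x (m x k) and y (k x n)."""
    assert len(x[0]) == len(y), "inner dimensions must match"
    return [
        [sum(x[i][k] * y[k][j] for k in range(len(y))) for j in range(len(y[0]))]
        for i in range(len(x))
    ]


def lora_delta(b: Matrix, a: Matrix, alpha: float) -> Matrix:
    """Return the low-rank weight update (alpha / r) * B @ A."""
    r = len(a)  # rank r is the number of rows of A
    scale = alpha / r
    return [[scale * v for v in row] for row in matmul(b, a)]


def flatten_adapter(b: Matrix, a: Matrix) -> List[float]:
    """Hypothetical serialization: concatenate both factors into one flat vector."""
    return [v for row in b for v in row] + [v for row in a for v in row]


# Toy adapter with d_out = 3, d_in = 4, rank r = 2 (assumed sizes).
B = [[1.0, 0.0], [0.0, 1.0], [1.0, 1.0]]          # d_out x r
A = [[1.0, 2.0, 3.0, 4.0], [0.0, 1.0, 0.0, 1.0]]  # r x d_in

delta_w = lora_delta(B, A, alpha=2.0)  # d_out x d_in update to the frozen weight
features = flatten_adapter(B, A)       # r * (d_out + d_in) numbers
```

At realistic layer sizes the factors hold r · (d_out + d_in) numbers, far fewer than the d_out · d_in entries of the full update, which is one plausible reason adapter weights are a tractable input for another model to read.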

Submission Number: 571
