Keywords: Concept Discovery (e.g., SAEs, dictionary learning), Methods (probing, steering, causal interventions), Interpretability for AI Safety
Other Keywords: Scalable-End-To-End-Interpretability
TL;DR: We introduce LoRAcles: fine-tuned language models that take LoRA adapter weights as input, and answer natural language questions about them.
Abstract: Fine-tuned LLMs can learn complex and subtle behaviors. However, it can be difficult to ascertain which behaviors a model acquired during fine-tuning. We introduce LoRAcles: fine-tuned language models that take LoRA adapter weights as input, and answer natural language questions about them.
We introduce a self-supervised pipeline for training LoRAcles: we first train LoRAs on small sets of pre-training documents, and then train LoRAcles on these LoRAs to answer questions about the documents. LoRAcles generalize far beyond their training data: they are state-of-the-art on AuditBench, a benchmark of models with sophisticated hidden behaviors; they can detect subtle changes in model organisms such as subliminal learning; and they are the first tool to achieve non-trivial performance at verbalizing semantic backdoor triggers. LoRAcles can also describe some learned behaviors from the weights of a full-parameter fine-tune. LoRAcles are a highly scalable method: we train LoRAcles for Qwen-3-14B and Llama-3.3-70B, and observe that performance scales smoothly with the size of our training dataset up to 100K LoRAs, though our method may need refinement in order to scale further.
We note, though, that LoRAcles are prone to hallucinations and currently only elicit the most salient behavioral changes, limiting their utility beyond narrow fine-tunes.
Submission Number: 571