Keywords: Methods (probing, steering, causal interventions), Interpretability for AI Safety, Concept Discovery (e.g., SAEs, dictionary learning)
TL;DR: We propose a principled way of improving Activation Oracle training and show improvements regarding hallucinations and vagueness.
Abstract: _Activation Oracles_ (AOs) are promising methods for interpreting residual stream activations.
However, current AOs suffer from important issues, such as hallucinations and vagueness. Additionally, text-inversion confounds make them hard to evaluate.
To address these issues, we propose two principles for training data construction: _solvability_, realized by training on on-policy data, and _targetedness_, realized by avoiding gaming of the target through text inversion. We find that these interventions yield modest but significant improvements on hallucination and vagueness, and that the resulting AOs are overall more usable.
In addition, we open source the first comprehensive evaluation suite for AO quality, which we call _AObench_.
We also share preliminary negative results regarding _Multi-Layer-Activation Oracles_ (MLAOs): they work and reduce training loss, but do not lead to substantial uplift in downstream evaluations, contrary to what one might expect.
Overall, we hope that our work sets a foundation that helps improve AOs, joining a paradigm of scalable, end-to-end interpretability.
Submission Number: 790