Keywords: Mechanistic interpretability, circuits, circuit evaluation
TL;DR: We evaluate circuits from an adversarial perspective: on how many input points does the circuit's output differ from the full model's output, and how large is this difference?
Abstract: Circuits are supposed to accurately describe how a neural network performs a specific task, but do they really? We evaluate three circuits found in the literature (IOI, greater-than, and docstring) in an adversarial manner, considering inputs where the circuit's behavior maximally diverges from the full model's. Concretely, we measure the KL divergence between the full model's output and the circuit's output, calculated through resample ablation, and we analyze the worst-performing inputs. Our results show that the circuits for the IOI and docstring tasks fail to behave similarly to the full model even on completely benign inputs from the original task, indicating that more robust circuits are needed for safety-critical applications.
Submission Number: 46
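
The abstract's metric, the KL divergence between the full model's output and the circuit's output under resample ablation, can be illustrated with a short sketch. The snippet below is a minimal, hypothetical illustration, not the authors' code: it assumes a TransformerLens-style hooked model, ablates only attention heads for brevity (real circuits also specify MLPs and positions), and uses placeholder names such as `circuit_heads`, `clean_tokens`, and `corrupt_tokens`.

```python
import torch
import torch.nn.functional as F


def circuit_kl_per_example(model, clean_tokens, corrupt_tokens, circuit_heads):
    """KL(full model || circuit) at the final position for each example,
    with every attention head outside `circuit_heads` resample-ablated,
    i.e. overwritten with its activation on the corrupt (resampled) input."""
    # Full-model logits on the clean inputs.
    clean_logits = model(clean_tokens)

    # Cache activations from the resample (corrupt) inputs.
    _, corrupt_cache = model.run_with_cache(corrupt_tokens)

    def patch_head(z, hook, head):
        # z: [batch, pos, head_index, d_head]; replace one head's output.
        z[:, :, head] = corrupt_cache[hook.name][:, :, head]
        return z

    fwd_hooks = [
        (f"blocks.{layer}.attn.hook_z",
         lambda z, hook, h=head: patch_head(z, hook, h))
        for layer in range(model.cfg.n_layers)
        for head in range(model.cfg.n_heads)
        if (layer, head) not in circuit_heads
    ]
    circuit_logits = model.run_with_hooks(clean_tokens, fwd_hooks=fwd_hooks)

    # KL divergence between the two next-token distributions, per example.
    p = F.log_softmax(clean_logits[:, -1], dim=-1)    # full model
    q = F.log_softmax(circuit_logits[:, -1], dim=-1)  # ablated "circuit"
    return F.kl_div(q, p, log_target=True, reduction="none").sum(-1)
```

Sorting a dataset of task inputs by this per-example KL is one way to surface the worst-performing inputs that the abstract refers to.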