Keywords: Probing, AI Safety, Sparse Autoencoders
TL;DR: When training activation probes on datasets with severe positive-class scarcity, leveraging the abundant negative examples can boost performance compared to training on a small, balanced dataset.
Abstract: Efforts to monitor advanced AI for rare misalignments face a data challenge: abundant aligned examples but only a handful of misaligned ones. We test activation probes in this "few vs. thousands" regime on spam and honesty detection tasks. For our tasks, training with many negative examples is more positive-sample-efficient than balanced training: with just a single positive spam email, for instance, linear probes achieve an AUC of $0.90$ on spam detection, versus $0.80$ for balanced training. We also find that LLM upsampling can provide a performance boost equivalent to roughly doubling the number of real positive samples, though excessive upsampling hurts performance. Finally, we show a positive scaling trend, where larger models are more positive-sample-efficient to probe. Our findings suggest that we should leverage the large number of available negative samples to amplify the signal from rare but critical misalignment examples.
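To make the comparison concrete, here is a minimal sketch (not the authors' implementation) of the two training regimes described in the abstract, using synthetic stand-in activations and a scikit-learn logistic-regression probe; the array names, dimensionality, and class-weighting choice are all illustrative assumptions.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)
dim = 64  # hypothetical activation dimensionality

# Synthetic stand-ins for activations; in practice these would be extracted
# from a frozen language model on spam / non-spam (or honest / dishonest) inputs.
direction = rng.normal(size=dim)
neg_train = rng.normal(size=(2000, dim))           # abundant negative examples
pos_train = rng.normal(size=(1, dim)) + direction  # a single positive example
neg_test = rng.normal(size=(500, dim))
pos_test = rng.normal(size=(500, dim)) + direction

def fit_probe(pos, neg):
    """Train a linear probe on stacked positive and negative activations."""
    X = np.vstack([pos, neg])
    y = np.concatenate([np.ones(len(pos)), np.zeros(len(neg))])
    # class_weight="balanced" keeps the lone positive from being drowned out
    # by the thousands of negatives (one possible way to handle the imbalance).
    return LogisticRegression(class_weight="balanced", max_iter=1000).fit(X, y)

def auc(probe):
    X = np.vstack([pos_test, neg_test])
    y = np.concatenate([np.ones(len(pos_test)), np.zeros(len(neg_test))])
    return roc_auc_score(y, probe.decision_function(X))

imbalanced = fit_probe(pos_train, neg_train)      # "few vs. thousands" regime
balanced = fit_probe(pos_train, neg_train[:1])    # small balanced baseline
print(f"imbalanced AUC: {auc(imbalanced):.2f}, balanced AUC: {auc(balanced):.2f}")
```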
Submission Number: 66