Keywords: astronomy, concept erasure, mechanistic interpretability, representation geometry, vision language models
TL;DR: Using astronomical redshift as ground truth shows that linear-probe monitors can be fooled. Techniques such as subspace erasure and steering hide information without deleting it, exposing the limits of probe-based auditing.
Abstract: Studies of mechanistic interpretability often target concepts like honesty or deception, where ground truth is ambiguous. We propose a comprehensive testbed: astronomical observations with physically-defined labels. Using 74,925 galaxy images with spectroscopic redshifts that measure astronomical distance, we study how this quantity is represented across three architectures (DINOv2, Qwen2-VL, AstroPT). We test four common assumptions and find: (i) distance is linearly decodable yet not axis-aligned, instead concentrated in a low-rank subspace; (ii) cross-model geometry can align while usable linear features do not transfer (a geometric-functional paradox); (iii) steering along learned distance directions causally shifts distance-related language, with strong prompt dependence; and (iv) breaking a monitor via linear removal is not a certificate of deletion under adversarial audit. Overall, astronomy provides a grounded sandbox that makes interpretability claims falsifiable under controlled manipulations.
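Finding (iv) can be illustrated with a minimal toy sketch. It is not the submission's pipeline: the data are synthetic, the 8-dimensional "embedding" and the variable `z` (a stand-in for redshift) are assumptions made for illustration, and the erasure step is a simple projection along the feature-label cross-covariance direction rather than whatever removal method the authors use. The helper names (`fit_ridge`, `r2_score`, `quadratic_features`, `run_demo`) are hypothetical. The sketch encodes `z` twice, once linearly and once as a product of two features, erases the linear signal, and then re-audits with a quadratic probe.

```python
import numpy as np

def fit_ridge(X, y, lam=1e-3):
    """Closed-form ridge regression with an intercept; returns (weights, bias)."""
    mu, ybar = X.mean(axis=0), y.mean()
    Xc = X - mu
    w = np.linalg.solve(Xc.T @ Xc + lam * np.eye(X.shape[1]), Xc.T @ (y - ybar))
    return w, ybar - mu @ w

def r2_score(X, y, w, b):
    """Coefficient of determination of the probe (w, b) on (X, y)."""
    resid = y - (X @ w + b)
    return 1.0 - np.mean(resid ** 2) / np.var(y)

def quadratic_features(X):
    """Original features plus all pairwise products x_i * x_j with i <= j."""
    i, j = np.triu_indices(X.shape[1])
    return np.hstack([X, X[:, i] * X[:, j]])

def run_demo(n=4000, d=8, seed=0):
    rng = np.random.default_rng(seed)
    z = rng.uniform(0.0, 1.0, size=n)        # toy stand-in for redshift
    s = rng.choice([-1.0, 1.0], size=n)      # random sign, independent of z
    X = rng.normal(size=(n, d))              # synthetic "embedding" (assumption)
    X[:, 0] = z + 0.05 * rng.normal(size=n)  # linear encoding of z
    X[:, 1] = s                              # hidden encoding: the product of
    X[:, 2] = s * (z - 0.5)                  # columns 1 and 2 equals z - 0.5
    tr, te = slice(0, 3 * n // 4), slice(3 * n // 4, n)

    # Linear probe (the "monitor") before erasure.
    before = r2_score(X[te], z[te], *fit_ridge(X[tr], z[tr]))

    # Linear erasure: project out the training cross-covariance direction,
    # so no linear readout of the erased training features covaries with z.
    c = (X[tr] - X[tr].mean(axis=0)).T @ (z[tr] - z[tr].mean())
    c /= np.linalg.norm(c)
    X_erased = X - np.outer(X @ c, c)

    # Re-audit with a fresh linear probe, then with a quadratic probe.
    after_linear = r2_score(X_erased[te], z[te], *fit_ridge(X_erased[tr], z[tr]))
    Q = quadratic_features(X_erased)
    after_quadratic = r2_score(Q[te], z[te], *fit_ridge(Q[tr], z[tr]))
    return before, after_linear, after_quadratic
```

In this toy setting the linear probe's held-out score collapses after erasure while the quadratic probe still recovers `z`, which is the sense in which a broken linear monitor does not certify deletion. Whether and how strongly this holds for real model embeddings is an empirical question the sketch does not address.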
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Style Files: I have used the style files.
Submission Number: 64