Astronomy as a Ground-Truth Sandbox for Interpreting Large Models

ICLR 2026 Workshop Sci4DL, Submission 64 (authors anonymized)

04 Feb 2026 (modified: 02 Mar 2026). Submitted to Sci4DL 2026. License: CC BY 4.0

Keywords: astronomy, concept erasure, mechanistic interpretability, representation geometry, vision language models

TL;DR: Using astronomical redshift as ground truth, we show that linear-probe monitors can be fooled. Techniques such as subspace erasure and steering hide information without deleting it, exposing the limits of probe-based auditing.

Abstract: Studies of mechanistic interpretability often target concepts like honesty or deception, where ground truth is ambiguous. We propose a comprehensive testbed: astronomical observations with physically defined labels. Using 74,925 galaxy images with spectroscopic redshifts, which measure astronomical distance, we study how this quantity is represented across three architectures (DINOv2, Qwen2-VL, AstroPT). We test four common assumptions and find that: (i) distance is linearly decodable yet not axis-aligned, and is instead concentrated in a low-rank subspace; (ii) cross-model geometry can align even when usable linear features do not transfer (a geometric-functional paradox); (iii) steering along learned distance directions causally shifts distance-related language, with strong prompt dependence; and (iv) breaking a monitor via linear removal is not a certificate of deletion under adversarial audit. Overall, astronomy provides a grounded sandbox that makes interpretability claims falsifiable under controlled manipulations.
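Finding (iv) can be illustrated with a small synthetic sketch. It is not the submission's pipeline: the data are random stand-ins for embeddings, the erasure step is a plain projection along a single probe direction (chosen here only for simplicity), and every name (`fit_linear_probe`, `run_demo`, `noise_scale`) is hypothetical. A linear monitor is fit, its direction is projected out, and the frozen monitor is then compared with a freshly refit probe acting as the auditor.

```python
import numpy as np


def r2_score(z_true, z_pred):
    """Coefficient of determination on held-out data."""
    ss_res = np.sum((z_true - z_pred) ** 2)
    ss_tot = np.sum((z_true - z_true.mean()) ** 2)
    return 1.0 - ss_res / ss_tot


def fit_linear_probe(X, z):
    """Least-squares probe on centered features; returns (weights, feature mean, target mean)."""
    mu, z_bar = X.mean(axis=0), z.mean()
    w, *_ = np.linalg.lstsq(X - mu, z - z_bar, rcond=None)
    return w, mu, z_bar


def predict(probe, X):
    w, mu, z_bar = probe
    return (X - mu) @ w + z_bar


def run_demo(seed=0, n=4000, d=16):
    rng = np.random.default_rng(seed)
    # Assumption: z is a standardized stand-in for redshift, and every
    # feature carries z plus noise whose scale differs across features.
    z = rng.standard_normal(n)
    noise_scale = np.linspace(0.5, 2.0, d)
    X = z[:, None] + rng.standard_normal((n, d)) * noise_scale

    half = n // 2
    X_tr, z_tr, X_te, z_te = X[:half], z[:half], X[half:], z[half:]

    # 1) Fit the monitor (a linear probe) and score it on held-out data.
    monitor = fit_linear_probe(X_tr, z_tr)
    r2_before = r2_score(z_te, predict(monitor, X_te))

    # 2) "Erase" by projecting out the monitor's own weight direction.
    w, mu, _ = monitor
    u = w / np.linalg.norm(w)

    def erase(A):
        return A - ((A - mu) @ u)[:, None] * u

    X_tr_e, X_te_e = erase(X_tr), erase(X_te)

    # 3) The frozen monitor now outputs a constant, so it looks broken.
    r2_monitor = r2_score(z_te, predict(monitor, X_te_e))

    # 4) An auditor that refits a probe on the erased features
    #    can still decode z from the remaining directions.
    auditor = fit_linear_probe(X_tr_e, z_tr)
    r2_refit = r2_score(z_te, predict(auditor, X_te_e))
    return r2_before, r2_monitor, r2_refit


if __name__ == "__main__":
    before, frozen, refit = run_demo()
    print(f"R^2 before erasure:        {before:.3f}")
    print(f"R^2 frozen monitor, after: {frozen:.3f}")
    print(f"R^2 refit auditor, after:  {refit:.3f}")
```

In this toy setting the probe direction is tilted toward low-noise features, so it differs from the direction along which z actually varies. Projecting it out silences the frozen monitor while leaving a linearly decodable signal for a refit probe. Stronger erasure schemes would narrow that gap, but the sketch shows why a broken monitor alone does not certify deletion.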

Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.

Style Files: I have used the style files.

Submission Number: 64
