Keywords: Explainable AI, Interpretable ML, visual grounding, Foundation Models
TL;DR: Adaptation can help visually ground image encoders
Abstract: High-dimensional imagery contains the high-resolution information that end users require for decision-making. Due to computational constraints, current methods for image-level classification are designed to train on image chunks or down-sampled images rather than on the full high-resolution context. While these methods achieve impressive classification performance, they often lack visual grounding and, thus, the post hoc capability to identify class-specific, salient objects under weak supervision. In this work, we (1) propose a formalized evaluation framework to assess visual grounding in high-dimensional image applications. To present a challenging benchmark, we leverage a real-world segmentation dataset for post hoc mask evaluation. We use this framework to characterize the visual grounding of various baseline methods across multiple encoder classes, exploring multiple supervision regimes and architectures (e.g., ResNet, ViT). Finally, we (2) present prospector heads: a novel class of adaptation architectures designed to improve visual grounding. Prospectors leverage chunk heterogeneity to identify salient objects over long ranges and can interface with any image encoder. We find that prospectors outperform baselines by upwards of +6 balanced accuracy points and +30 precision points in a gigapixel pathology setting. Through this experimentation, we also show how prospectors enable many classes of encoders to identify salient objects without re-training, and we demonstrate their improved performance over classical explanation techniques (e.g., attention maps).
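The abstract describes two technical ideas at a high level: a lightweight adaptation head that scores frozen chunk embeddings for saliency, and post hoc mask evaluation of those scores against a segmentation dataset using balanced accuracy and precision. The sketch below is a minimal, hypothetical illustration of that pattern, not the authors' implementation; all function names, shapes, and the linear-head choice are assumptions for illustration only.

```python
# Hypothetical sketch of (1) a prospector-style head scoring frozen chunk
# embeddings and (2) post hoc mask evaluation with balanced accuracy / precision.
import numpy as np

def chunk_saliency(chunk_embeddings: np.ndarray, head_weights: np.ndarray) -> np.ndarray:
    """Score each image chunk with a simple linear head over frozen embeddings.

    chunk_embeddings: (n_chunks, embed_dim) array from any pretrained encoder.
    head_weights:     (embed_dim,) adaptation weights (hypothetical stand-in).
    Returns per-chunk saliency scores in [0, 1].
    """
    logits = chunk_embeddings @ head_weights
    return 1.0 / (1.0 + np.exp(-logits))  # sigmoid

def mask_evaluation(scores: np.ndarray, chunk_mask: np.ndarray, threshold: float = 0.5):
    """Compare thresholded per-chunk scores to a binarized segmentation mask.

    chunk_mask: (n_chunks,) binary array, 1 where a chunk overlaps a salient object.
    Returns (balanced_accuracy, precision).
    """
    pred = scores >= threshold
    tp = np.sum(pred & (chunk_mask == 1))
    fp = np.sum(pred & (chunk_mask == 0))
    fn = np.sum(~pred & (chunk_mask == 1))
    tn = np.sum(~pred & (chunk_mask == 0))
    sensitivity = tp / (tp + fn) if (tp + fn) else 0.0
    specificity = tn / (tn + fp) if (tn + fp) else 0.0
    precision = tp / (tp + fp) if (tp + fp) else 0.0
    return 0.5 * (sensitivity + specificity), precision

# Toy usage: 16 chunks with 32-dim embeddings, random head weights, random mask.
rng = np.random.default_rng(0)
emb = rng.normal(size=(16, 32))
scores = chunk_saliency(emb, rng.normal(size=32))
bal_acc, prec = mask_evaluation(scores, rng.integers(0, 2, size=16))
print(f"balanced accuracy={bal_acc:.2f}, precision={prec:.2f}")
```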
Submission Number: 80