Inherently Faithful Attention Maps for Vision Transformers

TMLR Paper5322 Authors

07 Jul 2025 (modified: 18 Sept 2025). Rejected by TMLR. License: CC BY 4.0

Abstract: We introduce an attention-based method that uses learned binary attention masks to ensure that only attended image regions influence the prediction, a property we term inherently faithful. Context can strongly affect object perception, sometimes leading to biased representations, particularly when objects appear in out-of-distribution backgrounds. At the same time, many image-level object-centric tasks require identifying relevant regions, which itself often depends on context. To resolve this tension, we propose a two-stage framework: stage 1 processes the full image to discover object parts and identify task-relevant regions, while stage 2 uses input attention masking to restrict its receptive field to these regions, enabling a focused analysis that filters out potentially spurious information. Both stages are trained jointly, allowing stage 2 to refine stage 1. Extensive experiments across diverse benchmarks demonstrate that our approach significantly improves robustness against spurious correlations and out-of-distribution backgrounds.
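
The masking idea in the abstract can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes single-head self-attention with identity projections, and it realizes the binary mask by dropping masked-out tokens entirely, which for the kept tokens is equivalent to setting the masked attention logits to negative infinity. The function name `masked_attention` and the toy token values are hypothetical. The point it shows is the faithfulness property: changing a masked-out token cannot change the output.

```python
import math

def masked_attention(tokens, keep):
    """Toy single-head self-attention restricted to kept tokens.

    tokens: list of equal-length float lists (stand-ins for patch embeddings).
    keep:   list of 0/1 flags, one per token (stand-in for the stage 1 mask).
    Returns one output vector per kept token, in order.
    Assumption: identity query/key/value projections, for brevity.
    """
    kept = [t for t, k in zip(tokens, keep) if k]
    if not kept:
        raise ValueError("mask must keep at least one token")
    scale = 1.0 / math.sqrt(len(kept[0]))
    outputs = []
    for q in kept:
        # Scaled dot-product scores against kept tokens only.
        scores = [scale * sum(a * b for a, b in zip(q, key)) for key in kept]
        # Numerically stable softmax.
        m = max(scores)
        weights = [math.exp(s - m) for s in scores]
        total = sum(weights)
        # Weighted average of kept tokens (values = tokens here).
        out = [sum(w * v[d] for w, v in zip(weights, kept)) / total
               for d in range(len(q))]
        outputs.append(out)
    return outputs

# Third token plays the role of a background patch that the mask excludes.
tokens = [[1.0, 0.0], [0.0, 1.0], [5.0, 5.0]]
keep = [1, 1, 0]
out_a = masked_attention(tokens, keep)

# Replace the masked-out token with something very different.
tokens_perturbed = [tokens[0], tokens[1], [-9.0, 3.0]]
out_b = masked_attention(tokens_perturbed, keep)

print(out_a == out_b)  # prints True: the masked region has no influence
```

In a real vision transformer the mask would be applied inside every attention layer of stage 2 and learned jointly with stage 1; the sketch only demonstrates why masked regions cannot leak into the prediction.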

Submission Length: Regular submission (no more than 12 pages of main content)

Assigned Action Editor: ~Stephen_Lin1

Submission Number: 5322
