PROBING INFORMATION FLOW IN VISION TRANSFORMERS THROUGH CONTROLLED ATTENTION PERTURBATION

Published: 02 Mar 2026 · Last Modified: 02 Mar 2026 · Sci4DL 2026 · CC BY 4.0
Keywords: vision transformers, sparse attention, diffusion models, small-world networks, ablation study
TL;DR: We mask ViT attention using graph-theoretic patterns and find that diffusion is drastically more sensitive than classification or retrieval, and that local connectivity matters far more than random shortcuts.
Abstract: We apply identical attention sparsity to three vision transformer tasks and find order-of-magnitude differences in sensitivity: at 75% sparsity, CLIP retrieval degrades 2%, classification degrades 7%, while diffusion generation degrades 274%. To systematically probe this, we design three masking strategies with distinct graph-theoretic properties (small-world, preferential attachment, hub-spoke) and measure degradation across density levels. Ablating small-world masks reveals that spatial locality, not long-range shortcuts, drives performance preservation, with local-only connectivity outperforming random-only by 7.6×. We hypothesize that diffusion's sensitivity arises from error accumulation across 250 sequential denoising steps, where each disruption compounds through subsequent iterations. These findings demonstrate how controlled perturbation can reveal task-dependent differences in transformer information flow that static analysis would miss.
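For concreteness, a minimal PyTorch sketch of how the small-world masking and its ablation could be implemented, assuming a 1-D ordering of ViT tokens; the function names, the `window`/`num_shortcuts` parameters, and the 1-D neighborhood are illustrative assumptions, not the paper's actual implementation. The `local_only`/`random_only` flags mirror the ablation that separates the mask into its local and shortcut components.

```python
import torch

def small_world_mask(num_tokens: int, window: int, num_shortcuts: int,
                     local_only: bool = False, random_only: bool = False,
                     seed: int = 0) -> torch.Tensor:
    """Boolean [num_tokens, num_tokens] attention mask (hypothetical sketch).

    Small-world structure: each token attends to a local window of
    neighbors plus a few random long-range shortcuts. The local_only /
    random_only flags isolate the two components for the ablation.
    """
    gen = torch.Generator().manual_seed(seed)
    mask = torch.zeros(num_tokens, num_tokens, dtype=torch.bool)
    idx = torch.arange(num_tokens)
    if not random_only:
        # Local connectivity: |i - j| <= window, a 1-D stand-in for the
        # 2-D spatial neighborhood of ViT patch tokens.
        mask |= (idx[:, None] - idx[None, :]).abs() <= window
    if not local_only:
        # Random long-range shortcuts, drawn independently per query token.
        for i in range(num_tokens):
            j = torch.randint(0, num_tokens, (num_shortcuts,), generator=gen)
            mask[i, j] = True
    # Always keep the diagonal so no token loses self-attention entirely.
    mask |= torch.eye(num_tokens, dtype=torch.bool)
    return mask

def masked_attention(q, k, v, mask):
    # Standard scaled dot-product attention; disallowed pairs are set to
    # -inf before the softmax so they receive zero attention weight.
    scores = q @ k.transpose(-2, -1) / q.shape[-1] ** 0.5
    scores = scores.masked_fill(~mask, float("-inf"))
    return torch.softmax(scores, dim=-1) @ v
```

Sweeping a target sparsity level then amounts to choosing `window` and `num_shortcuts` so that `mask.float().mean()` hits the desired fraction of retained attention entries.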
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Style Files: I have used the style files.
Submission Number: 85