Keywords: vision transformers, sparse attention, diffusion models, small-world networks, ablation study
TL;DR: We mask ViT attention using graph-theoretic patterns and find that diffusion is drastically more sensitive than classification or retrieval, and that local connectivity matters far more than random shortcuts.
Abstract: We apply identical attention sparsity to three vision transformer tasks and find order-of-magnitude differences in sensitivity: at 75% sparsity, CLIP retrieval degrades by 2%, classification by 7%, while diffusion generation degrades by 274%. To probe this systematically, we design three masking strategies with distinct graph-theoretic properties (small-world, preferential attachment, hub-spoke) and measure degradation across density levels. Ablating small-world masks reveals that spatial locality, not long-range shortcuts, drives performance preservation, with local-only connectivity outperforming random-only by 7.6×. We hypothesize that diffusion's sensitivity arises from error accumulation across 250 sequential denoising steps, where each disruption compounds through subsequent iterations. These findings demonstrate how controlled perturbation can reveal task-dependent differences in transformer information flow that static analysis would miss.
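The small-world masking and its local-vs-random ablation described in the abstract can be sketched as follows. This is a minimal illustrative sketch, not the paper's implementation: the function name, the `window` and `n_shortcuts` parameters, and the ring-lattice construction are assumptions for exposition; setting `n_shortcuts=0` gives the local-only condition, while dropping the lattice edges would give random-only.

```python
import numpy as np

def small_world_mask(n_tokens, window=2, n_shortcuts=0, seed=0):
    """Boolean attention mask: ring lattice of local neighbours plus
    optional random long-range shortcuts (Watts-Strogatz style).
    Names and parameters are illustrative, not the paper's code."""
    rng = np.random.default_rng(seed)
    mask = np.zeros((n_tokens, n_tokens), dtype=bool)
    idx = np.arange(n_tokens)
    # Local connectivity: each token attends to neighbours within `window`
    # (including itself), wrapping around the token sequence.
    for d in range(-window, window + 1):
        mask[idx, (idx + d) % n_tokens] = True
    # Long-range shortcuts: random symmetric off-lattice edges.
    for _ in range(n_shortcuts):
        i, j = rng.integers(0, n_tokens, size=2)
        mask[i, j] = mask[j, i] = True
    return mask

mask = small_world_mask(16, window=2, n_shortcuts=8)
sparsity = 1.0 - mask.mean()  # fraction of attention entries masked out
```

In use, such a mask would be applied before the attention softmax (e.g. by setting masked logits to a large negative value), so each token only mixes information along the allowed graph edges.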
Anonymization: This submission has been anonymized for double-blind review via the removal of identifying information such as names, affiliations, and identifying URLs.
Style Files: I have used the style files.
Submission Number: 85