Probing Perturbation Invariance in DINOv2: Mechanistic Gaps Between Real and Generated Image Representations

Published: 11 Jun 2026, Last Modified: 11 Jun 2026, Mech Interp Workshop ICML 2026 Virtual poster, CC BY 4.0
Keywords: Applications of interpretability
Other Keywords: mechanistic interpretability, probing, vision transformers, DINOv2, representation geometry, iBOT, perturbation sensitivity, AI image detection
TL;DR: DINOv2 patch tokens are more noise-sensitive for AI-generated images (OOD relative to DINOv2's training data) than for real photos; averaging the shifts of all 256 patch tokens raises worst-case Cohen's |d| by 14% over the CLS-only baseline (0.86 to 0.98) with no added computation.
Abstract: We show that AI-generated images, which are out-of-distribution (OOD) relative to DINOv2's pretraining corpus, exhibit systematically higher sensitivity to Gaussian noise perturbation than real photographs, measured via patch-token cosine distance across five generators in the Synthbuster benchmark. A perturbation-type natural experiment confirms the mechanism: blur within DINOv2's training augmentation range ($r{=}1$ px) gives worst-case $|d|{=}0.134$, while noise ($\sigma{=}100$, absent from training) gives $|d|{=}0.93$, a $7{\times}$ gap; blur exceeding the training range ($r{\geq}2$ px) gives $|d|{\geq}1.47$, confirming that OOD status drives the signal. Averaging all 256 spatial patch tokens rather than using the single CLS token improves worst-case Cohen's $|d|$ from $0.86$ to $0.98$ (95% CI: $[0.78, 1.20]$) by accessing local perturbation responses that global attention pooling discards. Leave-one-generator-out cross-validation ($|d|{=}0.949$, $\Delta{=}0.031$) confirms that the $\sigma{=}100$ operating point is not cherry-picked, and non-perturbative baselines (worst-case $|d|{\leq}0.071$) confirm that perturbation is essential. Code is available at https://anonymous.4open.science/r/probing-perturbation-invariance-dinov2-EB58/.
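The scoring pipeline summarized in the abstract can be sketched as follows. This is a minimal illustration under stated assumptions, not the authors' released code: the function names are hypothetical, token embeddings are assumed to arrive as NumPy arrays of shape `(num_patches, dim)` from some DINOv2 forward pass that is not shown, $\sigma$ is assumed to be on a 0 to 255 pixel scale, and Cohen's $d$ is assumed to use the pooled standard deviation.

```python
import numpy as np


def add_gaussian_noise(image, sigma, rng):
    """Add i.i.d. Gaussian noise to an image and clip to [0, 255].

    Assumption: pixel values and sigma are both on a 0-255 scale.
    """
    noisy = image.astype(np.float64) + rng.normal(0.0, sigma, size=image.shape)
    return np.clip(noisy, 0.0, 255.0)


def cosine_distance(a, b, eps=1e-12):
    """Cosine distance 1 - cos(a, b) along the last axis."""
    num = np.sum(a * b, axis=-1)
    den = np.linalg.norm(a, axis=-1) * np.linalg.norm(b, axis=-1) + eps
    return 1.0 - num / den


def patch_shift_score(clean_tokens, noisy_tokens):
    """Mean cosine distance over all patch tokens of one image.

    Both inputs have shape (num_patches, dim), e.g. (256, dim).
    """
    return float(np.mean(cosine_distance(clean_tokens, noisy_tokens)))


def cls_shift_score(clean_cls, noisy_cls):
    """CLS-only baseline: cosine distance between two (dim,) vectors."""
    return float(cosine_distance(clean_cls, noisy_cls))


def cohens_d(x, y):
    """Cohen's d between two score samples, using the pooled std (ddof=1)."""
    x = np.asarray(x, dtype=np.float64)
    y = np.asarray(y, dtype=np.float64)
    nx, ny = len(x), len(y)
    pooled_var = ((nx - 1) * x.var(ddof=1) + (ny - 1) * y.var(ddof=1)) / (nx + ny - 2)
    return float((x.mean() - y.mean()) / np.sqrt(pooled_var))
```

In use, one would compute `patch_shift_score` for each real and each generated image (clean versus noised embeddings), then pass the two score lists to `cohens_d`; the worst-case $|d|$ quoted above would correspond to the smallest absolute value across generators.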
Submission Number: 45