Rethinking Texture Bias in Vision Transformers

ICLR 2026 Conference Submission 22103 Authors

19 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Novel View Synthesis, Dynamic Scene, Gaussian Splatting
Abstract: Vision Transformer (ViT)-based foundation models have shown impressive performance across a broad range of tasks, yet they struggle in fine-grained applications that depend on local texture. This challenge stems from their lack of inductive biases toward localized visual features, a critical gap for tasks in graphics and vision. To investigate this, we introduce a base-to-novel generalization framework that isolates texture sensitivity while controlling for dataset scale and application-specific constraints. Our analysis reveals that ViTs exhibit a pronounced deficiency in recognizing local textures while preferring global textures presented at large spatial scales. To understand the origin of this bias, we conduct a systematic study across training, data, and architectural factors, focusing on texture disentanglement, spatial scale sensitivity, and noise robustness. We further employ representational analysis to expose ViTs' limitations in modeling fine-grained texture patterns. Our work provides actionable insights for improving the inductive biases of ViT-based foundation models, informing robust texture representation in graphics applications.
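The abstract does not spell out how texture sensitivity is probed; as a rough, self-contained illustration (our own construction, not the submission's base-to-novel framework), the sketch below compares a pretrained ViT's pooled features for an image and a patch-shuffled copy. Patch shuffling destroys global shape and layout while roughly preserving local texture statistics, so high feature similarity suggests reliance on texture-like cues. The model name, the shuffling heuristic, and the similarity metric are all assumptions made for illustration.

```python
# Minimal sketch, assuming a timm ViT checkpoint; NOT the submission's framework.
# Compares pooled ViT features of an image vs. a patch-shuffled copy as a crude
# proxy for how much the representation depends on local texture statistics.
import torch
import timm

# num_classes=0 makes the model return pooled features instead of logits.
# pretrained=True downloads ImageNet weights on first use.
model = timm.create_model("vit_base_patch16_224", pretrained=True, num_classes=0)
model.eval()

def patch_shuffle(img: torch.Tensor, patch: int = 16) -> torch.Tensor:
    """Randomly permute non-overlapping patch x patch blocks of a CHW image."""
    c, h, w = img.shape
    blocks = img.unfold(1, patch, patch).unfold(2, patch, patch)  # C, H/p, W/p, p, p
    blocks = blocks.contiguous().view(c, -1, patch, patch)
    blocks = blocks[:, torch.randperm(blocks.shape[1])]           # shuffle block order
    blocks = blocks.view(c, h // patch, w // patch, patch, patch)
    return blocks.permute(0, 1, 3, 2, 4).contiguous().view(c, h, w)

# Placeholder input; in practice, load a real image and apply the model's transform.
x = torch.rand(3, 224, 224)
with torch.no_grad():
    f_orig = model(x.unsqueeze(0))
    f_shuf = model(patch_shuffle(x).unsqueeze(0))

similarity = torch.cosine_similarity(f_orig, f_shuf).item()
print(f"cosine similarity (original vs. patch-shuffled): {similarity:.3f}")
```

Averaging this similarity over a set of real images (rather than a random tensor) gives one crude scalar summary of texture reliance; the paper's framework controls for dataset scale and application constraints, which this toy probe does not.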
Supplementary Material: pdf
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 22103