Blur to Focus Attention in Fine-Grained Visual Recognition

ICLR 2026 Conference Submission 1220 Authors

03 Sept 2025 (modified: 08 Oct 2025) · ICLR 2026 Conference Submission · CC BY 4.0
Keywords: Ultra-fine-grained, fine-grained, attention, blur-to-focus, patchification
Abstract: Fine-grained visual recognition (FGVR) requires distinguishing categories separated by tiny discriminative cues such as fine textures, part shapes, or color patterns. In typical datasets, discriminative regions occupy less than 30% of the image area, and in ultra-fine-grained cases often under 10%. This sparsity makes training highly fragile. Standard data augmentations risk destroying these subtle signals, while part-based or attention-driven models depend on annotations or rigid architectures and often fail under pose variation, occlusion, or cluttered backgrounds. We present DEFOCA, a simple layer that patchifies an image and stochastically applies Gaussian blur to selected patches. Each patch selection (e.g., random or contiguous) defines a single view, and multiple such views encourage the model to rely on diverse subsets of discriminative cues while reducing dependence on spurious background features. In this way, DEFOCA functions as a soft, attention-like mechanism that integrates seamlessly with existing architectures. Theoretically, we show that DEFOCA is label-safe, that contiguous patch layouts maximize the probability of label-safety, and that the expected representation drift is minimized. This guarantees that critical features are preserved while irrelevant high-frequency noise is suppressed, thereby narrowing the generalization gap. Empirically, DEFOCA achieves competitive performance on widely used fine-grained benchmarks (CUB-200-2011, Stanford Cars, NABirds, FGVC Aircraft) as well as ultra-fine-grained datasets (Cotton80, SoyGene, SoyGlobal). These results establish DEFOCA as a principled and highly effective solution for robust and discriminative feature learning in FGVR.
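The layer described in the abstract — patchify the image, then stochastically Gaussian-blur a subset of patches so each selection defines one training view — can be sketched as follows. This is a minimal illustration assuming a NumPy `(H, W, C)` image; the function names, defaults (`patch=16`, `blur_frac=0.5`, `sigma=1.5`), and the random (non-contiguous) selection strategy are assumptions for exposition, not the authors' implementation.

```python
import numpy as np

def gaussian_blur(tile, sigma=1.5):
    """Separable Gaussian blur on an (h, w, c) tile with reflect padding."""
    r = max(1, int(3 * sigma))
    x = np.arange(-r, r + 1, dtype=np.float32)
    k = np.exp(-x ** 2 / (2 * sigma ** 2))
    k /= k.sum()
    blur1d = lambda v: np.convolve(np.pad(v, r, mode="reflect"), k, mode="valid")
    tile = np.apply_along_axis(blur1d, 0, tile)  # blur along rows (vertical)
    tile = np.apply_along_axis(blur1d, 1, tile)  # blur along columns (horizontal)
    return tile

def defoca_view(img, patch=16, blur_frac=0.5, sigma=1.5, rng=None):
    """One DEFOCA-style view (illustrative sketch): split img (H, W, C) into
    patch x patch tiles and blur a random subset, leaving the rest sharp.
    Different draws of `rng` yield different views of the same image."""
    rng = np.random.default_rng() if rng is None else rng
    H, W, _ = img.shape
    gh, gw = H // patch, W // patch
    n_blur = int(round(blur_frac * gh * gw))
    chosen = rng.choice(gh * gw, size=n_blur, replace=False)
    out = img.astype(np.float32).copy()
    for idx in chosen:
        r0, c0 = divmod(int(idx), gw)
        ys = slice(r0 * patch, (r0 + 1) * patch)
        xs = slice(c0 * patch, (c0 + 1) * patch)
        out[ys, xs] = gaussian_blur(out[ys, xs], sigma)
    return out
```

A contiguous layout, which the abstract says maximizes the probability of label-safety, would instead blur one rectangular block of adjacent patches; swapping the `rng.choice` call for a randomly placed block index set is the only change needed.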
Primary Area: unsupervised, self-supervised, semi-supervised, and supervised representation learning
Submission Number: 1220