Progressive Feature Encoding with Background Perturbation Learning for Ultra-Fine-Grained Visual Categorization

Xin Jiang, Ziye Fang, Fei Shen, Junyao Gao, Zechao Li

Published: 01 Jan 2026, Last Modified: 19 Jan 2026 · IEEE Transactions on Image Processing · CC BY-SA 4.0
Abstract: Ultra-Fine-Grained Visual Categorization (Ultra-FGVC) aims to classify objects into sub-granular categories, presenting the challenge of distinguishing visually similar objects with limited data. Existing methods primarily address sample scarcity but often overlook the importance of leveraging intrinsic object features to construct highly discriminative representations. This limitation significantly constrains their effectiveness on Ultra-FGVC tasks. To address these challenges, we propose the SV-Transformer, which progressively encodes object features while incorporating background perturbation modeling to generate robust and discriminative representations. At the core of our approach is a progressive feature encoder, which hierarchically extracts global semantic structures and local discriminative details from backbone-generated representations. This design enhances inter-class separability while ensuring resilience to intra-class variations. Furthermore, our background perturbation learning mechanism introduces controlled variations in the feature space, effectively mitigating the impact of sample limitations and improving the model's capacity to capture fine-grained distinctions. Comprehensive experiments demonstrate that SV-Transformer achieves state-of-the-art performance on benchmark Ultra-FGVC datasets, showcasing its efficacy in addressing the challenges of the Ultra-FGVC task.
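To make the two components concrete, below is a minimal PyTorch sketch of one plausible reading of the abstract: a coarse-to-fine encoder that pools all patch tokens for global structure and re-attends to the most salient tokens for local detail, plus a perturbation step that adds Gaussian noise only to background tokens. All module names, the top-k token selection, and the additive-noise formulation are illustrative assumptions; the paper's actual architecture is not specified in the abstract.

```python
# Illustrative sketch only, assuming a ViT-style backbone producing
# patch-token features of shape (B, N, D). Names are hypothetical.
import torch
import torch.nn as nn


class ProgressiveEncoder(nn.Module):
    """Toy coarse-to-fine encoder: a global branch mean-pools all tokens
    for semantic structure; a local branch attends over the top-k most
    salient tokens for discriminative detail."""

    def __init__(self, dim: int, k: int = 16):
        super().__init__()
        self.k = k
        self.global_proj = nn.Linear(dim, dim)
        self.local_attn = nn.MultiheadAttention(dim, num_heads=4, batch_first=True)
        self.fuse = nn.Linear(2 * dim, dim)

    def forward(self, tokens: torch.Tensor) -> torch.Tensor:
        # Global semantic structure: pool every patch token.
        g = self.global_proj(tokens.mean(dim=1))                    # (B, D)
        # Local detail: keep the k tokens with the largest L2 norm
        # (a simple stand-in for a learned saliency score).
        scores = tokens.norm(dim=-1)                                # (B, N)
        idx = scores.topk(self.k, dim=1).indices                    # (B, k)
        local = torch.gather(
            tokens, 1, idx.unsqueeze(-1).expand(-1, -1, tokens.size(-1))
        )                                                           # (B, k, D)
        l, _ = self.local_attn(local, local, local)
        # Fuse global structure with pooled local detail.
        return self.fuse(torch.cat([g, l.mean(dim=1)], dim=-1))    # (B, D)


def background_perturbation(tokens, fg_mask, sigma=0.1):
    """Guessed reading of 'controlled variations in the feature space':
    during training, inject Gaussian noise only into tokens outside the
    foreground mask, leaving object features untouched."""
    noise = sigma * torch.randn_like(tokens)
    return tokens + noise * (~fg_mask).unsqueeze(-1).float()
```

Restricting the noise to background tokens is one way a perturbation mechanism could enlarge effective training variation without corrupting the fine-grained object cues that Ultra-FGVC depends on; how the actual method defines and applies its perturbations is detailed in the paper itself.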