Keywords: Fine-Grained Visual Understanding, Ultra-Fine-Grained Recognition, Multimodal AI, Trustworthy AI, Visual-Language Alignment
TL;DR: This paper reviews fine-grained and ultra-fine-grained visual understanding, highlighting their roles, challenges, and future directions in multimodal and trustworthy AI.
Abstract: Fine-Grained (FG) and Ultra-Fine-Grained (UFG) visual understanding has recently become an important problem in AI research because it requires distinguishing objects that are visually very similar yet semantically distinct. This paper offers a viewpoint-based overview and taxonomy of the state of the art in Fine-Grained Visual Categorization (FGVC), covering existing FGVC datasets and approaches, and identifies the pros and cons of FGVC datasets in terms of scalability, annotation cost, domain coverage, and generalization. We also review how recent trends, including transformer-based vision architectures, advanced data augmentation (in particular generative augmentation), and the multimodal integration of vision with language and metadata, are addressing the practical needs of FGVC and Ultra-Fine-Grained Visual Categorization (UFGVC) in building multimodal and trustworthy AI systems. Our study identifies several standing challenges: the lack of public ultra-fine-grained datasets, high annotation complexity, the difficulty of learning under long-tailed and rare-class distributions, and the under-exploitation of multimodal and semantic context. Finally, we present a prospective research roadmap for multimodal visual understanding covering robustness under long-tailed distributions, explainability, data-efficient learning, and deployment in real-world applications. Motivated by prior advances as well as the remaining challenges in this field, we hope this survey can inform the design of more generalizable, reliable, and semantically grounded visual intelligence.
Submission Number: 4