Abstract: This paper proposes a method that aggregates rich deep semantic features for fine-grained place classification. The category of an image depends not only on the objects and text it contains but also on its semantic regions, hierarchical structure, and spatial layout. However, most recent fine-grained classification systems ignore this, and the complex multi-level semantic structure of images associated with fine-grained classes has not yet been well explored. Our approach therefore comprises two modules: a Content Estimator (CNE) and a Context Estimator (CXE). CNE generates deep content features by encoding the global visual cues of an image. CXE obtains rich context features and consists of three sub-estimators: a Text Context Estimator (TCE), an Object Context Estimator (OCE), and a Scene Context Estimator (SCE). Given an input image, TCE encodes text cues to identify word-level semantic information, OCE extracts high-dimensional features and maps them to object semantic information, and SCE captures hierarchical structure and spatial layout information by recognizing scene cues. To aggregate these deep semantic features, we fuse the outputs of CNE and CXE for fine-grained classification. To the best of our knowledge, this is the first work to use text information from an arbitrary-oriented scene text detector to extract context information. Moreover, our method explores the fusion of semantic features and shows that scene features provide information complementary to the other cues. Finally, the proposed approach achieves state-of-the-art performance on a fine-grained classification dataset, reaching 84.3% on Con-Text.
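The fusion step described above can be illustrated with a minimal sketch. The abstract does not specify the fusion operator, the feature dimensions, or the classifier, so everything below is an assumption made for illustration: the four feature vectors are random placeholders standing in for the outputs of CNE, TCE, OCE, and SCE, fusion is plain concatenation of L2-normalized vectors, and the classifier is a single linear layer with softmax. The function names and the class count are hypothetical and are not taken from the paper.

```python
import math
import random


def l2_normalize(vec):
    """Scale a feature vector to unit L2 norm (zero vectors are returned unchanged)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def fuse_features(content, text_ctx, object_ctx, scene_ctx):
    """Fuse by concatenation. ASSUMPTION: the abstract does not name the operator."""
    fused = []
    for feat in (content, text_ctx, object_ctx, scene_ctx):
        # Normalize each cue separately so no single estimator dominates by scale.
        fused.extend(l2_normalize(feat))
    return fused


def softmax(scores):
    """Numerically stable softmax over a list of class scores."""
    top = max(scores)
    exps = [math.exp(s - top) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]


def classify(fused, weights, biases):
    """Linear classifier over the fused vector; `weights` holds one row per class."""
    scores = [
        sum(w * x for w, x in zip(row, fused)) + b
        for row, b in zip(weights, biases)
    ]
    return softmax(scores)


# Placeholder features (hypothetical dimensions) standing in for estimator outputs.
rng = random.Random(0)
cne_feat = [rng.random() for _ in range(8)]  # content features (CNE)
tce_feat = [rng.random() for _ in range(4)]  # text context (TCE)
oce_feat = [rng.random() for _ in range(6)]  # object context (OCE)
sce_feat = [rng.random() for _ in range(5)]  # scene context (SCE)

fused = fuse_features(cne_feat, tce_feat, oce_feat, sce_feat)

# Untrained random weights; the class count is illustrative only.
num_classes = 3
weights = [[rng.uniform(-1.0, 1.0) for _ in fused] for _ in range(num_classes)]
biases = [0.0] * num_classes

probs = classify(fused, weights, biases)
predicted = max(range(num_classes), key=probs.__getitem__)
```

Concatenation keeps each cue in its own slice of the fused vector, which makes it straightforward to drop one estimator and measure how much that cue contributes. Other operators, such as weighted sums or attention, would fit the same interface.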