How important are visual features for visual grounding? It depends

13 Apr 2022 · OpenReview Archive Direct Upload
Abstract: Multi-modal transformer solutions have become the mainstay of visual grounding, where the task is to select a specific object in an image based on a query. In this work, we explore and quantify the importance of CNN-derived visual features in these transformers, and test whether these features can be replaced by a semantically driven approach using a scene graph. We propose a new approach for visual grounding based on BERT (Devlin et al., 2019), named metaBERT, that enables reasoning over scene graphs. To quantify the importance of visual features, we inject both the scene graph information and the visual features into metaBERT. We find that the additional performance gained from the visual features varies across datasets, but is mostly limited to a 10-15% accuracy improvement. Through detailed experiments, we explore the effect of scene graph quality on performance, and observe that utilizing scene graphs is notably beneficial for selecting non-human objects.
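The abstract does not spell out how the scene graph and the visual features are injected into the model; the sketch below shows one plausible arrangement, assuming a PyTorch/HuggingFace BERT backbone. The class name `SceneGraphGroundingModel`, the "subject relation object" triple serialization, and the region-feature dimensions are all illustrative assumptions, not the paper's actual implementation.

```python
# A minimal sketch (not the authors' released code): a BERT-based grounding
# model that consumes a query plus serialized scene-graph triples, and can
# optionally append CNN-derived region features as extra tokens. Ablating
# the visual branch is then a matter of passing visual_feats=None.

import torch
import torch.nn as nn
from transformers import BertModel, BertTokenizer


class SceneGraphGroundingModel(nn.Module):  # illustrative name, not from the paper
    def __init__(self, visual_dim: int = 2048, bert_name: str = "bert-base-uncased"):
        super().__init__()
        self.bert = BertModel.from_pretrained(bert_name)
        hidden = self.bert.config.hidden_size
        # Project CNN region features (e.g. from a detector backbone)
        # into BERT's embedding space so they can ride along as tokens.
        self.visual_proj = nn.Linear(visual_dim, hidden)
        # One score per position; a real system would pool per candidate object.
        self.scorer = nn.Linear(hidden, 1)

    def forward(self, input_ids, attention_mask, visual_feats=None):
        # Text branch: query + serialized scene-graph triples.
        text_embeds = self.bert.embeddings.word_embeddings(input_ids)
        if visual_feats is not None:
            # Append projected visual features as extra "tokens"; dropping
            # them quantifies how much the visual features contribute.
            vis_embeds = self.visual_proj(visual_feats)
            inputs_embeds = torch.cat([text_embeds, vis_embeds], dim=1)
            vis_mask = torch.ones(visual_feats.shape[:2], dtype=attention_mask.dtype)
            attention_mask = torch.cat([attention_mask, vis_mask], dim=1)
        else:
            inputs_embeds = text_embeds
        out = self.bert(inputs_embeds=inputs_embeds, attention_mask=attention_mask)
        return self.scorer(out.last_hidden_state).squeeze(-1)


tokenizer = BertTokenizer.from_pretrained("bert-base-uncased")
# Scene graph serialized as triples alongside the query (format assumed).
text = "the dog left of the bench [SEP] dog left-of bench ; bench under tree"
enc = tokenizer(text, return_tensors="pt")
model = SceneGraphGroundingModel()
regions = torch.randn(1, 4, 2048)  # 4 dummy region features
scores = model(enc["input_ids"], enc["attention_mask"], visual_feats=regions)
print(scores.shape)  # (1, seq_len + 4)
```

Serializing the graph as text keeps the model a standard single-stream BERT, so the scene-graph-only and scene-graph-plus-vision conditions differ only in whether the extra visual tokens are appended.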