Abstract: Audiovisual scene understanding is a challenging problem due to the unstructured spatial-temporal relations in audio signals and the spatial layouts of different objects in visual images. Recently, many studies have focused on extracting features with convolutional neural networks, while learning explicit, semantically relevant frames of sound signals and visual images has been overlooked. To this end, we present an end-to-end framework, namely, the attentional graph convolutional network (AGCN), for structure-aware audiovisual scene representation. First, the sound spectrogram and the input image are processed by a backbone network for feature extraction. Then, to build multiscale hierarchical information of the input signals, an attention fusion mechanism aggregates features from multiple layers of the backbone network. Notably, to represent both salient regions and contextual information, a salient acoustic graph (SAG), a contextual acoustic graph (CAG), a salient visual graph (SVG), and a contextual visual graph (CVG) are constructed for the audiovisual scene representation. Finally, the constructed graphs pass through a graph convolutional network for structure-aware audiovisual scene recognition. Extensive experiments on audio, visual, and audiovisual scene recognition datasets show that AGCN achieves promising results; in particular, it reaches 90.6% precision on the ADVANCE dataset, a 20.5% improvement over the previous method. Visualizations of the constructed graphs on spectrograms and images show that the proposed CAG/SAG and CVG/SVG focus on salient and semantically relevant regions.
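To make the described pipeline concrete, below is a minimal PyTorch sketch of the three steps the abstract names: attention fusion over multi-layer backbone features, construction of a salient graph from the fused feature map, and a graph-convolution step. All shapes, hyperparameters (number of fused layers, top-k node selection), and the cosine-similarity adjacency are illustrative assumptions, not the paper's actual SAG/CAG/SVG/CVG construction.

```python
# A minimal sketch of the abstract's pipeline. The top-k node selection
# and cosine-similarity adjacency are assumptions for illustration; the
# paper's graph construction may differ.
import torch
import torch.nn as nn
import torch.nn.functional as F


class AttentionFusion(nn.Module):
    """Aggregate same-shaped feature maps from several backbone layers
    with learned per-layer attention weights."""

    def __init__(self, num_layers: int, channels: int):
        super().__init__()
        self.score = nn.ModuleList(
            [nn.Conv2d(channels, 1, kernel_size=1) for _ in range(num_layers)]
        )

    def forward(self, feats):  # feats: list of (B, C, H, W)
        # One scalar attention logit per layer, from pooled score maps.
        logits = torch.stack(
            [s(f).mean(dim=(2, 3)) for s, f in zip(self.score, feats)], dim=1
        )                                   # (B, L, 1)
        weights = F.softmax(logits, dim=1)  # normalize across layers
        return sum(w.unsqueeze(-1).unsqueeze(-1) * f
                   for w, f in zip(weights.unbind(1), feats))


def build_salient_graph(fmap: torch.Tensor, k: int = 16):
    """Pick the k most activated spatial positions as graph nodes and
    connect them by cosine similarity (a stand-in for the salient
    acoustic/visual graph construction)."""
    B, C, H, W = fmap.shape
    flat = fmap.flatten(2).transpose(1, 2)   # (B, H*W, C)
    saliency = flat.norm(dim=-1)             # activation strength per position
    idx = saliency.topk(k, dim=1).indices    # (B, k)
    nodes = torch.gather(flat, 1, idx.unsqueeze(-1).expand(-1, -1, C))
    adj = F.softmax(F.cosine_similarity(
        nodes.unsqueeze(2), nodes.unsqueeze(1), dim=-1), dim=-1)  # (B, k, k)
    return nodes, adj


class GCNLayer(nn.Module):
    """One graph-convolution step: X' = ReLU(A X W)."""

    def __init__(self, in_dim: int, out_dim: int):
        super().__init__()
        self.lin = nn.Linear(in_dim, out_dim)

    def forward(self, x, adj):               # x: (B, k, C), adj: (B, k, k)
        return F.relu(torch.bmm(adj, self.lin(x)))


if __name__ == "__main__":
    B, C, H, W = 2, 256, 14, 14
    feats = [torch.randn(B, C, H, W) for _ in range(3)]  # e.g. 3 backbone stages
    fused = AttentionFusion(num_layers=3, channels=C)(feats)
    nodes, adj = build_salient_graph(fused, k=16)
    out = GCNLayer(C, 128)(nodes, adj)       # (B, 16, 128) node embeddings
    print(out.shape)
```

In this sketch a contextual graph (CAG/CVG) could be built analogously from the remaining, non-salient positions; the node embeddings produced by the GCN layer would then be pooled and fed to a classifier for scene recognition.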