Abstract: Industrial intelligent devices are usually equipped with both microphones and cameras to perceive and understand the physical world. Although visual object detection has achieved great success, its combination with other sensing modalities remains largely unsolved. In this article, we establish a novel sound-induced attention framework for visual object detection and develop a two-stream weakly supervised deep learning architecture that combines the visual and audio modalities to localize the sounding object. A dataset is constructed from Audio Set to validate the proposed method, and realistic experiments are conducted to demonstrate the effectiveness of the proposed system.
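The abstract does not specify the architecture's internals, but the core idea of sound-induced attention can be illustrated with a minimal sketch: an audio embedding scores visual region features by similarity, and the resulting attention map both pools the visual stream and serves as a localization signal under weak supervision. All function names, dimensions, and the dot-product scoring below are illustrative assumptions, not the paper's actual model.

```python
import math

def softmax(xs):
    # Numerically stable softmax over a list of scores.
    m = max(xs)
    exps = [math.exp(x - m) for x in xs]
    total = sum(exps)
    return [e / total for e in exps]

def sound_induced_attention(visual_feats, audio_feat):
    """Hypothetical sketch of audio-guided attention over visual regions.

    visual_feats: list of R region feature vectors, each of dimension D
                  (e.g. cells of a CNN feature map, flattened).
    audio_feat:   a single audio embedding of dimension D.

    Returns (attention weights over regions, attended visual feature).
    The attention map indicates where the sounding object likely is,
    which is the weak localization signal described in the abstract.
    """
    # Score each visual region by dot-product similarity with the audio.
    scores = [sum(v * a for v, a in zip(region, audio_feat))
              for region in visual_feats]
    attn = softmax(scores)
    # Attention-weighted pooling of the visual features.
    dim = len(audio_feat)
    pooled = [sum(attn[r] * visual_feats[r][d]
                  for r in range(len(visual_feats)))
              for d in range(dim)]
    return attn, pooled
```

In a weakly supervised setting, only clip-level labels would drive training of the pooled feature's classifier, while the attention weights are read out at test time to localize the sounding object.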