Abstract: Previous studies have shown that the complexity and variation of event images are the major challenges in event classification. We approach the problem with an integrated methodology that uses a Long Short-Term Memory (LSTM) network to fuse multiple Convolutional Neural Networks (CNNs). To address image complexity, we employ three dedicated CNNs to extract scene, object, and human visual cues, respectively. To reduce the semantic gap and exploit the complementarity of features at different levels, we adopt AlexNet and the VGG-16 network as base architectures and concatenate the outputs of their first and second fully-connected layers. To capture the contextual correlations between visual cues, we arrange the concatenated features of the three CNNs in the sequence scene, object, human and feed them into the LSTM network. To further incorporate context, we crop each image into five blocks as input, so that an individual image is supplemented with contextual features through the temporal modeling of the LSTM. We evaluate our method on the Web Image Dataset for Event Recognition (WIDER), and the results demonstrate the effectiveness of each of the above components. Compared with state-of-the-art methods, the proposed method yields a considerable improvement in event classification performance.
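The sequential fusion described above can be sketched as follows. This is a minimal PyTorch illustration, not the authors' implementation: the feature dimension, hidden size, and module names are hypothetical stand-ins (the paper concatenates the 4096-d first and second fully-connected outputs of AlexNet and VGG-16 per cue), and the number of event classes assumes the 61 categories of the WIDER dataset.

```python
import torch
import torch.nn as nn

# Hypothetical dimensions for illustration only.
FEAT_DIM = 256     # stand-in for the concatenated AlexNet + VGG-16 feature size
HIDDEN = 128       # stand-in LSTM hidden size
NUM_EVENTS = 61    # assumes the 61 event classes of WIDER

class CueSequenceFusion(nn.Module):
    """Fuse scene, object, and human features as a 3-step sequence via an LSTM."""
    def __init__(self):
        super().__init__()
        self.lstm = nn.LSTM(FEAT_DIM, HIDDEN, batch_first=True)
        self.classifier = nn.Linear(HIDDEN, NUM_EVENTS)

    def forward(self, scene, obj, human):
        # Arrange the three cue features in the order scene -> object -> human,
        # matching the sequence fed to the LSTM in the paper.
        seq = torch.stack([scene, obj, human], dim=1)  # (B, 3, FEAT_DIM)
        out, _ = self.lstm(seq)                        # (B, 3, HIDDEN)
        return self.classifier(out[:, -1])             # classify from last step

# Toy usage with random features standing in for CNN outputs.
batch = 4
scene = torch.randn(batch, FEAT_DIM)
obj = torch.randn(batch, FEAT_DIM)
human = torch.randn(batch, FEAT_DIM)
logits = CueSequenceFusion()(scene, obj, human)
print(tuple(logits.shape))  # (4, 61)
```

The same recurrent fusion would also accommodate the five cropped blocks per image: each block's features become additional time steps, letting the LSTM accumulate contextual evidence across crops.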