Abstract: Multimedia event detection (MED) is the task of detecting given events (e.g. parade, birthday party) in a large collection of video clips. While the most useful information comes from visual features and speech recognition, a lot can also be inferred from the non-speech audio content, either alone or in conjunction with visual and speech cues. This paper studies MED with non-speech audio information only. MED is usually performed in two stages. The first stage generates a representation for each clip in the form of either a single vector or a sequence of vectors, often by aggregating frame-level features; the second stage performs binary or multi-class classification to decide whether each target event occurs in each clip. Common classifiers used for the second stage include support vector machines (SVMs), feed-forward deep neural networks (DNNs), and recurrent neural networks (RNNs). In this paper, we propose to classify clips for events using "recurrent SVMs". These models combine the kernel mapping and the large-margin optimization criterion of SVMs, and the ability to process sequences of variable lengths of RNNs. Reinforced with data augmentation, recurrent SVMs have achieved higher mean average precision (MAP) on the TRECVID 2011 MED task than both SVMs and RNNs.
0 Replies
Loading