Abstract: Many emerging application areas in video and image processing require large-scale visual concept detection. Examples include content-based indexing of online user-generated videos and 24/7 archival of TV broadcasts. The current state of the art in concept detection uses bag-of-visual-words features with computationally heavy exponential kernel classifiers. We argue that this classifier approach is not feasible for large-scale real-time applications, and propose instead to use combinations of approximate additive kernel classifiers. By using explicit kernel maps and the power mean SVM, followed by fusion of classifiers trained on different features, we achieve high retrieval precision while retaining real-time performance for large sets of concepts. This paper presents a series of experiments with the large-scale TRECVID 2012 video database and the commonly used Fifteen Scene Categories image database. We show significantly improved retrieval performance over standard linear classifiers. With late fusion over several visual features, the approximate additive kernel classifiers outperform any single exponential kernel classifier while needing only a fraction of the detection time.
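As a rough illustration of the two ideas named above (an explicit kernel map that lets a linear classifier stand in for an additive kernel classifier, and late fusion of per-feature classifier scores), the following minimal sketch may help. It is not the paper's method: it swaps in the Hellinger kernel, whose explicit map is exact and trivial, instead of the approximate maps and power mean SVM used in the paper, and it assumes late fusion is a plain average of scores. All function names are hypothetical.

```python
import numpy as np


def hellinger_kernel(x, y):
    """Additive Hellinger kernel: k(x, y) = sum_i sqrt(x_i * y_i)."""
    x = np.asarray(x, dtype=float)
    y = np.asarray(y, dtype=float)
    return float(np.sum(np.sqrt(x * y)))


def hellinger_map(hist):
    """Explicit feature map phi(x) = sqrt(x), so phi(x) . phi(y) = k(x, y).

    Assumption: inputs are non-negative histograms (e.g. bag-of-visual-words).
    """
    return np.sqrt(np.asarray(hist, dtype=float))


def linear_score(weights, bias, hist):
    """Score a histogram with a linear classifier in the mapped space.

    Cost is one dot product, independent of the number of support vectors.
    """
    return float(np.dot(weights, hellinger_map(hist)) + bias)


def late_fusion(scores_per_feature):
    """Assumed fusion rule: average the scores of classifiers trained on
    different features (one row per feature type, one column per sample)."""
    return np.mean(np.asarray(scores_per_feature, dtype=float), axis=0)


if __name__ == "__main__":
    x = [0.25, 0.25, 0.5]
    y = [0.5, 0.5, 0.0]
    # The dot product of the mapped vectors reproduces the kernel value.
    print(np.dot(hellinger_map(x), hellinger_map(y)), hellinger_kernel(x, y))
    # Fuse scores from two hypothetical per-feature classifiers on two samples.
    print(late_fusion([[0.2, -1.0], [0.6, 0.0]]))
```

The point of the sketch is the cost structure: once the map is explicit, detection reduces to one dot product per concept and feature, which is what makes a real-time budget plausible for large concept sets.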