Abstract: With advances in neural network architectures for computer vision and language processing, multiple modalities of a video can be used for complex content analysis. Here, we propose an architecture that combines visual, audio, and text data for video analytics. The model leverages six different modules: action recognition, voiceover detection, speech transcription, scene captioning, optical character recognition (OCR), and object recognition. The proposed integration mechanism combines the output of all the modules into a text-based data structure. We demonstrate our model's performance in two applications: a clustering module which groups a corpus of videos into labelled clusters based on their semantic similarity, and a ranking module which returns a ranked list of videos based on a keyword. Our analysis of the precision-recall graphs shows that using a multi-modal approach offers an overall performance boost over any single modality.
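To illustrate the integration mechanism described above, the sketch below fuses per-modality text outputs into one document per video and then runs the two applications (clustering and keyword ranking). This is not the authors' implementation; the example videos, the use of TF-IDF features, KMeans clustering, and cosine-similarity ranking are assumptions made only for illustration.

```python
# Illustrative sketch (not the paper's code): fuse per-module text outputs into
# one text document per video, then cluster and rank by keyword.
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.metrics.pairwise import cosine_similarity

# Hypothetical module outputs per video; in the paper these would come from
# action recognition, voiceover detection, speech transcription,
# scene captioning, OCR, and object recognition.
videos = {
    "vid_001": {
        "actions": "cooking chopping stirring",
        "speech": "today we make a simple pasta sauce",
        "caption": "a person cooking in a kitchen",
        "ocr": "olive oil",
        "objects": "pan knife tomato",
    },
    "vid_002": {
        "actions": "dribbling shooting",
        "speech": "the home team takes the lead",
        "caption": "players on a basketball court",
        "ocr": "score 54 48",
        "objects": "ball hoop jersey",
    },
}

# Integration step: concatenate all modality outputs into one text document.
docs = {vid: " ".join(fields.values()) for vid, fields in videos.items()}

vectorizer = TfidfVectorizer()
matrix = vectorizer.fit_transform(docs.values())

# Clustering application: group videos by semantic similarity of the fused text.
labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(matrix)
print(dict(zip(docs.keys(), labels)))

# Ranking application: score videos against a keyword query.
query_vec = vectorizer.transform(["cooking pasta"])
scores = cosine_similarity(query_vec, matrix).ravel()
ranked = sorted(zip(docs.keys(), scores), key=lambda pair: -pair[1])
print(ranked)
```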