Unsupervised person clustering in videos with cross-modal communication

VCIP 2016 (modified: 03 Nov 2022)
Abstract: Existing multi-modal person identification solutions achieve plausible identification accuracy on TV content because supervised information is used to train the identification model, either through explicitly customized labels or labels implicitly derived from transcripts. However, such explicit and implicit supervision is unavailable in many scenes. To tackle this problem, an unsupervised audio-visual person clustering scheme is proposed that exploits the inherent links between speech and faces. First, deep features for the individual audio and visual modalities are learned under metric criteria with neural networks, providing powerful representations for person identification. Furthermore, audio-visual cross-modal communication is established to achieve multi-modal clustering around shared person concepts. Experiments conducted on TV content demonstrate the effectiveness and superiority of the proposed solution.
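The abstract does not give implementation details, but the core idea can be sketched as follows. This is an illustrative toy, not the authors' method: it assumes metric-learned face and voice embeddings in which cosine similarity reflects identity, and links two tracks whenever either modality's similarity crosses a threshold, a crude stand-in for the paper's cross-modal communication. The function and variable names are hypothetical.

```python
# Illustrative sketch (not the paper's code): cluster person tracks by
# fusing face- and voice-embedding similarities via union-find.
import math

def cosine(u, v):
    """Cosine similarity between two embedding vectors."""
    dot = sum(a * b for a, b in zip(u, v))
    nu = math.sqrt(sum(a * a for a in u))
    nv = math.sqrt(sum(b * b for b in v))
    return dot / (nu * nv)

def cluster_tracks(face_emb, voice_emb, threshold=0.8):
    """Merge tracks i and j when either the face or the voice
    similarity exceeds the threshold; return a cluster id per track."""
    n = len(face_emb)
    parent = list(range(n))

    def find(i):  # union-find root with path compression
        while parent[i] != i:
            parent[i] = parent[parent[i]]
            i = parent[i]
        return i

    for i in range(n):
        for j in range(i + 1, n):
            sim = max(cosine(face_emb[i], face_emb[j]),
                      cosine(voice_emb[i], voice_emb[j]))
            if sim >= threshold:
                parent[find(i)] = find(j)

    return [find(i) for i in range(n)]

# Toy data: tracks 0 and 1 point the same way in both spaces; track 2 differs.
faces  = [[1.0, 0.0], [0.9, 0.1], [0.0, 1.0]]
voices = [[0.0, 1.0], [0.2, 0.9], [1.0, 0.0]]
labels = cluster_tracks(faces, voices)
# tracks 0 and 1 share a cluster id; track 2 gets its own
```

In the paper's setting the thresholded pairwise linking would be replaced by the proposed cross-modal communication between the audio and visual clusterings, but the fusion-then-cluster structure is the same.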