Highlight Detection in Podcasts: A Multimodal Deep Learning Approach

Published: 2024 · Last Modified: 21 Jan 2026 · ICONIP (9) 2024 · CC BY-SA 4.0
Abstract: Podcasts have become a pervasive form of digital media, offering diverse content that often spans many hours. The sheer volume of podcast episodes, however, can make it challenging for listeners to locate the most engaging segments. Speech Emotion Recognition (SER) has seen remarkable advances through the integration of deep learning techniques. This work proposes applying deep learning techniques from SER to discern emotional cues within podcasts, thereby enabling the detection of highlights. The task is framed as a binary classification problem, where the positive class contains speech segments with high emotional activation. Transfer learning techniques from the computer vision and speech recognition domains are applied, utilizing pre-trained models such as ConvNeXt, Vision Transformer, and wav2vec 2.0, which are compared against a baseline Convolutional Neural Network-Transformer hybrid. Additionally, multimodal models are introduced that learn from two distinct modalities, log mel-spectrograms and high-dimensional vector embeddings, both extracted from the raw audio. The two modalities are combined using (i) a Simple Concatenated model and (ii) a CentralNet model. Experimental results demonstrate the effectiveness of combining two modalities over a single modality, achieving \(F_1\)-scores of 0.6111 and 0.6270 for the Simple Concatenated and CentralNet models, respectively.
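The concatenation-based fusion described in the abstract can be sketched as follows. This is a minimal illustrative example, not the paper's implementation: all layer sizes, feature dimensions, and names (`SimpleConcatFusion`, `spec_dim`, `emb_dim`) are assumptions; the paper's actual branches are deeper encoders over spectrograms and wav2vec 2.0 embeddings.

```python
import torch
import torch.nn as nn

class SimpleConcatFusion(nn.Module):
    """Late fusion by concatenation: encode each modality separately,
    concatenate the two representations, and classify with a shared head.
    Dimensions here are illustrative, not taken from the paper."""

    def __init__(self, spec_dim=128, emb_dim=768, hidden=256):
        super().__init__()
        # Branch 1: pooled log mel-spectrogram features (hypothetical size)
        self.spec_branch = nn.Sequential(nn.Linear(spec_dim, hidden), nn.ReLU())
        # Branch 2: pre-trained audio embeddings, e.g. a wav2vec 2.0 output
        self.emb_branch = nn.Sequential(nn.Linear(emb_dim, hidden), nn.ReLU())
        # Single logit for the binary highlight / non-highlight decision
        self.head = nn.Linear(2 * hidden, 1)

    def forward(self, spec_feat, emb_feat):
        # Concatenate the per-modality representations along the feature axis
        z = torch.cat([self.spec_branch(spec_feat), self.emb_branch(emb_feat)], dim=-1)
        return self.head(z)  # raw logit; apply sigmoid for a probability

model = SimpleConcatFusion()
logit = model(torch.randn(4, 128), torch.randn(4, 768))
print(logit.shape)  # one logit per example in the batch of 4
```

CentralNet, by contrast, adds a third "central" pathway that fuses the branches at several intermediate layers rather than only once at the end, which is consistent with its slightly higher reported \(F_1\)-score.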