Abstract: Livestreaming has become a prominent medium for sharing real-time content, including gaming, sports events, financial investment, and various other forms of live entertainment. However, livestreams can be lengthy, often spanning several hours, making it time-consuming and challenging for users to find the most interesting and engaging moments within the content. In this work, we formalize the task of \textit{Livestream Highlight Segmentation} and propose the first direct livestream highlight segmentation model, \textit{AntPivot}, which addresses the challenges of multi-modal fusion, long duration, and sparse highlights. Specifically, 1) to accelerate highlight segmentation research in the insurance and finance domain, we release a fully-annotated dataset, \textit{AntHighlight}; 2) we introduce a multi-modal fusion module that encodes the raw data into a unified representation and models temporal relations via a chunked attention mechanism to capture highlight clues; 3) we propose a dynamic-programming decoding scheme that optimizes the detection of highlight clips by searching for the optimal decision sequence. Extensive experiments demonstrate that AntPivot outperforms text-only models and achieves state-of-the-art results, and ablation studies further validate the effectiveness of our methods. All code and data will be released publicly with the camera-ready version.
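As a rough illustration of the two mechanisms named in the abstract, the sketches below show (a) self-attention restricted to fixed-size chunks of a long feature sequence and (b) a Viterbi-style dynamic-programming decoder over per-timestep highlight scores. These are minimal sketches under assumed interfaces: the function names (`chunked_self_attention`, `dp_decode`), the feature shapes, the two-state highlight/background labeling, and the `switch_penalty` are illustrative assumptions, not the paper's actual formulation.

```python
import torch
import torch.nn as nn

def chunked_self_attention(x: torch.Tensor,
                           attn: nn.MultiheadAttention,
                           chunk_size: int = 128) -> torch.Tensor:
    """Self-attention restricted to fixed-size chunks of a long sequence.

    x: (T, D) fused multi-modal features for T timesteps (assumed shape).
    Restricting attention to chunks keeps the cost roughly linear in T
    instead of quadratic, which is what makes hour-long streams tractable.
    """
    outputs = []
    for start in range(0, x.size(0), chunk_size):
        chunk = x[start:start + chunk_size].unsqueeze(1)  # (C, 1, D), seq-first
        out, _ = attn(chunk, chunk, chunk)                # attend within the chunk only
        outputs.append(out.squeeze(1))
    return torch.cat(outputs, dim=0)

# Example: 1000 timesteps of 256-d fused features
attn = nn.MultiheadAttention(embed_dim=256, num_heads=4)
features = torch.randn(1000, 256)
encoded = chunked_self_attention(features, attn)  # (1000, 256)
```

Given per-timestep scores for the two hypothetical states, the decoding step can then be sketched as a standard dynamic program that trades off score against a penalty for fragmenting the stream into many short clips:

```python
import numpy as np

def dp_decode(scores: np.ndarray, switch_penalty: float = 1.0) -> list[int]:
    """Viterbi-style decoding over per-timestep scores.

    scores: (T, 2) array; scores[t, 0] = background score,
            scores[t, 1] = highlight score (hypothetical inputs).
    switch_penalty: cost for changing label between adjacent timesteps,
            discouraging overly fragmented highlight clips.
    Returns the optimal label sequence (0 = background, 1 = highlight).
    """
    T, K = scores.shape
    best = np.full((T, K), -np.inf)   # best cumulative score per state
    back = np.zeros((T, K), dtype=int)  # backpointers for traceback
    best[0] = scores[0]
    for t in range(1, T):
        for k in range(K):
            # stay in the same state for free, or pay a penalty to switch
            cand = best[t - 1] - switch_penalty * (np.arange(K) != k)
            back[t, k] = int(np.argmax(cand))
            best[t, k] = cand[back[t, k]] + scores[t, k]
    # trace back the optimal decision sequence
    labels = [int(np.argmax(best[-1]))]
    for t in range(T - 1, 0, -1):
        labels.append(back[t, labels[-1]])
    return labels[::-1]
```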
Paper Type: long
Research Area: Multimodality and Language Grounding to Vision, Robotics and Beyond
Contribution Types: Model analysis & interpretability, Data resources
Languages Studied: Chinese