Hawkeye: Discovering and Grounding Implicit Anomalous Sentiment in Recon-videos via Scene-enhanced Video Large Language Model

Published: 20 Jul 2024, Last Modified: 05 Aug 2024 · MM 2024 Oral · CC BY 4.0
Abstract: In real-world recon-videos, such as surveillance and drone reconnaissance footage, the explicit cues commonly relied upon, i.e., language, acoustic signals, and facial expressions, are often missing. However, these videos are frequently rich in anomalous sentiments (e.g., criminal tendencies), which urgently require implicit scene information (e.g., actions and object relations) to be identified quickly and precisely. Motivated by this, this paper proposes a new chat-paradigm Implicit anomalous sentiment Discovering and grounding (IasDig) task, which aims to interactively and rapidly discover and ground anomalous sentiments in recon-videos by leveraging implicit scene information (i.e., actions and object relations). Furthermore, this paper argues that the IasDig task faces two key challenges, i.e., scene modeling and scene balancing. To this end, this paper proposes a new Scene-enhanced Video Large Language Model named Hawkeye, i.e., acting like a raptor (e.g., a hawk) to discover and locate prey, for the IasDig task. Specifically, this approach designs a graph-structured scene modeling module and a balanced heterogeneous MoE module to address the above two challenges, respectively. Extensive experimental results on our constructed scene-sparsity and scene-density IasDig datasets demonstrate the clear advantage of Hawkeye over advanced Video-LLM baselines on IasDig, especially on the metric of false negative rate. This justifies the importance of scene information for identifying implicit anomalous sentiments and the practicality of Hawkeye for real-world applications.
Primary Subject Area: [Engagement] Emotional and Social Signals
Secondary Subject Area: [Experience] Multimedia Applications
Relevance To Conference: This paper introduces a multimodal task called IasDig, which bridges the gap between LLMs in NLP and the multimedia community. It represents the first attempt to incorporate action and object relation information for discovering and grounding implicit anomalous sentiments in recon-videos. To achieve this, we propose a tailored model named Hawkeye, which contains a Graph-structured Scene Modeling Module that captures these contextual cues and a Balanced Heterogeneous MoE Module that balances the scene information during the alignment phase with LLMs. Furthermore, we construct two task-specific video instruction datasets: the Scene-sparsity dataset and the Scene-density dataset. Experimental results on these datasets demonstrate the significant advantages of our proposed Hawkeye approach over advanced Video-oriented Large Language Models. The contributions of our work to multimedia/multimodal processing are as follows: 1. Introducing LLMs to Multimedia: We extend the application of LLMs to the multimedia domain, broadening the research perspective. 2. Action and Object Relation Consideration: By incorporating action and object relation information, we achieve breakthroughs in discovering and localizing implicit anomalous sentiments in recon-videos. 3. Novel Modules: Our novel Graph-structured Scene Modeling Module and Balanced Heterogeneous MoE Module provide new tools and methods for multimodal processing.
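To make the "balancing" idea behind the Balanced Heterogeneous MoE Module concrete, the sketch below shows a generic top-1 MoE router with a load-balancing auxiliary loss, the standard mechanism for keeping expert utilization even. Note this is a minimal illustration of the general technique, not the paper's actual module: the function names, dimensions, and the Switch-Transformer-style loss are all assumptions, since the paper's implementation details are not given here.

```python
import numpy as np

def softmax(x, axis=-1):
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moe_route(tokens, gate_w, num_experts):
    """Top-1 MoE routing with a load-balancing auxiliary loss.

    A generic sketch (Switch-Transformer-style), NOT the paper's
    Balanced Heterogeneous MoE Module itself.
    """
    logits = tokens @ gate_w                 # (n_tokens, num_experts)
    probs = softmax(logits, axis=-1)         # router probabilities
    assignment = probs.argmax(axis=-1)       # top-1 expert per token
    # Fraction of tokens actually dispatched to each expert.
    frac_tokens = np.bincount(assignment, minlength=num_experts) / len(tokens)
    # Mean router probability mass per expert.
    frac_probs = probs.mean(axis=0)
    # Auxiliary loss: smallest when both distributions are uniform,
    # i.e., when the experts carry a balanced share of the load.
    aux_loss = num_experts * float(np.dot(frac_tokens, frac_probs))
    return assignment, aux_loss

# Hypothetical usage: tokens standing in for fused action /
# object-relation scene features routed to 4 heterogeneous experts.
rng = np.random.default_rng(0)
tokens = rng.normal(size=(32, 16))
gate_w = rng.normal(size=(16, 4))
assignment, aux_loss = moe_route(tokens, gate_w, num_experts=4)
```

In training, such an auxiliary loss would be added (with a small weight) to the main objective so the router does not collapse onto a few experts, which is the failure mode that "scene balancing" mechanisms of this kind are designed to prevent.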
Submission Number: 3804