Hypergraph Multi-modal Large Language Model: Exploiting EEG and Eye-tracking Modalities to Evaluate Heterogeneous Responses for Video Understanding
Abstract: Understanding of video creativity and content often varies among individuals: focal points and cognitive levels differ across ages, experiences, and genders. Research in this area remains scarce, and most existing benchmarks suffer from several drawbacks: 1) a limited number of modalities and answers of restricted length; 2) monotonous video content and scenarios that convey overly simplistic allegories and emotions. To bridge the gap to real-world applications, we introduce a large-scale Video $\textbf{S}$ubjective $\textbf{M}$ulti-modal $\textbf{E}$valuation dataset, namely Video-SME. Specifically, we collected changes in electroencephalographic (EEG) signals and eye-tracking regions from different demographics while they viewed identical video content. Utilizing this multi-modal dataset, we developed tasks and protocols to analyze and evaluate the extent to which different users cognitively understand video content. Along with the dataset, we designed a $\textbf{H}$ypergraph $\textbf{M}$ulti-modal $\textbf{L}$arge $\textbf{L}$anguage $\textbf{M}$odel (HMLLM) to explore the associations among different demographics, video elements, EEG and eye-tracking indicators. HMLLM bridges semantic gaps across rich modalities and integrates information across them to perform logical reasoning. Extensive experimental evaluations on Video-SME and additional video-based generative performance benchmarks demonstrate the effectiveness of our method. The code and dataset are available at https://github.com/mininglamp-MLLM/HMLLM
Primary Subject Area: [Content] Multimodal Fusion
Secondary Subject Area: [Engagement] Emotional and Social Signals, [Content] Vision and Language
Relevance To Conference: We propose a multi-modal video dataset annotated with electroencephalogram (EEG) signals, eye-movement signals, and audience profiles. Furthermore, we introduce a hypergraph-based multi-modal large language model designed to fully leverage the modalities present in our dataset for video content analysis. These components are highly relevant to the theme of the conference.
Supplementary Material: zip
Submission Number: 457