Hypergraph Multi-modal Large Language Model: Exploiting EEG and Eye-tracking Modalities to Evaluate Heterogeneous Responses for Video Understanding

Published: 20 Jul 2024, Last Modified: 06 Aug 2024 · MM 2024 Oral · CC BY 4.0
Abstract: Understanding of video creativity and content often varies among individuals, with differences in focal points and cognitive levels across ages, experiences, and genders. Research in this area remains scarce, and most existing benchmarks suffer from several drawbacks: 1) a limited number of modalities and answers of restricted length; 2) the content and scenarios in the videos are monotonous, conveying overly simplistic allegories and emotions. To bridge the gap to real-world applications, we introduce a large-scale Video $\textbf{S}$ubjective $\textbf{M}$ulti-modal $\textbf{E}$valuation dataset, namely Video-SME. Specifically, we collected real Electroencephalographic (EEG) signals and eye-tracking regions from viewers of different demographics while they watched identical video content. Utilizing this multi-modal dataset, we developed tasks and protocols to analyze and evaluate the extent of cognitive understanding of video content among different users. Along with the dataset, we designed a $\textbf{H}$ypergraph $\textbf{M}$ulti-modal $\textbf{L}$arge $\textbf{L}$anguage $\textbf{M}$odel (HMLLM) to explore the associations among different demographics, video elements, EEG, and eye-tracking indicators. HMLLM can bridge semantic gaps across rich modalities and integrate information from different modalities to perform logical reasoning. Extensive experimental evaluations on Video-SME and additional video-based generative performance benchmarks demonstrate the effectiveness of our method. The code and dataset are available at https://github.com/mininglamp-MLLM/HMLLM
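To make the hypergraph fusion idea concrete, below is a minimal, illustrative sketch (not the authors' HMLLM implementation) of a single HGNN-style hypergraph convolution that mixes hypothetical video, EEG, and eye-tracking node features through shared hyperedges, e.g. one hyperedge per video segment or demographic group. All shapes, groupings, and names here are assumptions made for illustration only.

```python
# Minimal sketch: one hypergraph convolution over multimodal nodes.
# NOT the paper's HMLLM; a generic HGNN-style layer for illustration.
import numpy as np

def hypergraph_conv(X, H, Theta):
    """X' = ReLU(Dv^-1/2 H W De^-1 H^T Dv^-1/2 X Theta).

    X:     (n_nodes, d_in)   node features (video / EEG / eye-tracking tokens)
    H:     (n_nodes, n_edges) incidence matrix, H[v, e] = 1 if node v lies in hyperedge e
    Theta: (d_in, d_out)     learnable projection
    """
    W = np.ones(H.shape[1])                           # uniform hyperedge weights
    Dv = H @ W                                        # node degrees
    De = H.sum(axis=0)                                # hyperedge degrees
    Dv_inv_sqrt = np.diag(1.0 / np.sqrt(np.maximum(Dv, 1e-12)))
    De_inv = np.diag(1.0 / np.maximum(De, 1e-12))
    A = Dv_inv_sqrt @ H @ np.diag(W) @ De_inv @ H.T @ Dv_inv_sqrt
    return np.maximum(A @ X @ Theta, 0.0)             # ReLU activation

# Toy example: 3 video-frame tokens, 2 EEG tokens, 2 eye-tracking tokens,
# grouped by 2 video segments (the hyperedges).
rng = np.random.default_rng(0)
X = rng.normal(size=(7, 16))                          # 7 multimodal nodes, 16-dim features
H = np.zeros((7, 2))
H[[0, 1, 3, 5], 0] = 1                                # segment 1: frames 0-1, EEG 0, gaze 0
H[[2, 4, 6], 1] = 1                                   # segment 2: frame 2, EEG 1, gaze 1
Theta = rng.normal(size=(16, 16)) * 0.1
fused = hypergraph_conv(X, H, Theta)                  # fused tokens could then condition an LLM
print(fused.shape)                                    # (7, 16)
```

In this sketch, nodes from different modalities exchange information only through the hyperedges they share, which is one plausible way to associate EEG and gaze responses with the video segments that elicited them.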
Primary Subject Area: [Content] Multimodal Fusion
Secondary Subject Area: [Engagement] Emotional and Social Signals, [Content] Vision and Language
Relevance To Conference: We propose a multi-modal video dataset that includes labels of electroencephalogram (EEG) signals, eye-movement signals, and audience profiles. Furthermore, we introduce a hypergraph-based multi-modal large language model designed to fully leverage the modalities in our dataset for video content analysis. These components are highly relevant to the theme of the conference.
Supplementary Material: zip
Submission Number: 457