Cognition-Supervised Saliency Detection: Contrasting EEG Signals and Visual Stimuli

Published: 20 Jul 2024, Last Modified: 21 Jul 2024 · MM2024 Oral · CC BY 4.0
Abstract: Understanding human assessment of semantically salient parts of multimedia content is crucial for developing human-centric applications, such as annotation tools, search and recommender systems, and systems able to generate new media matching human interests. However, the challenge of acquiring suitable supervision signals to detect semantic saliency without extensive manual annotation remains significant. Here, we explore a novel method that utilizes signals measured directly from human cognition via electroencephalogram (EEG) in response to natural visual perception. These signals are used for supervising representation learning to capture semantic saliency. Through a contrastive learning framework, our method aligns EEG data with visual stimuli, capturing human cognitive responses without the need for any manual annotation. Our approach demonstrates that the learned representations closely align with human-centric notions of visual saliency and achieve competitive performance in several downstream tasks, such as image classification and generation. As a contribution, we introduce an open EEG/image dataset from 30 participants, to facilitate further research in utilizing cognitive signals for multimodal data analysis, studying perception, and developing models for cross-modal representation learning.
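The abstract describes aligning EEG responses with their visual stimuli through contrastive learning. As a minimal illustration of that general idea (not the authors' actual architecture or training code), the sketch below implements a symmetric InfoNCE-style loss over a batch of paired EEG and image embeddings, where matched pairs are pulled together and mismatched pairs within the batch serve as negatives. The embedding dimensions, temperature value, and function name are illustrative assumptions.

```python
import numpy as np

def info_nce_loss(eeg_emb, img_emb, temperature=0.07):
    """Symmetric contrastive (InfoNCE-style) loss for paired embeddings.

    eeg_emb, img_emb: arrays of shape (batch, dim); row i of each is a
    matched EEG/image pair. Temperature is an illustrative default.
    """
    # L2-normalize so the dot product is cosine similarity.
    eeg = eeg_emb / np.linalg.norm(eeg_emb, axis=1, keepdims=True)
    img = img_emb / np.linalg.norm(img_emb, axis=1, keepdims=True)

    # Pairwise similarity matrix; diagonal entries are the matched pairs.
    logits = (eeg @ img.T) / temperature
    targets = np.arange(logits.shape[0])

    def cross_entropy(l):
        # Numerically stable log-softmax over each row.
        l = l - l.max(axis=1, keepdims=True)
        log_probs = l - np.log(np.exp(l).sum(axis=1, keepdims=True))
        return -log_probs[targets, targets].mean()

    # Average the EEG->image and image->EEG directions.
    return 0.5 * (cross_entropy(logits) + cross_entropy(logits.T))
```

With this loss, a batch of correctly paired embeddings scores lower than one where the image embeddings are unrelated to the EEG embeddings, which is the property a contrastive alignment objective relies on.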
Primary Subject Area: [Engagement] Emotional and Social Signals
Relevance To Conference: This submission directly addresses the challenges and opportunities at the intersection of human cognition (via EEG brain-computer interfacing data analysis) and multimedia/multimodal systems. Using novel contrastive learning methods and human-centered design principles, we introduce new automated techniques for learning compact representations of cognitive responses to visual stimuli that can reveal semantic saliency as indicated by the human cognitive system. The learned representations are shown to closely align with human-centric notions of visual semantic saliency, and show promising results in several downstream multimedia analysis tasks, such as image classification and generation. Through experimental validation, we show that our approach, without any manual annotations, achieves performance comparable to methods that rely on manual labels. Upon acceptance, we will also release an open EEG/image dataset from 30 participants and a code base, opening new avenues for research in utilizing natural human signals for multimedia research.
Supplementary Material: zip
Submission Number: 2294