$E^3$: Exploring Embodied Emotion Through A Large-Scale Egocentric Video Dataset

Published: 26 Sept 2024, Last Modified: 13 Nov 2024NeurIPS 2024 Track Datasets and Benchmarks PosterEveryoneRevisionsBibTeXCC BY 4.0
Keywords: emotion analysis; egocentric datasets ; video understanding;
TL;DR: We introduced a large-scale collection of a first-person view videos that capture people's emotions.
Abstract: Understanding human emotions is fundamental to enhancing human-computer interaction, especially for embodied agents that mimic human behavior. Traditional emotion analysis often takes a third-person perspective, limiting the ability of agents to interact naturally and empathetically. To address this gap, this paper presents $E^3$ for Exploring Embodied Emotion, the first massive first-person view video dataset. $E^3$ contains more than $50$ hours of video, capturing $8$ different emotion types in diverse scenarios and languages. The dataset features videos recorded by individuals in their daily lives, capturing a wide range of real-world emotions conveyed through visual, acoustic, and textual modalities. By leveraging this dataset, we define $4$ core benchmark tasks - emotion recognition, emotion classification, emotion localization, and emotion reasoning - supported by more than $80$k manually crafted annotations, providing a comprehensive resource for training and evaluating emotion analysis models. We further present Emotion-LlaMa, which complements visual modality with acoustic modality to enhance the understanding of emotion in first-person videos. The results of comparison experiments with a large number of baselines demonstrate the superiority of Emotion-LlaMa and set a new benchmark for embodied emotion analysis. We expect that $E^3$ can promote advances in multimodal understanding, robotics, and augmented reality, and provide a solid foundation for the development of more empathetic and context-aware embodied agents.
Supplementary Material: pdf
Submission Number: 1506
Loading