From My View to Yours: Ego-Augmented Learning in Large Vision Language Models for Understanding Exocentric Daily Living Activities

Published: 10 Jan 2025, Last Modified: 14 Jul 2025, arXiv, CC BY 4.0
Abstract: Large Vision Language Models (LVLMs) have demonstrated impressive capabilities in video understanding, yet their adoption for Activities of Daily Living (ADL) remains limited by their inability to capture fine-grained interactions and spatial relationships. To address this, we aim to leverage the complementary nature of egocentric views to enhance LVLMs' understanding of exocentric ADL videos. Consequently, we propose ego2exo knowledge distillation to learn ego-augmented exo representations. While effective, this approach requires paired ego-exo videos, which are impractical to collect at scale. To overcome this, we propose Skeleton-guided Synthetic Ego Generation (SK-EGO), which leverages human skeleton motion to generate synthetic ego views from exocentric videos. To enhance the ego representation of LVLMs trained on synthetic data, we develop a domain-agnostic bootstrapped ego2exo strategy that effectively transfers knowledge from real ego-exo pairs to synthetic ego-exo pairs while mitigating domain misalignment. We find that the exo representations of our ego-augmented LVLMs successfully learn to extract ego-perspective cues, as demonstrated through comprehensive evaluation on six ADL benchmarks and our proposed Ego-in-Exo PerceptionMCQ benchmark, designed specifically to assess egocentric understanding from exocentric videos.
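To make the ego2exo distillation idea concrete, the sketch below shows one plausible way such an objective could be set up: a trainable exocentric (exo) representation is aligned with a frozen egocentric (ego) representation of the same activity. This is a minimal illustration assuming paired per-clip features; the function name, feature dimensions, and the cosine-based loss are assumptions for illustration, not the paper's actual implementation.

```python
# Minimal sketch (not the authors' implementation) of an ego2exo
# knowledge-distillation objective: the exo student representation is
# pulled toward a frozen ego teacher representation of the same activity.
import torch
import torch.nn.functional as F

def ego2exo_distillation_loss(exo_feats: torch.Tensor,
                              ego_feats: torch.Tensor) -> torch.Tensor:
    """Align student exo features with teacher ego features.

    exo_feats: (batch, dim) features from the exo branch (trainable).
    ego_feats: (batch, dim) features from the ego branch (frozen teacher).
    """
    exo = F.normalize(exo_feats, dim=-1)
    ego = F.normalize(ego_feats, dim=-1).detach()  # stop-gradient on teacher
    # 1 - cosine similarity, averaged over the batch
    return (1.0 - (exo * ego).sum(dim=-1)).mean()

if __name__ == "__main__":
    # Hypothetical paired ego-exo clip features (dimension 768 is assumed).
    exo = torch.randn(8, 768, requires_grad=True)
    ego = torch.randn(8, 768)
    loss = ego2exo_distillation_loss(exo, ego)
    loss.backward()
    print(float(loss))
```

In the abstract's setting, the same loss form could also be applied to synthetic ego views produced by SK-EGO, with the bootstrapped strategy deciding which teacher features supervise the synthetic pairs.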