O-MaMa: Learning Object Mask Matching between Egocentric and Exocentric Views

Lorenzo Mur-Labadia, Maria Santos-Villafranca, Jesus Bermudez-cameo, Alejandro Perez-Yus, Ruben Martinez-Cantin, Jose J Guerrero

Published: 18 Oct 2025, Last Modified: 07 May 2026IEEE/CVF International Conference on Computer Vision (ICCV), 2025EveryoneCC BY 4.0

Abstract: Understanding the world from multiple perspectives is essential for intelligent systems operating together, where segmenting common objects across different views remains an open problem. We introduce a new approach that redefines cross-image segmentation by treating it as a mask matching task. Our method consists of: (1) A Mask-Context Encoder that pools dense DINOv2 semantic features to obtain discriminative object-level representations from FastSAM mask candidates, (2) an Ego↔Exo Cross-Attention that fuses multi-perspective observations, (3) a Mask Matching contrastive loss that aligns cross-view features in a shared latent space, and (4) a Hard Negative Adjacent Mining strategy to encourage the model to better differentiate between nearby objects. O-MaMa achieves the state of the art in the Ego-Exo4D Correspondences benchmark, obtaining relative gains of +22 % and +76 % in the Ego2Exo and Exo2Ego IoU against the official challenge baselines, and a +13 % and +6 % compared with the SOTA with 1 % of the training parameters