Keywords: Robot perception, Multimodal alignment, Gromov-Wasserstein distance
Abstract: Contrastive objectives such as InfoNCE align multimodal representations at the instance level but do not preserve intra-modal geometry, leaving what we call a structural alignment gap. We propose UniOMA, a multimodal structural alignment method that uses a Gromov-Wasserstein (GW) barycenter regularizer to align each modality to a shared structural consensus and scales linearly in the number of modalities, extending to three or more.
Experiments on five robotic benchmarks covering vision, force, depth, audio, tactile, and proprioceptive modalities show consistent improvements on downstream tasks such as regression, classification, and cross-modal retrieval.
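To make the distinction between instance-level and structural alignment concrete, the following is a minimal NumPy sketch, not the UniOMA implementation. It assumes paired samples across modalities, so the GW coupling is fixed to the identity; under that assumption the GW objective reduces to the mean squared difference between intra-modal distance matrices, and the barycenter for that squared loss reduces to their average. A real GW barycenter would also optimize the couplings. All function names here are hypothetical.

```python
import numpy as np

def pairwise_dist(x):
    """Euclidean distance matrix between the rows of x: (n, d) -> (n, n)."""
    sq = np.sum(x ** 2, axis=1)
    d2 = sq[:, None] + sq[None, :] - 2.0 * x @ x.T
    return np.sqrt(np.maximum(d2, 0.0))  # clip tiny negatives from round-off

def info_nce(za, zb, temperature=0.1):
    """Instance-level contrastive loss: row i of za should match row i of zb."""
    za = za / np.linalg.norm(za, axis=1, keepdims=True)
    zb = zb / np.linalg.norm(zb, axis=1, keepdims=True)
    logits = za @ zb.T / temperature
    logits = logits - logits.max(axis=1, keepdims=True)  # numerical stability
    log_prob = logits - np.log(np.exp(logits).sum(axis=1, keepdims=True))
    return float(-np.mean(np.diag(log_prob)))

def structure_gap(c1, c2):
    """GW objective evaluated at the identity coupling T = I/n (assumption:
    samples are paired). This upper-bounds the true GW discrepancy."""
    return float(np.mean((c1 - c2) ** 2))

def consensus_regularizer(embeddings):
    """Average gap between each modality's geometry and a shared consensus.
    Simplification: with couplings fixed, the consensus is the mean matrix."""
    dists = [pairwise_dist(z) for z in embeddings]
    consensus = np.mean(dists, axis=0)
    return sum(structure_gap(consensus, c) for c in dists) / len(dists)

rng = np.random.default_rng(0)
z = rng.normal(size=(8, 4))
q, _ = np.linalg.qr(rng.normal(size=(4, 4)))  # random orthogonal map
same_geometry = [z, z @ q, z @ q.T]           # rotations keep all distances
diff_geometry = [z, rng.normal(size=(8, 4)), rng.normal(size=(8, 4))]
reg_same = consensus_regularizer(same_geometry)
reg_diff = consensus_regularizer(diff_geometry)
```

In this sketch the regularizer has one term per modality, so M modalities need M comparisons against the consensus rather than M(M-1)/2 pairwise comparisons, which is one way to read the linear scaling claimed above. Rotated copies of the same embedding leave the regularizer near zero because pairwise distances are unchanged, whereas unrelated embeddings give a strictly positive value.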
Submission Number: 52