Degeneracy-Aware Scene Coordinate Regression via Cross-Modal Knowledge Transfer

Yue Yao

Published: 29 Apr 2026, Last Modified: 05 May 2026 | OpenReview Archive Direct Upload | Everyone | CC BY 4.0

Abstract: Global pose estimation from sparse LiDAR point clouds frequently fails in environments that lack distinct geometric constraints. Scene Coordinate Regression (SCR) provides a robust mathematical formulation for mapping 3D geometry to global coordinates, but standard implementations use rigid, monolithic architectures. This structural inflexibility often causes capacity interference when a single network is optimized across both highly structured urban areas and feature-poor, degenerate zones. To address this, we introduce a localization framework that conditions 3D coordinate regression on semantic scene context. By querying pre-trained Vision-Language Models (VLMs) on temporally selected keyframes, we extract open-vocabulary descriptors of environmental layout and structural ambiguity. These multimodal priors then supervise a dynamic Mixture-of-Experts (MoE) network that routes 3D spatial features through specialized pathways according to local scene complexity. To keep deployment efficient, the VLM-derived contextual knowledge is distilled directly into the LiDAR backbone, so the system operates entirely on point-cloud data at deployment, with no auxiliary sensor overhead. Evaluations on the NCLT and Oxford RobotCar benchmarks indicate that this prior-guided routing substantially reduces outlier predictions and yields a localization pipeline that remains resilient under severe geometric degradation.
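
The routing idea described in the abstract can be illustrated with a minimal NumPy sketch. It is not the paper's implementation: the function names, the linear experts, the softmax gate, and the cross-entropy form of the supervision are all assumptions made for illustration. Here `teacher_probs` stands in for a per-point scene label distribution derived offline from VLM descriptors; it is needed only to compute the training loss, so the forward pass uses LiDAR features alone.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def moe_scene_coordinates(features, gate_w, expert_w):
    """Soft Mixture-of-Experts regression of 3D scene coordinates.

    features: (N, D) per-point LiDAR features (hypothetical backbone output)
    gate_w:   (D, E) gating weights
    expert_w: (E, D, 3) one linear expert per scene type (an assumption;
              real experts would be learned non-linear heads)
    """
    gate = softmax(features @ gate_w)                          # (N, E)
    per_expert = np.einsum("nd,edk->nek", features, expert_w)  # (N, E, 3)
    coords = np.einsum("ne,nek->nk", gate, per_expert)         # (N, 3)
    return coords, gate

def gate_distillation_loss(gate, teacher_probs, eps=1e-9):
    """Cross-entropy pulling the LiDAR-only gate toward a teacher
    distribution (assumed here to come from VLM scene descriptors)."""
    return float(-(teacher_probs * np.log(gate + eps)).sum(axis=1).mean())

# Toy usage: 4 points, 8-dim features, 2 experts
# (for example "structured" vs. "degenerate").
rng = np.random.default_rng(0)
feats = rng.normal(size=(4, 8))
gate_w = np.zeros((8, 2))                  # untrained gate: uniform routing
expert_w = rng.normal(size=(2, 8, 3))
coords, gate = moe_scene_coordinates(feats, gate_w, expert_w)
teacher = np.array([[1.0, 0.0], [1.0, 0.0], [0.0, 1.0], [0.0, 1.0]])
loss = gate_distillation_loss(gate, teacher)
```

In this sketch the teacher signal enters only through the loss, which mirrors the abstract's claim that no auxiliary sensor is required at deployment: once the gate is trained, `moe_scene_coordinates` needs nothing beyond point-cloud features.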