From Satellite to Street: Semantic and Depth Information for Enhanced Geo-Localization

Published: 2025, Last Modified: 06 Mar 2026, IROS 2025, CC BY-SA 4.0
Abstract: Accurate positioning is essential for autonomous driving, but localization using 2D maps is challenging because of the domain gap between the perspective view and the 2D map, while GNSS accuracy is often limited by atmospheric effects, multipath, and signal blockages. We propose a novel positioning method that combines perspective-view images with satellite images retrieved based on rough GNSS positions to achieve precise three-degree-of-freedom (3-DoF) pose estimation. Our method uses the Swin Transformer for satellite image processing and semantic completion for monocular image analysis. We extract depth and semantic information from monocular images and convert it to overhead projections, bridging the gap between the two viewpoints. This cross-view transformation allows features from monocular images to be aligned precisely with semantically enriched satellite images. Additionally, we integrate a robust global position estimator that uses the semantic information from satellite images to further enhance accuracy and robustness. Experimental results demonstrate that our method performs well in various complex scenarios: the share of position estimates within 1 m rises to 80.67%, and the share of heading estimates within 1° rises to 33.78%. However, longitudinal localization remains more challenging, with higher errors than lateral positioning.
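The conversion of depth and semantic information into an overhead projection can be illustrated with a minimal sketch. This is not the authors' implementation: it assumes a pinhole camera with a level optical axis, and the function name `project_to_bev` and the parameters `fx`, `cx`, `grid_size`, and `cell_m` are hypothetical names chosen for illustration. Each pixel's depth is back-projected to a lateral and forward offset, and its semantic label is written into the matching cell of a top-down grid.

```python
import numpy as np

def project_to_bev(depth, semantics, fx, cx, grid_size=64, cell_m=0.5):
    """Scatter per-pixel semantic labels into a top-down grid using depth.

    depth:     (H, W) metric depth along the optical axis, in metres.
    semantics: (H, W) integer class labels (0 is treated as "unknown").
    fx, cx:    assumed pinhole focal length and principal point (pixels).
    Returns a (grid_size, grid_size) label grid with the camera at the
    bottom-centre; rows nearer the top are farther ahead.
    """
    h, w = depth.shape
    u = np.tile(np.arange(w), (h, 1))        # pixel column index per pixel
    z = depth                                # forward distance (m)
    x = (u - cx) * z / fx                    # lateral offset (m), right positive

    # Convert metric offsets to grid indices.
    col = np.floor(x / cell_m).astype(int) + grid_size // 2
    row = grid_size - 1 - np.floor(z / cell_m).astype(int)

    # Keep only pixels with positive depth that land inside the grid.
    valid = (z > 0) & (col >= 0) & (col < grid_size) \
            & (row >= 0) & (row < grid_size)

    bev = np.zeros((grid_size, grid_size), dtype=semantics.dtype)
    # If several pixels fall in one cell, the last one written wins.
    bev[row[valid], col[valid]] = semantics[valid]
    return bev
```

A grid produced this way shares the top-down viewpoint of a satellite tile, which is what makes cross-view alignment possible; a practical system would also need to handle camera pitch, occlusion, and depth noise, none of which this sketch addresses.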