Abstract: Matching and aligning ground and aerial images are critical for enhancing the accuracy and completeness of 3-D reconstruction. However, significant differences in perspective and radiometric characteristics between aerial and ground images make this task highly challenging. Existing mesh-based approaches often overlook the geometric properties of 3-D points in the structure-from-motion model and suffer from limited track length. To address these issues, we propose a 3-D point-guided matching framework that leverages reconstructed 3-D points to guide the matching between aerial and ground images. Our method introduces a 3-D point-guided transformer to encode point coordinates into embeddings and integrate them into image features, enabling effective correspondence between synthetic aerial views and real ground images. In addition, we design a Transformer-based regression module to refine matching positions within local windows, improving the accuracy of aerial–ground correspondences. Our pipeline reduces matching errors, enables long-track correspondences, and facilitates robust multiview integration. Furthermore, we construct two challenging aerial–ground datasets to validate the effectiveness of our method in city-scale 3-D reconstruction. Extensive experiments on public benchmarks and our datasets demonstrate that our framework significantly outperforms state-of-the-art methods in both matching accuracy and reconstruction quality.
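The idea of encoding 3-D point coordinates into embeddings and integrating them into image features can be illustrated with a small sketch. This is not the paper's method: the abstract describes a learned 3-D point-guided transformer, whereas the code below swaps in a fixed sinusoidal encoding and plain element-wise addition as a stand-in. The function names `encode_point` and `fuse`, the parameter `num_freqs`, and the feature length of 24 are all hypothetical choices made for illustration.

```python
import math


def encode_point(xyz, num_freqs=4):
    """Fixed sinusoidal embedding of one 3-D point (assumed stand-in for a
    learned encoder): sin and cos at num_freqs octaves for each axis."""
    emb = []
    for coord in xyz:
        for k in range(num_freqs):
            freq = 2.0 ** k
            emb.append(math.sin(freq * coord))
            emb.append(math.cos(freq * coord))
    return emb  # length is 3 axes * 2 functions * num_freqs


def fuse(feature, embedding):
    """Integrate the point embedding into an image feature vector by
    element-wise addition (an assumed, simplest-possible fusion)."""
    if len(feature) != len(embedding):
        raise ValueError("feature and embedding must have the same length")
    return [f + e for f, e in zip(feature, embedding)]


# Hypothetical input: one pixel feature of length 24 and the 3-D point
# that the structure-from-motion model associates with that pixel.
point = (0.0, 0.0, 0.0)
feature = [0.5] * 24
fused = fuse(feature, encode_point(point))
```

In a real pipeline the fused features would feed a matching stage; here the sketch only shows how geometric information about a reconstructed point can be attached to the appearance feature at the pixel where that point projects.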
External IDs: doi:10.1109/jstars.2025.3616417