Abstract: Visual localization aims to predict the absolute camera pose for a single query image. However, predominant methods focus on single-camera images and scenes with limited appearance variations, restricting their applicability to the cross-domain scenes commonly encountered in real-world applications. Furthermore, the long-tail distribution of cross-domain datasets poses additional challenges for visual localization. In this work, we propose a novel cross-domain data generation method to enhance visual localization methods. To achieve this, we first construct a cross-domain 3D Gaussian Splatting (3DGS) representation to accurately model photometric variations and mitigate the interference of dynamic objects in large-scale scenes. We introduce a text-guided image editing model, together with an effective fine-tuning strategy for it, to enhance data diversity and address the long-tail distribution problem. We then develop an anchor-based method to generate high-quality datasets for visual localization. Finally, we introduce positional attention to resolve data ambiguities in cross-camera images. Extensive experiments show that our method achieves state-of-the-art accuracy, outperforming existing cross-domain visual localization methods by an average of 59% across all domains. Project page: https://yzwang-sjtu.github.io/CDG-Loc.
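The abstract does not detail the positional attention design, so the following is only a minimal, hypothetical PyTorch sketch of one common way to make attention position-aware: learned positional embeddings are added to queries and keys so that patches that look alike but sit at different image locations receive different attention scores. All names here (`PositionalAttention`, `pos_embed`) are illustrative assumptions, not the paper's implementation.

```python
import torch
import torch.nn as nn


class PositionalAttention(nn.Module):
    """Self-attention whose scores are biased by learned positional
    embeddings, so visually similar patches from different cameras
    can still be disambiguated by where they appear in the image.
    (Hypothetical sketch; not the paper's exact formulation.)"""

    def __init__(self, dim: int, num_patches: int, num_heads: int = 8):
        super().__init__()
        # One learned embedding per patch position, broadcast over the batch.
        self.pos_embed = nn.Parameter(torch.zeros(1, num_patches, dim))
        self.attn = nn.MultiheadAttention(dim, num_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (batch, num_patches, dim) patch features of a query image.
        # Injecting position into queries and keys makes the attention map
        # position-aware, reducing ambiguity between look-alike regions.
        q = k = x + self.pos_embed
        out, _ = self.attn(q, k, x)
        return out


# Usage: feats = torch.randn(2, 196, 256); out = PositionalAttention(256, 196)(feats)
```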
Lay Summary: Teaching agents to understand where a photo was taken — known as visual localization — is crucial for technologies like augmented reality and robot navigation. However, most current methods only work well with images from a single type of camera and in controlled environments. In real-world scenarios, images often vary due to differences in lighting, camera devices, or moving objects, making the task much more challenging.
To address this, we developed a new approach that generates more diverse and realistic training data, helping agents better cope with such variations. First, we build a 3D scene model that captures changes in lighting and filters out distractions like people or vehicles. Next, we use a text-guided image editing tool to synthesize the rare images that are often missing from existing datasets. Finally, we design an anchor-based strategy to produce high-quality, diverse datasets and introduce a position-aware attention mechanism to resolve ambiguities in images captured across different devices.
With these improvements, our method achieves state-of-the-art performance in cross-domain visual localization, paving the way for its practical application in complex real-world environments.
Primary Area: Applications->Computer Vision
Keywords: Cross-Domain Visual Localization; Image Generation; 3D Gaussian Splatting; Image Editing
Submission Number: 9987