Abstract: Visual localization relies on local feature detectors and descriptors to establish reliable correspondences across views. However, existing pipelines typically assume symmetry: the same backbone and feature extractor are used for both queries and maps. This assumption is impractical for real-world deployment: query-side models must be lightweight to run in real time on constrained devices, whereas map construction can exploit arbitrarily heavy models offline. This asymmetric setting calls for cross-model compatibility between features rather than uniform processing. While recent work has explored asymmetry for global image retrieval, asymmetry in the local detector–descriptor pipeline remains unexplored.
We propose $\textbf{AsymLoc}$, the first framework for asymmetric visual localization.
AsymLoc couples detectors and descriptors through a matching-based consistency loss. Rather than distilling detectors and descriptors separately, AsymLoc supervises the student with the teacher's correspondences, enforcing agreement on which keypoints should match across views. This cross-model matching supervision jointly aligns detection and description, ensuring that the student learns features that remain compatible with the teacher during asymmetric localization.
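To make the descriptor side of this cross-model matching supervision concrete, the sketch below shows one plausible instantiation, assuming teacher matches are obtained via mutual nearest neighbors and agreement is enforced with a symmetric InfoNCE-style loss at a fixed temperature; the function and parameter names (`matching_consistency_loss`, `temperature`) are illustrative assumptions, not the paper's actual formulation, and the detection-alignment term is omitted.

```python
import torch
import torch.nn.functional as F

def mutual_nearest_neighbors(desc_a, desc_b):
    """Indices (i, j) of mutual nearest-neighbor pairs between two
    L2-normalized descriptor sets of shape (Na, D) and (Nb, D)."""
    sim = desc_a @ desc_b.t()                      # (Na, Nb) cosine similarities
    nn_ab = sim.argmax(dim=1)                      # best match in B for each A
    nn_ba = sim.argmax(dim=0)                      # best match in A for each B
    idx_a = torch.arange(desc_a.shape[0], device=desc_a.device)
    mutual = nn_ba[nn_ab] == idx_a                 # keep only mutual agreements
    return idx_a[mutual], nn_ab[mutual]

def matching_consistency_loss(student_desc_a, student_desc_b,
                              teacher_desc_a, teacher_desc_b,
                              temperature=0.1):
    """Hypothetical cross-model matching supervision: the teacher's mutual-NN
    matches between views A and B define which keypoint pairs should match;
    the student is trained so that its descriptor similarities reproduce them."""
    # Teacher decides the target correspondences (no gradients flow to it).
    with torch.no_grad():
        ia, ib = mutual_nearest_neighbors(
            F.normalize(teacher_desc_a, dim=-1),
            F.normalize(teacher_desc_b, dim=-1))

    # Student similarities over the same keypoint sets.
    sa = F.normalize(student_desc_a, dim=-1)
    sb = F.normalize(student_desc_b, dim=-1)
    logits = sa @ sb.t() / temperature             # (Na, Nb)

    # For each teacher-matched keypoint in A, the student should rank the
    # teacher-chosen keypoint in B highest (and vice versa).
    loss_ab = F.cross_entropy(logits[ia], ib)
    loss_ba = F.cross_entropy(logits.t()[ib], ia)
    return 0.5 * (loss_ab + loss_ba)
```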
Experiments on standard localization benchmarks demonstrate that with AsymLoc, we can deploy a model that is $20\times$ smaller at inference time while achieving near-teacher accuracy at a fraction of the compute cost, substantially outperforming symmetric lightweight baselines.
Submission Number: 8