DeMo: Deep Motion Field Consensus with Learnable Kernels for Two-view Correspondence Learning

Yifan Lu, Jiajun Le, Zizhuo Li, Yixuan Yuan, Jiayi Ma

Published: 01 Jan 2025, Last Modified: 31 Jul 2025, AAAI 2025, CC BY-SA 4.0

Abstract: As a long-range prior, motion consensus requires the overall spatial transformation between a pair of images to be smooth and consistent, which makes it naturally well suited to two-view correspondence learning. However, this valuable property remains under-explored in most existing studies because the sparsity and uneven distribution of putative correspondences make it difficult to model. In this paper, we propose DeMo, a novel network for outlier rejection that captures global motion consensus cues through consensus interpolation over the entire high-dimensional motion field generated by putative correspondences. Specifically, by incorporating regularization techniques into a Reproducing Kernel Hilbert Space (RKHS), we derive a concise interpolation formula for the high-dimensional motion field that inherently admits a closed-form solution. Learnable deep kernels are then used to flexibly and efficiently capture the relationships between global inputs, thereby maintaining consensus across the entire motion field. In addition, to remedy the cubic computational overhead of explicit interpolation, we introduce a scene-adaptive sampling strategy that implicitly selects the more scene-representative motions, reducing the computational complexity of motion consensus interpolation to approximately linear while maintaining accuracy. Moreover, to handle depth discontinuities caused by complicated scene variations, we design a local consensus complementation block that maintains local bilateral consensus across both feature and spatial channels. Without bells and whistles, DeMo achieves superior performance in various geometric tasks, including relative pose estimation, homography estimation, and visual localization.
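To make the closed-form interpolation idea concrete, here is a minimal sketch of regularized kernel interpolation of a motion field. It is not DeMo itself: it assumes a plain 2-D motion field (target position minus source position), a fixed Gaussian kernel in place of the learnable deep kernels, and uniform random subsampling as a stand-in for the scene-adaptive sampling strategy. The function names and the parameters `lam` and `sigma` are hypothetical and chosen only for illustration.

```python
import numpy as np

def gaussian_kernel(a, b, sigma=0.5):
    """Gram matrix with entries exp(-||a_i - b_j||^2 / (2 sigma^2))."""
    d2 = ((a[:, None, :] - b[None, :, :]) ** 2).sum(-1)
    return np.exp(-d2 / (2.0 * sigma ** 2))

def fit_motion_field(x, m, lam=0.1, sigma=0.5):
    """Closed-form regularized fit: solve (K + lam * I) C = M for C.

    The N x N solve is the cubic-cost step for N control points.
    """
    K = gaussian_kernel(x, x, sigma)
    return np.linalg.solve(K + lam * np.eye(len(x)), m)

def interpolate(x_query, x_ctrl, coef, sigma=0.5):
    """Evaluate f(x) = sum_j k(x, x_j) c_j at the query positions."""
    return gaussian_kernel(x_query, x_ctrl, sigma) @ coef

# Synthetic putative correspondences (assumed data, for illustration only).
rng = np.random.default_rng(0)
n, n_out = 200, 20
x = rng.uniform(-1, 1, size=(n, 2))              # keypoints in image 1
m = 0.1 * x + np.array([0.3, -0.2])              # smooth, consistent motion
m[:n_out] = rng.uniform(-1, 1, size=(n_out, 2))  # first 20 are outliers

# Full interpolation: every correspondence is a control point.
coef = fit_motion_field(x, m)
residual = np.linalg.norm(m - interpolate(x, x, coef), axis=1)

# Subsampled interpolation: fit on L << N control points, evaluate on all N.
# Random sampling here is only a placeholder for a learned sampling scheme.
idx = rng.choice(n, size=50, replace=False)
coef_s = fit_motion_field(x[idx], m[idx])
residual_s = np.linalg.norm(m - interpolate(x, x[idx], coef_s), axis=1)

print("mean residual, outliers:", residual[:n_out].mean())
print("mean residual, inliers: ", residual[n_out:].mean())
```

Because the fitted field is smooth, correspondences whose motion disagrees with their surroundings leave a large residual, which is the consensus signal an outlier-rejection method can exploit. With L control points the solve costs on the order of L^3 and the evaluation on the order of N·L, so keeping L small and fixed makes the overall cost grow roughly linearly in N.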