Homography Estimation With Adaptive Query Transformer and Gated Interaction Module

Zhongyang Li, Faming Fang, Tingting Wang, Guixu Zhang

Published: 05 Apr 2025, Last Modified: 17 Apr 2025
Venue: IEEE TRANSACTIONS ON CIRCUITS AND SYSTEMS FOR VIDEO TECHNOLOGY, VOL. 35, NO. 4, APRIL 2025
License: CC BY 4.0

Abstract: Homography estimation is essential for aligning images captured from different viewpoints by accurately modeling the geometric relationship between them. In homography estimation, global information plays a critical role. To establish global correspondences, cross-attention has been widely used in recent studies. However, vanilla cross-attention mechanisms treat queries in redundant and low-texture areas the same as those in richly textured areas, leading to the accumulation and propagation of erroneous information. We define this phenomenon, where the model excessively attends to queries in redundant and low-texture areas, as query over-focusing. To alleviate query over-focusing and achieve fine-grained homography estimation, we propose a novel homography estimation network, termed AGNet, which integrates an Adaptive Query Transformer (AQFormer) and a Gated Interaction Module (GIM). The AQFormer is designed to dynamically adjust attention by applying a mask to queries, allowing the model to adaptively emphasize feature-rich regions while suppressing redundant or weakly textured areas. Meanwhile, the GIM selectively captures local information by adjusting convolutional kernels based on input, enhancing the extraction of shared features between image pairs. Extensive experiments on various datasets demonstrate that AGNet significantly improves accuracy in homography estimation, particularly in challenging scenarios with low overlap and large viewpoint variations.
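The idea of masking queries in cross-attention can be illustrated with a small NumPy sketch. This is not the AQFormer implementation: the abstract does not say how the mask is computed, so the sketch assumes a soft per-query gate produced by a sigmoid over a linear score of each query feature. The function name `query_gated_cross_attention` and the parameters `w_gate` and `b_gate` are hypothetical and chosen only for illustration.

```python
import numpy as np

def softmax(x, axis=-1):
    # Numerically stable softmax along the given axis.
    x = x - x.max(axis=axis, keepdims=True)
    e = np.exp(x)
    return e / e.sum(axis=axis, keepdims=True)

def query_gated_cross_attention(feat_a, feat_b, w_gate, b_gate=0.0):
    """Cross-attention from image A to image B with a per-query gate.

    feat_a: (Na, d) features of image A, used as queries.
    feat_b: (Nb, d) features of image B, used as keys and values.
    w_gate: (d,) hypothetical weights that score each query (assumption).
    b_gate: scalar bias of the gate (assumption).
    Returns (out, gate) with shapes (Na, d) and (Na,).
    """
    d = feat_a.shape[1]
    # Vanilla scaled dot-product cross-attention: every query is treated alike.
    scores = feat_a @ feat_b.T / np.sqrt(d)       # (Na, Nb)
    attended = softmax(scores, axis=-1) @ feat_b  # (Na, d)
    # Assumed soft mask in (0, 1): a low value suppresses the update for a
    # query, as one might want for low-texture or redundant regions.
    gate = 1.0 / (1.0 + np.exp(-(feat_a @ w_gate + b_gate)))  # (Na,)
    # Residual update scaled per query by the gate.
    out = feat_a + gate[:, None] * attended
    return out, gate

# Usage with random features standing in for two image feature maps.
rng = np.random.default_rng(0)
feat_a = rng.normal(size=(5, 8))
feat_b = rng.normal(size=(7, 8))
out, gate = query_gated_cross_attention(feat_a, feat_b, w_gate=rng.normal(size=8))
print(out.shape, gate.shape)
```

With a gate near zero, a query passes through unchanged and contributes no cross-image information; with a gate near one, the layer behaves like ordinary residual cross-attention. In a trained network the gate parameters would be learned, and the published model may use a different masking scheme altogether.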