Abstract: Anchor-based detectors have recently achieved decent performance in multimodal remote sensing scenarios, whereas their anchor-free counterparts fail to reach comparable results. To remedy this problem, we first comprehensively investigate the misalignment issues in multimodal features and detection heads, and then present a dual-perspective alignment learning (DPAL) framework for multimodal remote sensing object detection. In particular, we design a cross-modal alignment module (CMAM), which employs a multiscale dilation strategy and a differentiable alignment function with channel-wise modulation for cross-modal feature integration. In addition, to cope with the misalignment between the regression and classification heads, we propose a task-head alignment module (THAM). It introduces a novel pseudo-anchor mechanism, adopts a semi-fixed offset generation strategy to capture task-variant sampling coordinates, and finally deploys an offset knowledge transfer mechanism with deformable alignment for anchor-free detection heads. Extensive experiments on four multimodal object detection datasets demonstrate the impressive performance of the proposed DPAL framework. The project code is released at https://github.com/lyf0801/DPAL
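To make the cross-modal fusion idea concrete, the following is a minimal sketch (not the authors' released code) of channel-wise modulation between two modality feature maps, in the spirit of the CMAM described above. All function names, tensor shapes, and the gating formulation are illustrative assumptions; see the linked repository for the actual implementation.

```python
# Hedged sketch: channel-wise modulated fusion of two modality features.
# The gating scheme and shapes here are assumptions for illustration only.
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

def channel_modulated_fusion(feat_rgb, feat_ir):
    """Fuse two (C, H, W) feature maps from different modalities.

    A per-channel gate is computed from globally pooled statistics of both
    modalities; each modality is re-weighted channel-wise before summation.
    """
    # Global average pooling over spatial dims -> (C,)
    pooled = feat_rgb.mean(axis=(1, 2)) + feat_ir.mean(axis=(1, 2))
    gate = sigmoid(pooled)[:, None, None]  # (C, 1, 1) channel-wise gate
    # Channel-wise modulation: gated blend of the two modalities
    return gate * feat_rgb + (1.0 - gate) * feat_ir

rgb = np.random.rand(8, 16, 16)   # e.g. optical feature map
ir = np.random.rand(8, 16, 16)    # e.g. infrared feature map
fused = channel_modulated_fusion(rgb, ir)
print(fused.shape)  # (8, 16, 16)
```

The fused map keeps the input resolution, so it can feed the detection heads directly; in the full framework this fusion would sit after the multiscale dilated feature extraction.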