Reliability-Guided Gradient Correction for Visible-Infrared Object Detection

ICLR 2026 Conference Submission 16961 Authors

19 Sept 2025 (modified: 08 Oct 2025), ICLR 2026 Conference Submission, CC BY 4.0
Keywords: Visible-Infrared Object Detection, Semantic Conflicts, Modality Reliability, Gradient Correction, Parameter Attribution
TL;DR: We propose RaGrad, a model-agnostic method that mitigates cross-modal semantic conflicts in visible-infrared object detection via reliability-guided gradient correction.
Abstract: Visible-infrared object detection has attracted increasing attention for its ability to fuse complementary information from visible and infrared sensors. While such fusion improves detection accuracy and robustness, it remains vulnerable to semantic conflicts caused by inconsistent object representations across modalities. Existing works typically address these conflicts by aligning cross-modal features or adjusting modality weights with heuristic cues. However, they often overlook modality reliability, i.e., how well each modality captures object-relevant information, and therefore suffer performance drops when unreliable features are used. To address this, we introduce RaGrad, a model-agnostic method for $\textbf{r}$eli$\textbf{a}$bility-guided $\textbf{grad}$ient correction that mitigates cross-modal semantic conflicts. Specifically, we first propose the $\textbf{r}$eliability $\textbf{e}$stimation via $\textbf{p}$arameter $\textbf{a}$ttribution (REPA) module, which estimates the reliability of modality-specific parameters by evaluating their effectiveness via counterfactual reasoning and their sensitivity via gradient variation. Second, we propose the $\textbf{r}$eliability-$\textbf{g}$uided $\textbf{c}$onflict $\textbf{r}$esolution (RGCR) module, which resolves cross-modal conflicts by correcting the gradients of the less reliable modality under the guidance of the more reliable one, thereby promoting the learning of reliable features and enhancing cross-modal consistency. Extensive experiments on three challenging datasets demonstrate the efficacy and generalizability of RaGrad, which consistently improves performance across various baselines.
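To make the gradient-correction idea concrete, below is a minimal sketch of reliability-guided conflict resolution. It rests on two assumptions not specified in the abstract: reliability is approximated here by an inverse-gradient-variance proxy (a stand-in for REPA's effectiveness and sensitivity scores), and conflicts are resolved with a PCGrad-style orthogonal projection of the less reliable modality's gradients. The function names (`reliability_score`, `correct_gradients`) are illustrative, not from the paper.

```python
import torch

def reliability_score(grads):
    """Hypothetical reliability proxy: inverse variance of a modality's
    gradients (a stand-in for REPA's attribution-based scores)."""
    flat = torch.cat([g.flatten() for g in grads])
    return 1.0 / (flat.var() + 1e-8)

def correct_gradients(grads_vis, grads_ir):
    """Correct the less reliable modality's gradients under the guidance
    of the more reliable one. Conflicts (negative dot product) are removed
    with a PCGrad-style projection; this illustrates the correction idea,
    not necessarily the paper's exact update rule."""
    r_vis = reliability_score(grads_vis)
    r_ir = reliability_score(grads_ir)
    weak, strong = (grads_ir, grads_vis) if r_vis >= r_ir else (grads_vis, grads_ir)
    corrected = []
    for g_w, g_s in zip(weak, strong):
        dot = (g_w * g_s).sum()
        if dot < 0:  # gradients point in conflicting directions
            g_w = g_w - dot / (g_s.norm() ** 2 + 1e-12) * g_s
        corrected.append(g_w)
    return corrected
```

In a detector with modality-specific branches, `grads_vis` and `grads_ir` would be the per-parameter gradients of the visible and infrared branches collected after the backward pass, with the corrected gradients written back before the optimizer step.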
Supplementary Material: zip
Primary Area: applications to computer vision, audio, language, and other modalities
Submission Number: 16961