Mitigating Uni-modal Sensory Bias in Multimodal Object Detection with Counterfactual Intervention and Causal Mode Multiplexing

22 Sept 2023 (modified: 25 Mar 2024) · ICLR 2024 Conference Withdrawn Submission
Keywords: Multimodal Object Detection, Uni-modal Sensory Bias, Causal Mode Multiplexing
Abstract: Multimodal object detection using RGB and thermal sensors (RGBT) has emerged as a promising solution for safety-critical vision applications that require uninterrupted day-and-night operation. However, multimodal object detection still faces unsolved issues, including uni-modal sensory bias, where models come to rely on one modality over the other instead of performing genuine multimodal reasoning. Our analysis shows that training correlation-based, symmetrical fusion topologies on differential multimodal data (i.e., RXTO) provokes this skewed preference. To address this problem, we propose a novel Causal Mode Multiplexing (CMM) framework built on the tools of counterfactual intervention. Unlike the symmetrical fusion topology of existing methods, the proposed approach leverages two distinct causal graphs selected by input data type: counterfactual intervention is performed on differential inputs (RXTO, ROTX), while the total effect of the symmetrical fusion topology is learned for common inputs (ROTO). We then propose a Causal Mode Multiplexing (CMM) loss to optimize the interchange between the two causal graphs. Overall, the CMM framework learns the causal links between multimodal inputs and predictions, eliminating uni-modal sensory bias. To assess the effectiveness of CMM, we introduce the ROTX Multispectral Pedestrian (ROTX-MPed) dataset, which we will release publicly; it mainly contains counterexamples that are absent from existing datasets. Extensive experiments on KAIST, CVC-14, FLIR, and our ROTX-MPed dataset demonstrate that the CMM framework effectively learns multimodal reasoning and generalizes well to ROTX test data while training only on conventional ROTO and RXTO data.
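To make the multiplexing idea concrete, below is a minimal, hypothetical sketch of how the two causal graphs might be switched by input type. It is not the authors' implementation: the module names (CMMHead, cmm_loss), the classification-style head, and the use of a standard total-effect-minus-natural-direct-effect (TE − NDE) counterfactual decomposition, as commonly done in counterfactual debiasing work, are all assumptions made for illustration; the paper's exact formulation may differ.

```python
# Hypothetical sketch of Causal Mode Multiplexing (CMM); all names are
# illustrative assumptions, not the authors' released code.
import torch
import torch.nn as nn
import torch.nn.functional as F

class CMMHead(nn.Module):
    """Two causal graphs in one head:
    - a symmetrical fusion branch whose output is the total effect (TE),
      used as-is for common ROTO inputs;
    - uni-modal branches that estimate the natural direct effect (NDE)
      of each single modality, subtracted from TE on differential
      (RXTO / ROTX) inputs so the model cannot lean on one sensor alone."""
    def __init__(self, feat_dim: int, num_classes: int):
        super().__init__()
        self.fusion = nn.Linear(2 * feat_dim, num_classes)    # symmetrical fusion
        self.rgb_only = nn.Linear(feat_dim, num_classes)      # uni-modal RGB branch
        self.thermal_only = nn.Linear(feat_dim, num_classes)  # uni-modal thermal branch

    def forward(self, f_rgb, f_thermal, differential: bool):
        te = self.fusion(torch.cat([f_rgb, f_thermal], dim=-1))  # total effect
        if not differential:
            return te  # ROTO: learn the full symmetrical fusion topology
        # RXTO / ROTX: counterfactual intervention removes the uni-modal
        # shortcut, keeping only the (total) indirect, fused effect.
        nde = self.rgb_only(f_rgb) + self.thermal_only(f_thermal)
        return te - nde

def cmm_loss(head, f_rgb, f_thermal, labels, differential: bool):
    """Multiplexes between the two causal graphs according to input type
    and applies an ordinary supervised loss to the selected output."""
    logits = head(f_rgb, f_thermal, differential)
    return F.cross_entropy(logits, labels)

# Usage sketch: ROTO batches take the TE path, RXTO/ROTX batches the
# intervened path, so one network alternates between both graphs.
head = CMMHead(feat_dim=256, num_classes=2)
f_rgb, f_thermal = torch.randn(8, 256), torch.randn(8, 256)
labels = torch.randint(0, 2, (8,))
loss = cmm_loss(head, f_rgb, f_thermal, labels, differential=True)
loss.backward()
```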
Primary Area: representation learning for computer vision, audio, language, and other modalities
Code Of Ethics: I acknowledge that I and all co-authors of this work have read and commit to adhering to the ICLR Code of Ethics.
Submission Guidelines: I certify that this submission complies with the submission instructions as described on https://iclr.cc/Conferences/2024/AuthorGuide.
Anonymous Url: I certify that there is no URL (e.g., github page) that could be used to find authors' identity.
No Acknowledgement Section: I certify that there is no acknowledgement section in this submission for double blind review.
Submission Number: 5233