Abstract: In this paper, we propose a new Robust Disentangled Counterfactual Learning (RDCL) approach for physical audiovisual
commonsense reasoning. The task aims to infer objects’ physics commonsense based on both video and audio input, with the main
challenge being how to imitate human reasoning ability, even when some modalities are missing. Most current methods fail to exploit the distinct characteristics of multi-modal data, and their lack of causal reasoning ability hampers the inference of implicit physical knowledge. To address these issues, our proposed RDCL method decouples videos into static
(time-invariant) and dynamic (time-varying) factors in the latent space by the disentangled sequential encoder, which adopts a variational
autoencoder (VAE) to maximize the mutual information with a contrastive loss function. Furthermore, we introduce a counterfactual
learning module to augment the model’s reasoning ability by modeling physical knowledge relationships among different objects under
counterfactual intervention. To alleviate the issue of incomplete modality data, we introduce a robust multimodal learning method that recovers the missing data by decomposing the shared and modality-specific features. Our proposed method is a plug-and-play module that can be incorporated into any baseline, including VLMs. In experiments, we show that our proposed method improves the reasoning accuracy and robustness of baseline methods and achieves state-of-the-art performance. Our code and data are available at
https://github.com/MICLAB-BUPT/DCL.
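
For readers who want a concrete picture of the disentangled sequential encoder described above, the following minimal PyTorch sketch illustrates the idea: a sequential VAE whose latent space is split into a time-invariant static factor and time-varying dynamic factors, paired with an InfoNCE-style contrastive term as a lower bound on mutual information. All module names, dimensions, and the specific contrastive formulation here are illustrative assumptions, not the authors' implementation (see the repository above for that).

```python
# Illustrative sketch (assumptions, not the authors' code): a sequential VAE
# that splits the latent space into a static (time-invariant) factor s and
# dynamic (time-varying) factors d_1..d_T.
import torch
import torch.nn as nn


class DisentangledSequentialEncoder(nn.Module):
    def __init__(self, feat_dim=512, static_dim=128, dynamic_dim=128, hidden_dim=256):
        super().__init__()
        # Bi-directional context over the per-frame (or per-clip) features.
        self.context_rnn = nn.LSTM(feat_dim, hidden_dim, batch_first=True, bidirectional=True)
        # Static posterior q(s | x_{1:T}) from the pooled sequence.
        self.static_mu = nn.Linear(2 * hidden_dim, static_dim)
        self.static_logvar = nn.Linear(2 * hidden_dim, static_dim)
        # Dynamic posterior q(d_t | x_{1:T}) per time step.
        self.dynamic_mu = nn.Linear(2 * hidden_dim, dynamic_dim)
        self.dynamic_logvar = nn.Linear(2 * hidden_dim, dynamic_dim)

    @staticmethod
    def reparameterize(mu, logvar):
        # Standard VAE reparameterization: z = mu + sigma * eps.
        std = torch.exp(0.5 * logvar)
        return mu + std * torch.randn_like(std)

    def forward(self, x):  # x: (B, T, feat_dim) video or audio features
        h, _ = self.context_rnn(x)               # (B, T, 2*hidden_dim)
        pooled = h.mean(dim=1)                   # (B, 2*hidden_dim)
        s = self.reparameterize(self.static_mu(pooled), self.static_logvar(pooled))
        d = self.reparameterize(self.dynamic_mu(h), self.dynamic_logvar(h))
        return s, d                              # (B, static_dim), (B, T, dynamic_dim)


def contrastive_mi_loss(static_a, static_b, temperature=0.1):
    # InfoNCE-style term: static factors of two views of the same object are
    # positives, other samples in the batch are negatives; this acts as a
    # lower bound on the mutual information between the two views.
    a = nn.functional.normalize(static_a, dim=-1)
    b = nn.functional.normalize(static_b, dim=-1)
    logits = a @ b.t() / temperature             # (B, B) similarity matrix
    labels = torch.arange(a.size(0), device=a.device)
    return nn.functional.cross_entropy(logits, labels)
```

A typical usage would encode the video and audio streams with separate instances of this encoder and apply the contrastive term to their static factors, so that the time-invariant object properties shared across modalities are pulled together while dynamic, modality-specific variation remains in the per-step factors.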