Probabilistic Distillation Transformer: Modelling Uncertainties for Visual Abductive Reasoning

Published: 20 Jul 2024, Last Modified: 21 Jul 2024 · MM2024 Poster · CC BY 4.0
Abstract: Visual abductive reasoning aims to find the most plausible explanation for incomplete observations, and it suffers from inherent uncertainties and ambiguities, which stem mainly from latent causal relations, incomplete observations, and the reasoning process itself. To address this, we propose a probabilistic model named the Uncertainty-Guided Probabilistic Distillation Transformer (UPD-Trans) to model uncertainties for Visual Abductive Reasoning. To better discover the correct cause-effect chain, we model all potential causal relations in a unified reasoning framework, so that both direct and latent relations are considered. To reduce the effect of stochasticity and uncertainty on reasoning: 1) we extend the deterministic Transformer to a probabilistic Transformer by treating the uncertain factors as Gaussian random variables and explicitly modeling their distributions; 2) we introduce a distillation mechanism between the posterior branch, which has complete observations, and the prior branch, which has incomplete observations, to transfer posterior knowledge. Evaluation results on the benchmark datasets consistently demonstrate the commendable performance of our UPD-Trans, with significant improvements after latent relation modeling and uncertainty modeling.
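The two uncertainty-handling ideas in the abstract — modeling uncertain factors as Gaussian random variables, and distilling knowledge from a posterior branch into a prior branch — can be sketched roughly as follows. This is a minimal illustration under assumed shapes and names (the projection weights, latent dimensions, and loss form are illustrative assumptions, not the paper's actual implementation):

```python
import numpy as np

def gaussian_params(features, w_mu, w_logvar):
    # Project branch features to the mean and log-variance of a
    # diagonal Gaussian over the latent uncertain factors.
    return features @ w_mu, features @ w_logvar

def reparameterize(mu, logvar, rng):
    # Sample z = mu + sigma * eps (reparameterization trick), so the
    # sampling step stays differentiable in a real training setup.
    eps = rng.standard_normal(mu.shape)
    return mu + np.exp(0.5 * logvar) * eps

def kl_divergence(mu_q, logvar_q, mu_p, logvar_p):
    # KL( N(mu_q, var_q) || N(mu_p, var_p) ), summed over latent dims.
    # Used here as the distillation signal pulling the prior branch
    # (incomplete observations) toward the posterior branch
    # (complete observations).
    var_q, var_p = np.exp(logvar_q), np.exp(logvar_p)
    return 0.5 * np.sum(
        logvar_p - logvar_q + (var_q + (mu_q - mu_p) ** 2) / var_p - 1.0
    )

# Toy usage: two branches share projection weights (an assumption).
rng = np.random.default_rng(0)
w_mu, w_logvar = rng.standard_normal((8, 4)), rng.standard_normal((8, 4))
posterior_feat = rng.standard_normal(8)   # encodes complete observations
prior_feat = rng.standard_normal(8)       # encodes incomplete observations

mu_q, lv_q = gaussian_params(posterior_feat, w_mu, w_logvar)
mu_p, lv_p = gaussian_params(prior_feat, w_mu, w_logvar)
z = reparameterize(mu_p, lv_p, rng)       # latent sample used for reasoning
distill_loss = kl_divergence(mu_q, lv_q, mu_p, lv_p)
```

Minimizing `distill_loss` during training would transfer posterior knowledge into the prior branch, which is the only branch available at inference time when observations are incomplete.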
Primary Subject Area: [Content] Media Interpretation
Secondary Subject Area: [Content] Vision and Language, [Engagement] Summarization, Analytics, and Storytelling
Relevance To Conference: Visual abductive reasoning aims to find the most plausible explanation for incomplete observations, and it suffers from inherent uncertainties and ambiguities, which stem mainly from latent causal relations, incomplete observations, and the reasoning process itself. It amounts to a multimodal reasoning problem, where causal relationships encompass visual-visual, linguistic-linguistic, and visual-linguistic connections.
Submission Number: 1929