Abstract: Image captioning, which generates natural language descriptions of the visual content of an image, is a crucial task in vision-language research. Previous models typically address this task by aligning machine generation with human judgments through statistical fitting on existing datasets. While these models describe ordinary images proficiently, they often struggle with images in which parts of the content have been obscured or edited, cases that humans handle effortlessly. Their weaknesses, including hallucination and limited interpretability, lead to performance declines in scenarios with shifted association patterns. In this paper, we present a generic image captioning framework that leverages causal inference to make existing models better at interventional tasks and counterfactually explainable. Specifically, our approach comprises two variants that utilize either the total effect or the natural direct effect. Incorporating these effects into training enables models to handle counterfactual scenarios and thereby generalize better. Extensive experiments on various datasets demonstrate that our method effectively reduces hallucination and increases the model's faithfulness to the image, with high portability to both small-scale and large-scale image-to-text models.
Primary Subject Area: [Content] Vision and Language
Secondary Subject Area: [Content] Media Interpretation, [Experience] Multimedia Applications, [Generation] Generative Multimedia
Relevance To Conference: This paper proposes using counterfactual causal effects to model the relationship between vision and language.
We employ two counterfactual regularization methods based on the concepts of total effect (TE) and natural direct effect (NDE) to improve image captioning models.
Experimental results consistently show that our methods outperform baselines in alleviating hallucination across different backbones and datasets.
The NDE method performs best at generating faithful captions for counterfactual images and at accurately identifying the image regions most relevant to a phrase in a caption.
Applying these counterfactual causal effects to image-to-text tasks such as image captioning can improve the interpretability and robustness of vision-language models, and help the community build trustworthy models in the future.
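For reference, the total effect (TE) and natural direct effect (NDE) invoked above have standard definitions in causal mediation analysis (following Pearl); with treatment $X$, mediator $M$, outcome $Y$, and $x$/$x^{*}$ denoting the treatment and reference values, they read as follows (how this submission instantiates $X$, $M$, and $Y$, e.g., as image, visual features, and caption, is an assumption on our part, not stated in this form):

```latex
\mathrm{TE} = \mathbb{E}\!\left[Y_{x}\right] - \mathbb{E}\!\left[Y_{x^{*}}\right],
\qquad
\mathrm{NDE} = \mathbb{E}\!\left[Y_{x,\,M_{x^{*}}}\right] - \mathbb{E}\!\left[Y_{x^{*}}\right]
```

Here $Y_{x,\,M_{x^{*}}}$ is the outcome under treatment $x$ while the mediator is held at the value it would take under the reference $x^{*}$, which is what isolates the direct (non-mediated) pathway.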
Supplementary Material: zip
Submission Number: 4074