Towards Interpretable Multimodal Fact Verification: A Hierarchical Prompting Framework with Large Vision–Language Models
Keywords: Fact Verification, Large Vision-Language Models, Explainability
Abstract: The rapid spread of multimodal misinformation on online platforms poses significant challenges to automated fact verification, as textual claims are often tightly coupled with potentially misleading visual content. Existing multimodal fact verification approaches primarily rely on supervised, small-scale models, which exhibit limited reasoning ability and poor generalization in real-world scenarios. Although large vision–language models (LVLMs) demonstrate strong cross-modal understanding, they are not inherently optimized for fine-grained verification tasks and often produce unstable judgments when directly prompted.
We propose **H**ierarchical **P**rompting for **I**nterpretable **M**ultimodal fact verification (HPIM), a framework that guides a large vision–language model through a coarse-to-fine reasoning process. The model is first prompted for a macro-level analysis of the claim and evidence, then for a micro-level, explanation-oriented analysis that leverages structured factual elements. The framework then fuses textual, visual, and analytical representations to predict veracity, and feeds this prediction back into the model to generate explanations grounded in the evidence. Experiments on a public benchmark demonstrate strong verification performance and improved interpretability. Code is available at: [https://anonymous.4open.science/r/HPIM-74D9](https://anonymous.4open.science/r/HPIM-74D9).
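The coarse-to-fine pipeline described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: `query_lvlm`, the prompt wording, and the two-way label set are hypothetical placeholders (a stub stands in for a real vision–language model call so the control flow runs end to end).

```python
def query_lvlm(prompt: str, image: object = None) -> str:
    """Stub for a large vision-language model call (e.g., an API client).

    A real system would send the prompt and image to an LVLM and return
    its generated text; here we echo the prompt so the pipeline is runnable.
    """
    return f"[LVLM response to: {prompt[:40]}...]"


def verify_claim(claim: str, evidence_text: str, image: object = None) -> dict:
    """Hypothetical coarse-to-fine verification loop in the spirit of HPIM."""
    # Stage 1: macro-level analysis of the claim against the evidence.
    macro = query_lvlm(
        "Give a high-level analysis of whether the evidence supports the claim.\n"
        f"Claim: {claim}\nEvidence: {evidence_text}",
        image,
    )

    # Stage 2: micro-level, explanation-oriented analysis over structured
    # factual elements (entities, time, place, event) of the claim.
    micro = query_lvlm(
        "Check each factual element (entities, time, place, event) of the "
        f"claim against the evidence.\nClaim: {claim}\nMacro analysis: {macro}",
        image,
    )

    # Stage 3: fuse textual, visual, and analytical signals into a verdict.
    # A single prompt stands in for the paper's fusion of representations.
    verdict = query_lvlm(
        "Given the analyses below, answer SUPPORTED or REFUTED.\n"
        f"Macro: {macro}\nMicro: {micro}",
        image,
    )

    # Stage 4: feed the prediction back to produce an evidence-grounded
    # explanation of the verdict.
    explanation = query_lvlm(
        f"Justify the verdict '{verdict}' by citing the evidence.\n"
        f"Claim: {claim}\nEvidence: {evidence_text}",
        image,
    )

    return {"verdict": verdict, "explanation": explanation,
            "macro": macro, "micro": micro}
```

Replacing the stub with a real LVLM client (and the paper's actual prompts) would recover the described behaviour; the key design point is that each stage conditions on the previous stage's output rather than prompting the model once.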
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: multimodal applications, fact checking
Contribution Types: Model analysis & interpretability, Data analysis
Languages Studied: English
Submission Number: 8727