Towards Interpretable Multimodal Fact Verification: A Hierarchical Prompting Framework with Large Vision–Language Models
Keywords: Fact Verification, Large Vision-Language Models, Explainability
Abstract: The rapid spread of multimodal misinformation on online platforms poses significant challenges to automated fact verification, as textual claims are often tightly coupled with potentially misleading visual content. Existing multimodal fact verification approaches primarily rely on supervised, small-scale models, which exhibit limited reasoning ability and poor generalization in real-world scenarios. Although large vision–language models (LVLMs) demonstrate strong cross-modal understanding, they are not inherently optimized for fine-grained verification tasks and often produce unstable judgments when directly prompted.
We propose **H**ierarchical **P**rompting for **I**nterpretable **M**ultimodal fact verification (HPIM), a framework that guides a large vision–language model through a coarse-to-fine reasoning process. The model is first prompted for a macro-level analysis of the claim and evidence, then for a micro-level, explanation-oriented analysis that leverages structured factual elements. The framework then fuses textual, visual, and analytical representations to predict veracity, and feeds this prediction back into the model to generate explanations grounded in the evidence. Experiments on a public benchmark demonstrate strong verification performance and improved interpretability. Code is available at: [https://anonymous.4open.science/r/HPIM-74D9](https://anonymous.4open.science/r/HPIM-74D9).
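The coarse-to-fine pipeline described in the abstract can be sketched as follows. This is a minimal illustration, not the paper's implementation: `query_lvlm`, the prompt wording, and the two-way label set are hypothetical placeholders (a stub stands in for a real vision–language model call so the control flow runs end to end).

```python
def query_lvlm(prompt: str, image: object = None) -> str:
    """Stub for a large vision-language model call (e.g., an API client).

    A real system would send the prompt and image to an LVLM and return
    its generated text; here we echo the prompt so the pipeline is runnable.
    """
    return f"[LVLM response to: {prompt[:40]}...]"


def verify_claim(claim: str, evidence_text: str, image: object = None) -> dict:
    """Hypothetical coarse-to-fine verification loop in the spirit of HPIM."""
    # Stage 1: macro-level analysis of the claim against the evidence.
    macro = query_lvlm(
        "Give a high-level analysis of whether the evidence supports the claim.\n"
        f"Claim: {claim}\nEvidence: {evidence_text}",
        image,
    )

    # Stage 2: micro-level, explanation-oriented analysis over structured
    # factual elements (entities, time, place, event) of the claim.
    micro = query_lvlm(
        "Check each factual element (entities, time, place, event) of the "
        f"claim against the evidence.\nClaim: {claim}\nMacro analysis: {macro}",
        image,
    )

    # Stage 3: fuse textual, visual, and analytical signals into a verdict.
    # A single prompt stands in for the paper's fusion of representations.
    verdict = query_lvlm(
        "Given the analyses below, answer SUPPORTED or REFUTED.\n"
        f"Macro: {macro}\nMicro: {micro}",
        image,
    )

    # Stage 4: feed the prediction back to produce an evidence-grounded
    # explanation of the verdict.
    explanation = query_lvlm(
        f"Justify the verdict '{verdict}' by citing the evidence.\n"
        f"Claim: {claim}\nEvidence: {evidence_text}",
        image,
    )

    return {"verdict": verdict, "explanation": explanation,
            "macro": macro, "micro": micro}
```

Replacing the stub with a real LVLM client (and the paper's actual prompts) would recover the described behaviour; the key design point is that each stage conditions on the previous stage's output rather than prompting the model once.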
Paper Type: Long
Research Area: NLP Applications
Research Area Keywords: multimodal applications, fact checking
Contribution Types: Model analysis & interpretability, Data analysis
Languages Studied: English
Submission Number: 8727