Abstract: The integration of large language models (LLMs) with computer vision (CV) has significantly advanced artificial intelligence (AI), enabling machines to better understand and analyze visual data. In particular, large vision-language models (LVLMs) have strengthened cybersecurity efforts, especially in deepfake detection. This survey reviews recent developments in the use of LVLMs for deepfake detection, highlighting their emerging role in addressing the limitations of traditional unimodal techniques. We provide an overview of state-of-the-art LVLM architectures such as CLIP, BLIP-2, and LLaVA, and analyze how they enable cross-modal reasoning to identify inconsistencies between visual and textual content. The paper also summarizes key benchmark datasets, outlines ongoing challenges, and identifies open research directions, making it a valuable entry point for researchers exploring multimodal approaches to deepfake detection.
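To make the cross-modal reasoning idea concrete, the sketch below shows the generic CLIP-style zero-shot scoring pattern such surveys describe: an image embedding is compared against text-prompt embeddings (e.g., "a real photo" vs. "a deepfake image") by cosine similarity, and a softmax over the similarities yields class probabilities. This is an illustrative assumption, not the survey's method: the embeddings here are random placeholders standing in for the outputs of a real vision-language encoder, and the prompt wording and temperature value are hypothetical.

```python
import numpy as np

def cosine_sim(a, b):
    # Cosine similarity between two embedding vectors.
    return float(a @ b / (np.linalg.norm(a) * np.linalg.norm(b)))

def clip_style_score(image_emb, text_embs, temperature=0.07):
    # CLIP-style zero-shot scoring: similarity of the image embedding
    # to each text-prompt embedding, scaled and passed through softmax.
    sims = np.array([cosine_sim(image_emb, t) for t in text_embs])
    logits = sims / temperature
    exp = np.exp(logits - logits.max())  # numerically stable softmax
    return exp / exp.sum()

# Placeholder embeddings (in practice these come from a model such as CLIP).
rng = np.random.default_rng(0)
dim = 512
image_emb = rng.normal(size=dim)
text_embs = [rng.normal(size=dim),   # prompt: "a real photo"
             rng.normal(size=dim)]   # prompt: "a deepfake image"
probs = clip_style_score(image_emb, text_embs)
```

In a real pipeline the placeholder vectors would be replaced by the encoder outputs of an LVLM, and the class whose prompt scores highest would be taken as the prediction; the scoring logic itself is unchanged.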
External IDs: dblp:conf/iciap/KhaliqM25