Abstract. State-of-the-art deepfake detection approaches rely on image-based features extracted via neural networks. While these approaches
trained in a supervised manner extract likely-fake features, they may fall
short in representing unnatural 'non-physical' semantic facial attributes
– blurry hairlines, double eyebrows, rigid eye pupils, or unnatural skin
shading. However, such facial attributes are easily perceived by humans
and used to discern the authenticity of an image based on human common sense. Furthermore, image-based feature extraction methods that
provide visual explanations via saliency maps can be hard to interpret for
humans. To address these challenges, we frame deepfake detection as a
Deepfake Detection VQA (DD-VQA) task and model human intuition by
providing textual explanations that describe common sense reasons for
labeling an image as real or fake. We introduce a new annotated dataset
and propose a Vision and Language Transformer-based framework for
the DD-VQA task. We also incorporate a text- and image-aware feature
alignment formulation to enhance multi-modal representation learning.
As a result, we improve upon existing deepfake detection models by integrating our learned vision representations, which reason over common-sense knowledge from the DD-VQA task. We provide extensive empirical
results demonstrating that our method enhances detection performance,
generalization ability, and language-based interpretability in the deepfake
detection task. Our dataset is available at https://github.com/RealityDefender/Research-DD-VQA.