FKA-Owl: Advancing Multimodal Fake News Detection through Knowledge-Augmented LVLMs

Published: 20 Jul 2024 · Last Modified: 21 Jul 2024 · MM 2024 Poster · License: CC BY 4.0
Abstract: The massive generation of multimodal fake news involving both text and images exhibits substantial distribution discrepancies, prompting the need for generalized detectors. However, the insulated nature of training restricts classical detectors from acquiring open-world facts. While Large Vision-Language Models (LVLMs) encode rich world knowledge, they are not inherently tailored to combating fake news and struggle to comprehend local forgery details. In this paper, we propose FKA-Owl, a novel framework that leverages forgery-specific knowledge to augment LVLMs, enabling them to reason about manipulations effectively. The augmented forgery-specific knowledge comprises the semantic correlation between text and images and artifact traces in image manipulation. To inject these two kinds of knowledge into the LVLM, we design two specialized modules that establish their respective representations. The encoded knowledge embeddings are then incorporated into the LVLM. Extensive experiments on the public benchmark demonstrate that FKA-Owl achieves superior cross-domain performance compared to previous methods. Code will be made publicly available.
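To make the abstract's architecture concrete, below is a minimal sketch of the knowledge-augmentation idea: two modules encode semantic correlation and artifact traces, and their outputs are projected into the LVLM's embedding space. All names here (ForgeryKnowledgeAugmenter, semantic_encoder, artifact_encoder, the projection layers) and the specific layer choices (cross-attention, an MLP) are illustrative assumptions, not the authors' released implementation.

```python
# Hypothetical sketch of forgery-specific knowledge augmentation,
# based only on the abstract's description. Not the official FKA-Owl code.
import torch
import torch.nn as nn

class ForgeryKnowledgeAugmenter(nn.Module):
    """Encodes two kinds of forgery-specific knowledge and maps them
    into the LVLM token-embedding space, where they can be prepended
    to the usual visual/text tokens."""

    def __init__(self, vision_dim: int, text_dim: int, lvlm_dim: int):
        super().__init__()
        # Module 1 (assumed form): models semantic correlation between
        # image and text features via cross-attention.
        self.text_proj = nn.Linear(text_dim, vision_dim)
        self.semantic_encoder = nn.MultiheadAttention(
            embed_dim=vision_dim, num_heads=8, batch_first=True
        )
        # Module 2 (assumed form): extracts artifact-trace cues from
        # image features with a small MLP.
        self.artifact_encoder = nn.Sequential(
            nn.Linear(vision_dim, vision_dim),
            nn.GELU(),
            nn.Linear(vision_dim, vision_dim),
        )
        # Projections of both knowledge embeddings into the LVLM space.
        self.semantic_proj = nn.Linear(vision_dim, lvlm_dim)
        self.artifact_proj = nn.Linear(vision_dim, lvlm_dim)

    def forward(self, image_feats: torch.Tensor, text_feats: torch.Tensor):
        # image_feats: (B, Nv, vision_dim); text_feats: (B, Nt, text_dim)
        q = self.text_proj(text_feats)
        semantic, _ = self.semantic_encoder(q, image_feats, image_feats)
        artifact = self.artifact_encoder(image_feats)
        # Returned embeddings would be concatenated with the LVLM's
        # ordinary input tokens before the language model.
        return self.semantic_proj(semantic), self.artifact_proj(artifact)
```

Under this reading, the LVLM backbone stays frozen or lightly tuned while the two lightweight modules supply the local forgery cues the abstract says LVLMs otherwise miss; the exact injection point and training objective are specified in the paper itself.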
Primary Subject Area: [Experience] Multimedia Applications
Secondary Subject Area: [Content] Vision and Language
Relevance To Conference: This work proposes a novel framework that augments large vision-language models (LVLMs) with forgery-specific knowledge for manipulation reasoning while inheriting their extensive world knowledge as a complement. Its three main contributions to multimedia/multimodal processing are as follows: (1) It advances research in multimedia forensics and helps safeguard the authenticity of multimedia content. (2) It proposes a forgery-specific knowledge augmentation method that compensates for the shortcomings of LVLMs in combating multimodal fake news. By learning representations of semantic correlation and artifact traces, the approach strengthens the model's ability to discriminate subtle cross-modal differences and intrinsic image discrepancies, thereby improving its efficacy in distinguishing authentic from manipulated multimedia content. (3) It can serve as inspiration for other multimodal tasks: establishing domain knowledge and aligning multimodal representations are effective ways to leverage large vision-language models.
Supplementary Material: zip
Submission Number: 2509