Abstract: Given the widespread impact of social media, efficiently detecting out-of-context misinformation, where a real image is paired with a fake caption, has become imperative. Toward this goal, we propose a novel framework, FRAUD-Net, which incorporates several unexplored aspects of this task in the model formulation. Since the image and textual evidence retrieved by web search using the input image-text pair plays a crucial role in this task, we build a generalized model that can handle missing or variable numbers of evidence items, as expected in real-world scenarios. We achieve this by effectively utilizing the common semantic latent space of Visual Language Models together with transformer attention blocks. In addition, we observe that explicit domain information not only allows the model to learn better but also lets it adjust dynamically to diverse domains (e.g., politics, healthcare) during testing. We also propose a mechanism to handle noisy training data by analyzing prediction consistency, which aids model training. Extensive experiments on the large-scale NewsClippings and Verite benchmark datasets showcase the effectiveness of the proposed framework compared to state-of-the-art techniques for this challenging task.