Abstract: Pre-trained vision-language (V-L) models exhibit strong generalization capabilities in rumor detection. However, their reliance on single-modality prompts, either linguistic or visual, limits their ability to dynamically adjust both representation spaces during detection. To address this limitation, we propose a multimodal rumor detection framework that applies prompt learning in both the vision and language branches to better align their representations. Inspired by recent advances in parameter-efficient tuning of large language models, we introduce a small set of trainable parameters in the input space while keeping the model backbone frozen. In addition, we inject distinct prompts at successive early stages of the encoders, progressively modeling the relationships between features and enhancing comprehensive context learning. Extensive experiments on two real-world multimodal datasets demonstrate our framework's superior ability to distinguish rumors from facts.
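
To make the deep-prompting idea concrete, the following PyTorch sketch shows one plausible way to inject a distinct set of trainable prompt tokens at each of the first few layers of a frozen transformer encoder, as the abstract describes. It is a minimal illustration under our own assumptions; the class name PromptedEncoder and the parameters prompt_len and depth are hypothetical and not taken from the authors' implementation.

import torch
import torch.nn as nn

class PromptedEncoder(nn.Module):
    """Wraps frozen transformer blocks and prepends distinct trainable
    prompt tokens at each of the first `depth` layers (deep prompting).
    NOTE: illustrative sketch only; names and defaults are assumptions."""

    def __init__(self, layers, dim, prompt_len=4, depth=3):
        super().__init__()
        self.layers = layers                      # frozen transformer blocks
        for p in self.layers.parameters():
            p.requires_grad = False               # keep the backbone frozen
        # A separate learnable prompt per early layer, per the abstract's
        # "distinct prompts at successive early stages".
        self.prompts = nn.ParameterList(
            [nn.Parameter(torch.randn(prompt_len, dim) * 0.02)
             for _ in range(depth)]
        )
        self.prompt_len = prompt_len
        self.depth = depth

    def forward(self, x):                         # x: (batch, seq, dim)
        for i, layer in enumerate(self.layers):
            if i < self.depth:
                if i > 0:
                    # Discard the previous layer's prompts before
                    # inserting this layer's fresh ones.
                    x = x[:, self.prompt_len:, :]
                p = self.prompts[i].expand(x.size(0), -1, -1)
                x = torch.cat([p, x], dim=1)      # prepend prompt tokens
            x = layer(x)
        return x

In the framework described above, one such wrapper would be applied to the vision branch and one to the language branch, with only the prompt parameters optimized. A usage sketch:

dim = 512
blocks = nn.ModuleList(
    [nn.TransformerEncoderLayer(d_model=dim, nhead=8, batch_first=True)
     for _ in range(6)]
)
enc = PromptedEncoder(blocks, dim)
out = enc(torch.randn(2, 16, dim))                # (2, prompt_len + 16, dim)
optim = torch.optim.Adam(
    (p for p in enc.parameters() if p.requires_grad), lr=1e-3
)                                                 # trains prompts only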