Fake News Classification in Urdu: A Domain Adaptation Approach for a Low-Resource Language

TMLR Paper6975 Authors

12 Jan 2026 (modified: 04 Feb 2026) · Under review for TMLR · CC BY 4.0
Abstract: Misinformation on social media is a widely acknowledged problem, and researchers worldwide are actively engaged in detecting it. However, low-resource languages such as Urdu have received limited attention in this area. A natural approach is to take a multilingual pretrained language model and fine-tune it on a downstream classification task such as misinformation detection, but these models struggle with domain-specific terms, leading to suboptimal performance. To address this, we investigate the effectiveness of domain adaptation before fine-tuning for fake news classification in Urdu, employing a staged training approach to improve model generalization. We evaluate two widely used multilingual models, XLM-RoBERTa and mBERT, applying domain-adaptive pretraining on a publicly available Urdu news corpus. Experiments on four publicly available Urdu fake news datasets show that domain-adapted XLM-R generally outperforms its vanilla counterpart, while domain-adapted mBERT yields mixed results. These findings highlight the varying impact of domain adaptation across multilingual architectures in low-resource settings. We release our domain-adapted models, code, and datasets at URL withheld.
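The staged approach the abstract describes — continued masked-language-model pretraining on an in-domain Urdu news corpus, followed by fine-tuning for fake news classification — can be sketched with the Hugging Face `transformers` Trainer API. This is a minimal illustrative sketch, not the authors' released code: the hyperparameters, output directories, and the choice of `xlm-roberta-base` as the starting checkpoint are assumptions, and the datasets are left as parameters.

```python
def domain_adapt(base_model="xlm-roberta-base", corpus=None,
                 out_dir="xlmr-urdu-adapted"):
    """Stage 1 (sketch): continued MLM pretraining on an in-domain
    Urdu news corpus. `corpus` is assumed to be a tokenized dataset."""
    # Imports are local so the sketch can be read/loaded without the
    # (heavy) transformers dependency installed.
    from transformers import (AutoModelForMaskedLM, AutoTokenizer,
                              DataCollatorForLanguageModeling,
                              Trainer, TrainingArguments)
    tokenizer = AutoTokenizer.from_pretrained(base_model)
    model = AutoModelForMaskedLM.from_pretrained(base_model)
    # Standard BERT-style masking rate; the paper's actual rate is unknown.
    collator = DataCollatorForLanguageModeling(tokenizer, mlm_probability=0.15)
    args = TrainingArguments(output_dir=out_dir, num_train_epochs=3,
                             per_device_train_batch_size=16)
    Trainer(model=model, args=args, train_dataset=corpus,
            data_collator=collator).train()
    model.save_pretrained(out_dir)
    tokenizer.save_pretrained(out_dir)
    return out_dir

def fine_tune(adapted_dir, train_dataset, num_labels=2):
    """Stage 2 (sketch): fine-tune the domain-adapted encoder for
    binary real-vs-fake news classification."""
    from transformers import (AutoModelForSequenceClassification,
                              Trainer, TrainingArguments)
    model = AutoModelForSequenceClassification.from_pretrained(
        adapted_dir, num_labels=num_labels)
    args = TrainingArguments(output_dir=adapted_dir + "-finetuned",
                             num_train_epochs=3)
    Trainer(model=model, args=args, train_dataset=train_dataset).train()
    return model
```

The same two stages would be repeated with `bert-base-multilingual-cased` for the mBERT comparison; only the checkpoint name changes.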
Submission Type: Regular submission (no more than 12 pages of main content)
Changes Since Last Submission: N/A
Assigned Action Editor: ~Soma_Biswas1
Submission Number: 6975