Urdu-GLUE: A Comprehensive Benchmark and Dynamic Prompt-Based Fine-Tuning for Urdu Language Understanding
Keywords: Urdu-GLUE, Low-Resource Languages, Natural Language Understanding, Prompt-Based Fine-Tuning, ADAPT, Urdu NLP
Abstract: Language understanding benchmarks have driven significant progress in Natural Language Processing (NLP). However, most benchmarks focus on high-resource languages such as English, leaving low-resource languages underserved. Despite being spoken by over 246 million people worldwide, Urdu lacks comprehensive evaluation resources. To address this gap, we introduce Urdu-GLUE, the first comprehensive benchmark for Urdu language understanding. The benchmark comprises ten diverse tasks, spanning single-sentence classification, similarity and paraphrase detection, natural language inference, question answering, and sequence labeling. To cover all benchmark tasks, we created four new datasets: (1) U-CoLA for grammatical acceptability, (2) U-WNLI for Winograd-style coreference, (3) U-STS-B for semantic similarity, and (4) U-XNLI, a preprocessed XNLI dataset. To ensure quality, three native Urdu speakers fluent in English manually verified each dataset. To address the low-resource status of Urdu, we also introduce ADAPT (Adaptive Dynamic Prompt Template), the first dynamic prompt-based fine-tuning strategy for encoder-based models. ADAPT systematically explores various prompt templates during training and automatically identifies the most effective template for inference. We evaluate multiple fine-tuning (FT) strategies, including standard FT, prompt-based FT, LoRA, QLoRA, and ADAPT, across three experimental settings: zero-shot, 16-shot, and an 80/20 split. Our experiments demonstrate that prompt-based FT methods consistently outperform standard FT in few-shot settings. Our findings provide practical insights for low-resource NLP research. To facilitate future work, we publicly\footnote{https://anonymous.4open.science/r/Urdu-Glue-7D78/README.md} release all datasets and code.
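The core loop behind ADAPT, as described in the abstract, can be sketched in a few lines: score each candidate prompt template during training and retain the highest-scoring one for inference. This is a minimal illustrative sketch only, not the authors' implementation; the function names, the `{sentence}` placeholder convention, and the toy keyword classifier are all assumptions introduced for illustration.

```python
def fill(template, sentence):
    # Insert the input sentence into a cloze-style template
    # (hypothetical "{sentence}" placeholder convention).
    return template.replace("{sentence}", sentence)

def score_template(template, dev_set, classify):
    # Fraction of dev examples the classifier labels correctly
    # when inputs are wrapped in this template.
    correct = sum(classify(fill(template, x)) == y for x, y in dev_set)
    return correct / len(dev_set)

def select_template(templates, dev_set, classify):
    # Keep the template with the highest dev-set accuracy for inference.
    return max(templates, key=lambda t: score_template(t, dev_set, classify))

# Toy demonstration: a stand-in classifier that only sees the filled prompt.
templates = ["Is this acceptable: {sentence}", "Label:"]
dev_set = [("good sentence", 1), ("bad one", 0)]
classify = lambda prompt: 1 if "good" in prompt else 0
best = select_template(templates, dev_set, classify)
```

In practice the scoring step would run the fine-tuned encoder over a validation split rather than a keyword heuristic, but the selection logic is the same.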
Paper Type: Long
Research Area: Low-resource Methods for NLP
Research Area Keywords: benchmarking, multilingual benchmarks, less-resourced languages, NLP datasets, evaluation methodologies, few-shot learning, data-efficient training
Contribution Types: Approaches to low-resource settings, Data resources
Languages Studied: Urdu
Submission Number: 5359