Keywords: Urdu NLP; Low-Resource Languages; LLM Benchmarking; Safety and Moderation; Efficient Language Models; Zero-Shot Evaluation
Abstract: Large Language Models (LLMs) have driven rapid advances in natural language processing (NLP); however, low-resource languages such as Urdu, spoken by over 230 million people, remain severely underrepresented, limiting equitable deployment and widening multilingual performance gaps. Existing Urdu benchmarks are fragmented or translation-dependent, and lack a unified framework for evaluating emerging efficient models on native, culturally grounded tasks.
We present $UrduBench$, a comprehensive benchmark comprising 20 datasets across 17 tasks for Urdu LLM evaluation, covering natural language understanding, safety-critical moderation, and generation. We also release a modular, open-source evaluation framework enabling reproducible zero-shot evaluation with uniform prompting and metrics.
Using this framework, we benchmark 13 open-weight instruction-tuned LLMs spanning nano (<1B), small (1–3B), and medium (up to 7B) parameter scales, focusing on models that are computationally efficient and suitable for deployment in low-resource settings. Results show pronounced performance disparities across model sizes and task categories, with persistent difficulties in Urdu sequence labeling and generation, and consistent gains from larger multilingual models.
Paper Type: Long
Research Area: Resources and Evaluation
Research Area Keywords: benchmarking, evaluation methodologies, evaluation, automatic evaluation
Contribution Types: Approaches to low-resource settings, Approaches to low-compute settings (efficiency), Publicly available software and/or pre-trained models
Languages Studied: Urdu
Submission Number: 9930