Keywords: idiom translation, low-resource MT, automatic evaluation, multilingual corpora, corpus creation, Urdu–English
Abstract: We present a comprehensive evaluation of Urdu–English idiomatic translation, introducing a parallel benchmark encompassing Native and Roman Urdu across translation, paraphrasing, idiom span detection, and idiomatic back-translation. Eight prompting strategies, including literal, cultural, idiomatic, and few-shot prompts, are used to assess multiple open-source LLMs and NMT systems. Performance is evaluated using BLEU, ChrF, BERTScore, COMET, XCOMET, ROUGE, Levenshtein distance, and multilingual embedding cosine similarities (LASER, LaBSE, USE). Results indicate that LLMs outperform NMT systems in preserving idiomatic meaning, with cultural and idiomatic prompts yielding the highest semantic fidelity. Few-shot prompting further improves idiom handling. Native Urdu consistently achieves higher scores than Roman Urdu across all tasks, highlighting the influence of script on translation quality. This study provides the first multi-metric, cross-script benchmark for idiomatic Urdu–English translation, offering insights into model behavior, prompt sensitivity, and the challenges of Romanized input.
Paper Type: Long
Research Area: Multilinguality and Language Diversity
Research Area Keywords: multilingual benchmarks, multilingual evaluation, less-resourced languages, corpus creation, benchmarking, datasets for low resource languages, few-shot/zero-shot MT, multi-word expressions
Contribution Types: Model analysis & interpretability, Approaches to low-resource settings, Publicly available software and/or pre-trained models, Data resources
Languages Studied: Urdu, English
Submission Number: 1611