What Do Prompts Reveal About Model Capabilities in Low-Resource Languages?

Published: 27 Jan 2026 · Last Modified: 18 Feb 2026 · AfricaNLP 2026 · CC BY 4.0
Abstract: Large language models (LLMs) are highly sensitive to prompt design, yet benchmark evaluations typically rely on static, hand-crafted instructions that may underestimate true model capability. In this work, we study GEPA, a reflective prompt optimization algorithm, as an inference-time optimization strategy across multiple multilingual benchmarks spanning diverse African languages. GEPA uses a more capable model as a reflection agent to iteratively optimize prompts under strict compute and latency budgets, without updating model parameters. Results show that reflective prompt optimization consistently improves performance across tasks, enabling smaller models to match or outperform larger models when evaluated with optimized instructions. We find that prompt evolution functions as a form of textual policy learning, improving not only task accuracy but also output structure and formatting, factors that are critical for reliable model evaluation. Qualitative analysis further shows that optimized prompts elicit more stable model behavior. We characterize the trade-off between optimization cost and inference latency by measuring prompt token growth, and show that modest increases in prompt length can yield substantial performance gains. Based on these findings, we argue that benchmark evaluations should report both baseline and prompt-optimized results to more faithfully reflect model capabilities, particularly in multilingual and low-resource settings.
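To make the reflective loop described in the abstract concrete, here is a minimal Python sketch, under stated assumptions: `task_fn` (the smaller model under evaluation), `reflect_fn` (the larger reflection model), `score_fn` (the task metric), and `budget` are illustrative placeholder names, not the paper's interface, and the greedy accept step below stands in for GEPA's actual candidate-selection strategy.

```python
"""Minimal sketch of a reflective prompt-optimization loop in the spirit of
GEPA. All callables are illustrative placeholders, not the paper's API; the
greedy acceptance rule is a simplification of GEPA's search."""

from typing import Callable, Sequence


def optimize_prompt(
    seed_prompt: str,
    examples: Sequence[tuple[str, str]],     # (input, reference) pairs
    task_fn: Callable[[str, str], str],      # small model: (prompt, input) -> output
    reflect_fn: Callable[[str, str], str],   # reflection model: (prompt, feedback) -> revised prompt
    score_fn: Callable[[str, str], float],   # metric: (output, reference) -> score in [0, 1]
    budget: int = 8,                         # hard cap on reflection rounds (compute budget)
) -> tuple[str, float]:
    """Iteratively revise `seed_prompt` from textual failure feedback; keep the best."""

    def evaluate(prompt: str) -> tuple[float, str]:
        # Run the small model on every example and collect concrete failures
        # as natural-language feedback for the reflection model to read.
        scores, failures = [], []
        for x, ref in examples:
            y = task_fn(prompt, x)
            s = score_fn(y, ref)
            scores.append(s)
            if s < 1.0:
                failures.append(f"input: {x!r}\noutput: {y!r}\nexpected: {ref!r}")
        return sum(scores) / len(scores), "\n---\n".join(failures)

    best_prompt = seed_prompt
    best_score, feedback = evaluate(best_prompt)
    for _ in range(budget):
        if best_score == 1.0 or not feedback:
            break
        # The reflection model reads the failures and proposes a revised
        # instruction; model weights are never updated, only the prompt text.
        candidate = reflect_fn(best_prompt, feedback)
        cand_score, cand_feedback = evaluate(candidate)
        if cand_score > best_score:  # greedy: keep candidates only if they improve
            best_prompt, best_score, feedback = candidate, cand_score, cand_feedback
    return best_prompt, best_score
```

The `budget` cap mirrors the abstract's strict compute and latency constraint: optimization cost grows with the number of reflection rounds, while inference cost grows only with the length of the final optimized prompt.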
Submission Number: 24