CODE-PROMPT: Evaluating Code-LLMs for Their NLP Abilities via Code-Aligned Prompts

Authors: Anonymous (ACL ARR 2026 January Submission 11)

21 Dec 2025 (modified: 20 Mar 2026) · ACL ARR 2026 January Submission · CC BY 4.0
Keywords: code models, evaluation, zero-shot, few-shot, multilingual, numeric reasoning
Abstract: Large language models pre-trained on code (Code-LLMs) have achieved remarkable performance on coding tasks. Despite their success in processing programming languages, their application to natural language processing has largely been limited to narrow tasks and ad-hoc model selection. In this paper, we present \textsc{Code-Prompt}, a comprehensive prompt-based framework for benchmarking Code-LLMs across ten NLP tasks. Using \textsc{Code-Prompt}, we evaluate 13 Code-LLMs with code-aligned prompts and compare them against their natural-language counterparts (NL-LLMs) prompted in natural language. Our results show that Code-LLMs perform on par with NL-LLMs across a range of NLP tasks, while producing more format-consistent generations with less redundancy. Our cross-lingual experiments further indicate that knowledge in Code-LLMs transfers well across programming languages, whereas performance across human languages still varies significantly. All our code and data will be made publicly available at \url{anonymous_url}.
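To make "code-aligned prompting" concrete, the Python sketch below shows one plausible way an NLP task can be recast as a code-completion problem for a Code-LLM. This is a minimal illustration only: the function name, docstring framing, and few-shot assertions are hypothetical assumptions, not the paper's actual prompt templates.

# Illustrative sketch of a code-aligned prompt for sentiment classification.
# All names (build_code_prompt, sentiment) and the assertion-style few-shot
# format are assumptions for exposition, not the paper's released templates.

def build_code_prompt(text: str) -> str:
    """Wrap an NLP input as a code-completion problem for a Code-LLM."""
    return (
        'def sentiment(review: str) -> str:\n'
        '    """Return "positive" or "negative" for the given review."""\n'
        '\n'
        '# Few-shot examples rendered as assertions the model can imitate.\n'
        'assert sentiment("A touching, beautifully acted film.") == "positive"\n'
        'assert sentiment("Dull plot and wooden dialogue.") == "negative"\n'
        f'assert sentiment({text!r}) == "'  # the model completes the label
    )

print(build_code_prompt("An instant classic."))

The design intuition is that phrasing the task in the model's native syntax (function signatures, docstrings, assertions) constrains the output format, which is consistent with the abstract's observation that Code-LLMs produce more format-consistent generations.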
Paper Type: Long
Research Area: Code Models
Research Area Keywords: code language models, code completion, multi-language code models, evaluation of code models, prompting
Contribution Types: Model analysis & interpretability, Approaches to low-resource settings, Approaches to low-compute settings (efficiency), Data resources
Languages Studied: Arabic, Bulgarian, German, Greek, English, Spanish, French, Hindi, Russian, Swahili, Thai, Turkish, Urdu, Vietnamese, Chinese
Submission Number: 11